CrochetBench - Benchmarking Vision-Language Models in the Crochet Domain

About CrochetBench

CrochetBench is a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, generate crochet pattern instructions, and produce compilable crochet procedures.

🧵 Domain-Specific Language (DSL)
Adopts the CrochetPARADE DSL as an intermediate representation

✅ Structural Validation
Enables functional evaluation through program execution

📊 Comprehensive Tasks
Covers stitch classification, instruction grounding, and translation

🎯 Executable Correctness
Evaluates models on executable precision beyond surface-level similarity

Benchmark Tasks

Task A

Stitch Recognition

Detect symbolic primitives in crochet images

F1, Precision, Recall | 6,009 samples
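
As a rough illustration of how precision, recall, and F1 can be scored for a multi-label task like Task A, the sketch below treats each image's stitches as a set of labels and micro-averages over all samples. The averaging scheme, the function name `micro_prf`, and the stitch labels are assumptions made for this example; the page does not say how CrochetBench aggregates these metrics.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-image label sets.

    Assumption: micro-averaging is used here for illustration only;
    the benchmark's own aggregation may differ.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # stitches predicted and present
        fp += len(p - g)   # stitches predicted but absent
        fn += len(g - p)   # stitches present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels only (common crochet abbreviations), not benchmark data.
gold = [{"sc", "dc"}, {"ch", "sl st"}]
pred = [{"sc", "hdc"}, {"ch", "sl st"}]
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

Using sets ignores how many times a stitch appears in an image, which matches a "which stitches are present" reading of the task; a count-sensitive variant would need multisets instead.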

Task B

Instruction Selection

Align visual evidence with candidate instructions

Accuracy | 6,003 samples

Task C

Instruction Generation

Generate natural language procedural instructions

BLEU, ROUGE, ChrF | 6,009 samples

Task D-step

Step-Level Formalization

Instruction-to-DSL translation for individual steps

Valid Pattern Rate | 119 samples

Task D-proj

Project-Level Formalization

Generate complete CrochetPARADE programs

Valid Pattern Rate | 100 samples
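
One plausible reading of Valid Pattern Rate for the two D tasks is the fraction of generated programs that a validator accepts. The sketch below assumes that reading. `valid_pattern_rate` and `toy_brackets_balanced` are hypothetical names, and the bracket check is only a stand-in: a real evaluation would call the CrochetPARADE toolchain, whose interface is not described on this page.

```python
from typing import Callable, Sequence


def valid_pattern_rate(programs: Sequence[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of generated programs accepted by the supplied validator."""
    if not programs:
        return 0.0
    return sum(1 for p in programs if is_valid(p)) / len(programs)


def toy_brackets_balanced(program: str) -> bool:
    """Stand-in validator (assumption): only checks that square brackets balance.

    This is NOT the CrochetPARADE validator; it exists so the example runs.
    """
    depth = 0
    for ch in program:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Toy strings for illustration, not real CrochetPARADE programs.
outputs = ["[a, b]*2", "[a, [b]", "c"]
rate = valid_pattern_rate(outputs, toy_brackets_balanced)
print(f"Valid Pattern Rate: {rate:.2%}")
```

Passing the validator as a callable keeps the metric independent of how validity is actually decided, so the toy check can be swapped for a real one without touching the rate computation.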

Models Evaluated

Open-Source

Salesforce BLIP-2 Flan-T5 XL
Google Gemma 3
Qwen2-VL
DeepSeek-VL

Closed-Source

GPT-4o
Gemini 2.5 Flash-Lite
Claude Sonnet 4
