CrochetBench 🧶

Can vision-language models move from describing to doing in the crochet domain?

About CrochetBench

CrochetBench is a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, generate crochet pattern instructions, and produce compilable crochet procedures.

  • 🧵
    Domain-Specific Language (DSL)
    Adopts the CrochetPARADE DSL as an intermediate representation
  • ✅
    Structural Validation
    Enables functional evaluation through program execution
  • 📊
    Comprehensive Tasks
    Covers stitch classification, instruction grounding, and translation
  • 🎯
    Executable Correctness
    Evaluates models on executable precision beyond surface-level similarity
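
As a rough illustration of execution-based scoring, the sketch below computes a valid pattern rate: the fraction of generated programs that a validator accepts. The function names, the example strings, and the bracket-balance check are all hypothetical stand-ins; a real evaluation would call the actual CrochetPARADE validator, whose interface is not described here.

```python
from typing import Callable, Iterable


def valid_pattern_rate(programs: Iterable[str],
                       is_valid: Callable[[str], bool]) -> float:
    """Return the fraction of programs accepted by the given validator."""
    programs = list(programs)
    if not programs:
        return 0.0
    return sum(1 for p in programs if is_valid(p)) / len(programs)


def toy_is_valid(program: str) -> bool:
    """Placeholder check (non-empty, balanced brackets only).

    This is NOT the CrochetPARADE grammar; it only stands in for a
    real validator so the example can run on its own.
    """
    closers = {")": "(", "]": "["}
    stack = []
    for ch in program:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack and bool(program.strip())


# Illustrative model outputs (made-up strings, not benchmark data).
outputs = ["ch 4, sl st", "[sc, inc]*6", "[sc, inc*6", ""]
rate = valid_pattern_rate(outputs, toy_is_valid)
print(f"Valid pattern rate: {rate:.2f}")
```

Passing the validator in as a callable keeps the metric independent of how validity is actually checked, so the toy check can be swapped for a real one without changing the scoring code.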

Benchmark Tasks

  • Task A: Stitch Recognition
    Detect symbolic primitives in crochet images
    Metrics: F1, Precision, Recall | 6,009 samples
  • Task B: Instruction Selection
    Align visual evidence with candidate instructions
    Metric: Accuracy | 6,003 samples
  • Task C: Instruction Generation
    Generate natural language procedural instructions
    Metrics: BLEU, ROUGE, ChrF | 6,009 samples
  • Task D-step: Step-Level Formalization
    Instruction-to-DSL translation for individual steps
    Metric: Valid Pattern Rate | 119 samples
  • Task D-proj: Project-Level Formalization
    Generate complete CrochetPARADE programs
    Metric: Valid Pattern Rate | 100 samples
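
For readers unfamiliar with the Task A metrics, here is a minimal sketch of precision, recall, and F1 for multi-label stitch recognition, where each image has a set of gold stitch labels and a set of predicted ones. Micro-averaging across samples is an assumption made for this example (the averaging scheme used by the benchmark is not stated here), and the function name and stitch labels are illustrative.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]],
              pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-sample label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two made-up samples: the first prediction swaps "ch" for "dc".
gold = [{"sc", "ch"}, {"dc", "sl st"}]
pred = [{"sc", "dc"}, {"dc", "sl st"}]
precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```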

Models Evaluated

Open-Source
  • Salesforce BLIP-2 Flan-T5 XL
  • Google Gemma 3
  • Qwen2-VL
  • DeepSeek-VL

Closed-Source
  • GPT-4o
  • Gemini 2.5 Flash-Lite
  • Claude Sonnet 4

Evaluate Your Models Today

Join the community advancing procedural reasoning in creative domains

Get Started