17K open-access synthesis recipes + an LLM-as-a-Judge benchmark to scale materials synthesis evaluation

February 23, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Ji Hoon Hong, Dong Won Jeon, Ga-Yeon Baek, Gyeong-Won Kwak, Dong-Hee Lee, Jisu Bae, Chihoon Lee, Yunseo Kim, Seon-Jin Choi, Jin-Seong Park, Sung Beom Cho, Hyunsouk Cho

Links

Abstract / PDF

Why It Matters For Business

A legal, large-scale recipe dataset plus automated LLM judging reduces costly expert review and speeds development of ML tools that propose lab-ready synthesis steps.

Summary TLDR

The authors release Open Materials Guide (OMG), a dataset of 17K expert-verified materials synthesis recipes, and AlchemyBench, an end-to-end benchmark that uses LLMs as automated judges to score synthesis predictions. They show that LLM-based scoring (especially GPT-4o variants) aligns substantially better with expert judgments than standard text-overlap metrics, and that retrieval-augmented generation (RAG) improves recipe quality (best trade-off at K=5). The dataset, code, and evaluation prompts are public, but the extraction and the LLM-based evaluation carry domain biases, and expert ratings themselves vary.

Problem Statement

Materials synthesis is still driven by trial-and-error and expert intuition. Existing text-mined datasets are often incomplete or noisy, and human expert evaluation of generated recipes is slow and costly, blocking large-scale development of ML tools for synthesis prediction.

Main Contribution

Open Materials Guide (OMG): 17K expert-verified synthesis recipes from open-access literature.

AlchemyBench: an end-to-end benchmark and task suite for synthesis prediction (materials, equipment, procedure, characterization).

LLM-as-a-Judge framework: automated evaluation using LLMs with demonstrated statistical alignment to expert scores.

Empirical study of model types and RAG: comparison of GPT-4o variants and a reasoning model (o3-mini), with and without RAG (up to K=25 retrieved recipes).

Open release of dataset, prompts, and code to support reproducible research.

Key Findings

Released a large, expert-verified synthesis dataset

Numbers: 17,667 recipes extracted (≈62% yield from 28,685 articles)

Expert verification shows high per-item quality but annotator variability

Numbers: Completeness 4.2/5, Correctness 4.7/5, Coherence 4.8/5; ICCs: completeness 0.695, correctness 0.258

LLM-based evaluation aligns better with domain experts than lexical metrics

Numbers: GPT-4o-Aug Pearson = 0.80 (High experts), versus low or negative correlations for BLEU/ROUGE/BERTScore

Reasoning-capable models and RAG improve recipe quality

Numbers: o3-mini (high) mean score rises from 3.759 to 4.001 at K=5 (High Impact set); GPT-4o-Nov keeps improving up to K=25, reaching 3.976

Human evaluation is slow and costly

Numbers: Average 23 minutes per prediction (σ=7.57)
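The ICC figures in the expert-verification finding measure how consistently annotators rate the same items. As a rough illustration, assuming the ICC(3,k) variant named under Metrics (two-way mixed effects, consistency, average of k raters), the statistic can be computed from a plain items-by-raters table. The function name `icc3k` and the sample ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean


def icc3k(ratings):
    """ICC(3,k) for a table with one row per item and one column per rater.

    Uses the standard two-way ANOVA decomposition:
    ICC(3,k) = (MS_rows - MS_error) / MS_rows.
    """
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


if __name__ == "__main__":
    # Made-up 1-5 ratings: four recipes scored by two annotators.
    example = [[4, 5], [3, 4], [5, 5], [2, 4]]
    print(f"ICC(3,k) = {icc3k(example):.3f}")
```

Because this variant measures consistency rather than absolute agreement, a rater who is systematically one point stricter than another does not lower the value; only disagreement about the ordering or spacing of items does.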

Results

Overall score (o3-mini, high reasoning) on High Impact

Value: 3.759 ± 0.407

Baseline: GPT-4o-Nov 3.709 ± 0.410

Overall score (o3-mini, high) on Standard Impact

Value: 3.885 ± 0.377

Baseline: o3-mini (high) on High Impact 3.759 ± 0.407

LLM-as-a-Judge correlation with experts (GPT-4o-Aug)

Value: Pearson = 0.80 (High experts)

Baseline: BLEU/ROUGE/BERTScore (≤0.06 or negative)

RAG effect (o3-mini-high)

Value: Mean 3.759 → 4.001 at K=5

Baseline: K=0 (no retrieval) mean 3.759

Extraction quality (expert scores)

Value: Completeness 4.2/5; Correctness 4.7/5; Coherence 4.8/5

Baseline: N/A
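The judge-versus-expert correlations reported above compare two score lists for the same set of predictions. A minimal sketch of that comparison is shown here, using Pearson correlation on the raw scores and Spearman correlation on their ranks. The helper names and the score lists are illustrative placeholders, not the paper's evaluation code or data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder scores for five predictions (not from the paper).
    expert_scores = [2.0, 3.0, 3.5, 4.0, 4.5]
    judge_scores = [2.5, 3.0, 3.0, 4.0, 5.0]
    print("Pearson:", round(pearson(expert_scores, judge_scores), 3))
    print("Spearman:", round(spearman(expert_scores, judge_scores), 3))
```

The same two functions can be pointed at a lexical metric's scores instead of the judge's, which gives a like-for-like comparison between the two kinds of evaluator.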

Who Should Care

What To Try In 7 Days

Download OMG and inspect recipes for your target domain to bootstrap a fine-tuning dataset.

Run a small RAG workflow (K=5) using your candidate LLM and retrieved similar recipes to see immediate quality gains.

Use GPT-4o-Aug as an automated judge to rank candidate recipes before sending top picks to a lab for validation.
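The RAG step suggested above can be prototyped without any external service. The sketch below assumes a simple bag-of-words cosine similarity for retrieval, which is a stand-in chosen for illustration and not the retriever used by the authors. The function names, the prompt wording, and the tiny corpus are all hypothetical; in practice the corpus would be recipes loaded from OMG and the prompt would be sent to your candidate LLM.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase whitespace tokenization (deliberately naive)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=5):
    """Return the k corpus entries most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(target, references):
    """Assemble a generation prompt that lists retrieved recipes as context."""
    lines = ["Reference recipes:"]
    for i, ref in enumerate(references, start=1):
        lines.append(f"[{i}] {ref}")
    lines.append("")
    lines.append(f"Propose a synthesis recipe for: {target}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical one-line stand-ins for full recipes.
    corpus = [
        "sol-gel synthesis of tio2 thin films",
        "solid-state synthesis of lifepo4 cathode powder",
        "hydrothermal growth of zno nanorods",
        "solid-state synthesis of limn2o4 cathode",
    ]
    target = "LiFePO4 cathode solid-state synthesis"
    print(build_prompt(target, retrieve(target, corpus, k=2)))
```

Swapping `cosine` for an embedding-based similarity changes only the `retrieve` function, so the value of K can be varied independently of the retriever.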

Reproducibility

License

  • CC-BY

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Open-access sampling bias: overrepresents certain domains (e.g., batteries).
  • LLM extraction and evaluation can introduce subtle inaccuracies in stoichiometry and sequencing.
  • Inter-expert variability limits a single ground truth for nuanced procedure details.
  • LLM-based scoring may overlook practical lab constraints and interpretability needs.

When Not To Use

  • For safety-critical or high-risk experimental recipes without independent expert review.
  • When legal access to proprietary or closed-access procedures is required.
  • If you need fully interpretable decision traces for regulatory compliance.

Failure Modes

  • Hallucinated reagent quantities or incorrect temperatures in generated procedures.
  • Bias toward overrepresented synthesis methods in OMG (domain skew).
  • LLM judge sensitivity to prompt phrasing yielding inconsistent scores.
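One way to watch for the prompt-sensitivity failure mode listed above is to score each recipe under several paraphrases of the judge prompt and flag recipes whose scores spread widely. The sketch below assumes the scores have already been collected. The function name, the 0.5 threshold, and the sample scores are hypothetical choices for illustration, not values from the paper.

```python
from statistics import pstdev


def unstable_items(scores_by_item, threshold=0.5):
    """Return recipes whose judge scores vary more than `threshold`.

    scores_by_item maps a recipe id to a list of scores, one score per
    judge-prompt variant. The spread is the population standard deviation.
    """
    spread = {rid: pstdev(scores) for rid, scores in scores_by_item.items()}
    return {rid: sd for rid, sd in spread.items() if sd > threshold}


if __name__ == "__main__":
    # Made-up scores from three prompt variants per recipe.
    scores = {
        "recipe-a": [4.0, 4.0, 4.5],
        "recipe-b": [2.0, 4.0, 5.0],
    }
    for rid, sd in unstable_items(scores).items():
        print(f"{rid}: spread {sd:.2f} across prompt variants")
```

Recipes flagged this way are reasonable candidates for expert review rather than automated ranking.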

Core Entities

Models

  • GPT-4o-mini
  • GPT-4o-Aug
  • GPT-4o-Nov
  • o3-mini

Metrics

  • BLEU
  • ROUGE-L
  • BERTScore
  • Pearson correlation
  • Spearman correlation
  • ICC (3,k)

Datasets

  • Open Materials Guide (OMG)
  • AlchemyBench

Benchmarks

  • AlchemyBench