17K open-access synthesis recipes + an LLM-as-a-Judge benchmark to scale materials synthesis evaluation

February 23, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The dataset and LLM-judge are ready for early adoption in research pipelines, but expect domain gaps and verification steps before lab deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

License: CC-BY

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Ji Hoon Hong, Dong Won Jeon, Ga-Yeon Baek, Gyeong-Won Kwak, Dong-Hee Lee, Jisu Bae, Chihoon Lee, Yunseo Kim, Seon-Jin Choi, Jin-Seong Park, Sung Beom Cho, Hyunsouk Cho

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A legally redistributable, large-scale recipe dataset plus automated LLM judging reduces costly expert review and speeds development of ML tools that propose lab-ready synthesis steps.

Summary TLDR

The authors release Open Materials Guide (OMG), a dataset of about 17K expert-verified materials synthesis recipes, and AlchemyBench, an end-to-end benchmark that uses LLMs as automated judges to score synthesis predictions. They show that LLM-based scoring (especially GPT-4o variants) aligns substantially better with expert judgments than standard text-overlap metrics, and that retrieval-augmented generation (RAG) improves recipe quality (best trade-off at K=5). The dataset, code, and evaluation prompts are public, but LLM-based extraction and evaluation carry domain biases, and expert ratings themselves vary between annotators.
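Agreement between an automated judge and human experts is usually summarized with the Pearson and Spearman correlations listed under Metrics. The following is a minimal sketch of both, using only the standard library. The score lists are hypothetical placeholders, not values from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five recipes (illustration only).
expert_scores = [4.5, 3.0, 2.0, 4.0, 3.5]
judge_scores = [4.0, 3.5, 2.5, 4.5, 3.0]

print(f"Pearson:  {pearson(expert_scores, judge_scores):.3f}")
print(f"Spearman: {spearman(expert_scores, judge_scores):.3f}")
```

The same two functions can be applied to a text-overlap metric (for example BLEU scores per recipe) to compare how closely each scoring method tracks the expert ratings.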

Problem Statement

Materials synthesis is still driven by trial-and-error and expert intuition. Existing text-mined datasets are often incomplete or noisy, and human expert evaluation of generated recipes is slow and costly, blocking large-scale development of ML tools for synthesis prediction.

Main Contribution

Open Materials Guide (OMG): 17K expert-verified synthesis recipes from open-access literature.

AlchemyBench: an end-to-end benchmark and task suite for synthesis prediction (materials, equipment, procedure, characterization).
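To make the four task components concrete, here is a hypothetical record layout for a single recipe. The class name, field names, and types are assumptions for illustration; the released OMG schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SynthesisRecipe:
    """Hypothetical container mirroring the four AlchemyBench components."""

    target: str  # material being synthesized (assumed field)
    materials: list[str] = field(default_factory=list)
    equipment: list[str] = field(default_factory=list)
    procedure: list[str] = field(default_factory=list)
    characterization: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the components that are empty in this record."""
        sections = {
            "materials": self.materials,
            "equipment": self.equipment,
            "procedure": self.procedure,
            "characterization": self.characterization,
        }
        return [name for name, items in sections.items() if not items]


# Illustrative entry, not taken from the dataset.
recipe = SynthesisRecipe(
    target="ZnO nanorods",
    materials=["zinc nitrate hexahydrate", "hexamethylenetetramine"],
    equipment=["autoclave"],
    procedure=["dissolve precursors in water", "heat in sealed autoclave"],
)
print(recipe.missing_sections())
```

A completeness check like `missing_sections` is one cheap filter to run before any model-based scoring.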

Key Findings

Released a large, expert-verified synthesis dataset

Numbers: 17,667 recipes extracted (≈62% yield from 28,685 articles)

Practical Use: You can train and test synthesis models on a much larger, legally redistributable corpus instead of tiny, noisy datasets.

Evidence Ref: Abstract; Section 2.2
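The yield figure follows directly from the two counts above. A quick check, with variable names chosen here for illustration:

```python
recipes_extracted = 17_667
articles_processed = 28_685

# Fraction of processed articles that produced a usable recipe.
yield_fraction = recipes_extracted / articles_processed
print(f"Extraction yield: {yield_fraction:.1%}")  # prints "Extraction yield: 61.6%"
```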

Expert verification shows high per-item quality but annotator variability

Numbers: Completeness 4.2/5, Correctness 4.7/5, Coherence 4.8/5; ICCs: completeness 0.695, correctness 0.258

Practical Use: Dataset entries are useful but expect noisy interpretations of minor details; add extra validation for critical parameters like stoichiometry or temperature.

Evidence Ref: Table 1; Section 2.3
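For readers who want to reproduce this kind of agreement statistic on their own annotations, below is a minimal sketch of ICC(3,k), the variant listed under Metrics (two-way mixed effects, consistency, average of k raters). It assumes a complete ratings matrix with one row per recipe and one column per rater. The ratings are toy values, not data from the paper.

```python
def icc3k(ratings):
    """ICC(3,k) = (MS_rows - MS_error) / MS_rows for an n x k ratings matrix.

    Requires n >= 2 items, k >= 2 raters, and some variation between items.
    """
    n, k = len(ratings), len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Toy ratings: 6 items scored by 4 raters (illustration only).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(3,k) = {icc3k(toy_ratings):.2f}")
```

Because this form measures consistency rather than absolute agreement, a rater who is systematically one point stricter than the others does not lower the score.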

Results

Metric: Overall score (o3-mini, high reasoning) on High Impact
Value: 3.759 ± 0.407
Baseline: GPT-4o-Nov, 3.709 ± 0.410
Delta: +0.050
Split / Dataset: High Impact set
Evidence: Table 5 (overall scores)
Evidence Ref: Table 5

Metric: Overall score (o3-mini, high) on Standard Impact
Value: 3.885 ± 0.377
Baseline: o3-mini (high) on High Impact, 3.759 ± 0.407
Delta: +0.126
Split / Dataset: Standard Impact set
Evidence: Table 5 (overall scores)
Evidence Ref: Table 5

What To Try In 7 Days

Download OMG and inspect recipes for your target domain to bootstrap a fine-tuning dataset.

Run a small RAG workflow (K=5) using your candidate LLM and retrieved similar recipes to see immediate quality gains.

Use GPT-4o-Aug as an automated judge to rank candidate recipes before sending top picks to a lab for validation.
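The retrieval and judging steps above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: retrieval here uses word-overlap (Jaccard) similarity as a stand-in for whatever retriever you choose, and `judge` is a hypothetical callable that in practice would wrap a call to an LLM such as GPT-4o-Aug and return a numeric score. All function names and the tiny corpus are assumptions.

```python
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization (deliberately simple)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_top_k(query: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the k corpus entries most similar to the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus, key=lambda doc: jaccard(query_tokens, tokenize(doc)), reverse=True
    )
    return ranked[:k]


def build_prompt(target: str, references: list[str]) -> str:
    """Assemble a generation prompt from retrieved reference recipes."""
    refs = "\n\n".join(
        f"Reference recipe {i}:\n{ref}" for i, ref in enumerate(references, 1)
    )
    return f"{refs}\n\nPropose a synthesis recipe for: {target}"


def rank_candidates(
    candidates: list[str], judge: Callable[[str], float], top_n: int = 3
) -> list[tuple[float, str]]:
    """Score each candidate with the judge and keep the top_n highest."""
    scored = [(judge(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Tiny stand-in corpus; in practice this would be OMG recipes.
corpus = [
    "sol-gel synthesis of LiFePO4 cathode powder",
    "hydrothermal growth of ZnO nanorods",
    "solid-state synthesis of LiCoO2 cathode",
]
references = retrieve_top_k("LiFePO4 cathode sol-gel", corpus, k=2)
prompt = build_prompt("LiFePO4 cathode", references)
print(prompt)
```

Only the top-ranked candidates returned by `rank_candidates` would then go to a human expert or the lab, which is where the review savings come from.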

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: CC-BY

Risks & Boundaries

Limitations

Open-access sampling bias: overrepresents certain domains (e.g., batteries).

LLM extraction and evaluation can introduce subtle inaccuracies in stoichiometry and sequencing.

When Not To Use

For safety-critical or high-risk experimental recipes without independent expert review.

When legal access to proprietary or closed-access procedures is required.

Failure Modes

Hallucinated reagent quantities or incorrect temperatures in generated procedures.

Bias toward overrepresented synthesis methods in OMG (domain skew).

Core Entities

Models

GPT-4o-mini, GPT-4o-Aug, GPT-4o-Nov, o3-mini

Metrics

BLEU, ROUGE-L, BERTScore, Pearson correlation, Spearman correlation, ICC (3,k)

Datasets

Open Materials Guide (OMG), AlchemyBench

Benchmarks

AlchemyBench