17K open-access synthesis recipes + an LLM-as-a-Judge benchmark to scale materials synthesis evaluation

Overview

Decision Snapshot: Needs Validation

The dataset and LLM-judge are ready for early adoption in research pipelines, but expect domain gaps and verification steps before lab deployment.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

License: CC-BY

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Heegyu Kim, Taeyang Jeon, Seungtaek Choi, Ji Hoon Hong, Dong Won Jeon, Ga-Yeon Baek, Gyeong-Won Kwak, Dong-Hee Lee, Jisu Bae, Chihoon Lee, Yunseo Kim, Seon-Jin Choi, Jin-Seong Park, Sung Beom Cho, Hyunsouk Cho

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A legally redistributable, large-scale recipe dataset plus automated LLM judging reduces costly expert review and speeds development of ML tools that propose lab-ready synthesis steps.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors release Open Materials Guide (OMG), a 17K expert-verified dataset of materials synthesis recipes and AlchemyBench, an end-to-end benchmark that uses LLMs as automated judges to score synthesis predictions. They show LLM-based scoring (especially GPT-4o variants) aligns substantially better with expert judgments than standard text-overlap metrics, and that retrieval-augmented generation (RAG) improves recipe quality (best trade-off at K=5). The dataset, code, and evaluation prompts are public, but extraction and LLM-evaluation carry domain biases and inter-expert variability.
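
One way to check the claim that a judge's scores track expert judgments better than a text-overlap metric is to correlate each score list with the expert scores, using the Pearson and Spearman correlations listed under Metrics. The sketch below is a minimal, dependency-free illustration. The score lists are made-up placeholders, not values from the paper, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder scores for five generated recipes (illustrative only).
expert = [4.5, 3.0, 2.0, 4.0, 1.5]        # expert ratings on a 1-5 scale
judge = [4.2, 3.1, 2.5, 3.8, 2.0]         # LLM-judge ratings, same scale
overlap = [0.30, 0.35, 0.28, 0.20, 0.31]  # a text-overlap score in [0, 1]

print("judge vs expert:  ", pearson(expert, judge), spearman(expert, judge))
print("overlap vs expert:", pearson(expert, overlap), spearman(expert, overlap))
```

On your own data, replace the three lists with per-recipe scores from your experts, your judge model, and your overlap metric.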

Problem Statement

Materials synthesis is still driven by trial-and-error and expert intuition. Existing text-mined datasets are often incomplete or noisy, and human expert evaluation of generated recipes is slow and costly, blocking large-scale development of ML tools for synthesis prediction.

Main Contribution

Open Materials Guide (OMG): 17K expert-verified synthesis recipes from open-access literature.

AlchemyBench: an end-to-end benchmark and task suite for synthesis prediction (materials, equipment, procedure, characterization).

Key Findings

Released a large, expert-verified synthesis dataset

Numbers: 17,667 recipes extracted (≈62% yield from 28,685 articles)

Practical Use: You can train and test synthesis models on a much larger, legally redistributable corpus instead of tiny, noisy datasets.

Evidence Ref: Abstract; Section 2.2

Expert verification shows high per-item quality but annotator variability

Numbers: Completeness 4.2/5, Correctness 4.7/5, Coherence 4.8/5; ICCs: completeness 0.695, correctness 0.258

Practical Use: Dataset entries are useful but expect noisy interpretations of minor details; add extra validation for critical parameters like stoichiometry or temperature.

Evidence Ref: Table 1; Section 2.3
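
The ICC figures above use the ICC (3,k) form listed under Metrics. As a minimal sketch, assuming the standard Shrout and Fleiss definition (two-way mixed effects, consistency, average of k raters), it can be computed from a complete items-by-raters matrix as (MS_rows - MS_error) / MS_rows. The function name and the toy ratings are illustrative and are not the paper's code or data.

```python
def icc3k(ratings):
    """ICC(3,k) for a complete matrix: one row per item, one column per rater."""
    n = len(ratings)        # number of rated items
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Toy case 1: raters differ by a constant offset, so consistency is perfect.
consistent = [[1, 2], [2, 3], [3, 4]]

# Toy case 2: every rating is 4 or 5, but the raters disagree on which items
# deserve the 5. Mean scores are high while the ICC is close to zero.
near_ceiling = [[5, 4], [4, 5], [5, 5], [4, 4]]

print(icc3k(consistent), icc3k(near_ceiling))
```

The second toy case shows that a high average rating and a low ICC can coexist, which is the pattern reported for correctness (4.7/5 with an ICC of 0.258). Whether the same mechanism explains the paper's numbers is not something this sketch can establish.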

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall score (o3-mini, high reasoning) on High Impact	3.759 ± 0.407	GPT-4o-Nov 3.709 ± 0.410	0.050	High Impact set	Table 5 (overall scores)	Table 5
Overall score (o3-mini, high) on Standard Impact	3.885 ± 0.377	o3-mini (high) on High Impact 3.759 ± 0.407	0.126	Standard Impact set	Table 5 (overall scores)	Table 5

What To Try In 7 Days

Download OMG and inspect recipes for your target domain to bootstrap a fine-tuning dataset.

Run a small RAG workflow (K=5) using your candidate LLM and retrieved similar recipes to see immediate quality gains.

Use GPT-4o-Aug as an automated judge to rank candidate recipes before sending top picks to a lab for validation.
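
The RAG and judge steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AlchemyBench pipeline: retrieval uses a plain bag-of-words cosine similarity as a stand-in for a real retriever, the corpus strings are placeholders rather than OMG entries, and `judge` is a hypothetical callable that you would back with your own LLM call. Every function name here is made up for the example.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Lowercased bag-of-words counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=5):
    """Return the k corpus entries most similar to the query (K=5 by default)."""
    q = bow(query)
    ordered = sorted(corpus, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ordered[:k]

def build_prompt(target, references):
    """Assemble a generation prompt from the retrieved reference recipes."""
    refs = "\n\n".join(
        f"Reference {i + 1}:\n{r}" for i, r in enumerate(references)
    )
    return f"{refs}\n\nPropose a synthesis recipe for: {target}"

def rank_candidates(candidates, judge):
    """Sort candidate recipes by judge score, best first.

    `judge` is a hypothetical callable mapping a recipe string to a number,
    for example a wrapper around an LLM call that returns a 1-5 score.
    """
    return sorted(candidates, key=judge, reverse=True)

# Placeholder corpus (illustrative strings, not real OMG records).
corpus = [
    "sol-gel synthesis of TiO2 thin film",
    "hydrothermal growth of ZnO nanorods",
    "solid-state synthesis of LiFePO4 cathode",
]

references = retrieve("sol-gel TiO2", corpus, k=2)
prompt = build_prompt("TiO2 thin film", references)
print(prompt)
```

Send `prompt` to your candidate LLM, collect several generated recipes, then pass them to `rank_candidates` with your judge wrapper and forward only the top-ranked recipes for expert and lab validation.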

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: CC-BY

Code URLs

https://github.com/HeegyuKim/AlchemyBench

Data URLs

https://github.com/HeegyuKim/AlchemyBench

Risks & Boundaries

Limitations

Open-access sampling bias: overrepresents certain domains (e.g., batteries).

LLM extraction and evaluation can introduce subtle inaccuracies in stoichiometry and sequencing.

When Not To Use

For safety-critical or high-risk experimental recipes without independent expert review.

When legal access to proprietary or closed-access procedures is required.

Failure Modes

Hallucinated reagent quantities or incorrect temperatures in generated procedures.

Bias toward overrepresented synthesis methods in OMG (domain skew).

Core Entities

Models

GPT-4o-mini, GPT-4o-Aug, GPT-4o-Nov, o3-mini

Metrics

BLEU, ROUGE-L, BERTScore, Pearson correlation, Spearman correlation, ICC (3,k)

Datasets

Open Materials Guide (OMG), AlchemyBench

Benchmarks

AlchemyBench

