Overview
The benchmark uses public datasets and repeated trials; results reliably show strengths (classification, text) and weaknesses (SMILES generation), but closed-source model access and token limits constrain full replication.
Citations: 91
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 40%
Novelty: 35%
Why It Matters For Business
LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.
Summary TLDR
This paper builds a practical benchmark of eight chemistry tasks (name translation, property prediction, yield prediction, reaction prediction, retrosynthesis, reagents selection, text-based molecule design, and molecule captioning) and evaluates five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama2-13B-chat, GAL-30B) in zero-shot and few-shot settings. Main findings: GPT-4 is the best generalist; LLMs do well at classification/ranking and language-style tasks but fail at SMILES-heavy generative tasks (reaction, retrosynthesis, name translation). In-context learning (ICL) with scaffold-based example retrieval and more examples consistently helps. The repo is available for replication.
Problem Statement
Can off-the-shelf large language models (LLMs) solve practical chemistry tasks, and which types of chemistry problems are they suitable for? The study tests LLMs across eight tasks to map strengths, limits, and prompting strategies.
Main Contribution
A public benchmark that evaluates LLMs on eight practical chemistry tasks using common datasets and metrics.
Systematic analysis of zero-shot vs few-shot (ICL) prompts, retrieval (random vs scaffold), and example counts.
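The random and scaffold retrieval strategies compared above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: it assumes each candidate example already carries a precomputed set of scaffold features (in practice these might come from a cheminformatics toolkit; here they are hypothetical string tokens), and the names `Example`, `tanimoto`, `retrieve_random`, `retrieve_by_scaffold`, and `build_prompt` are invented for this sketch. The toy pool and its labels are made up.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    smiles: str               # input molecule or reaction
    label: str                # answer shown in the prompt
    scaffold_bits: frozenset  # hypothetical precomputed scaffold features


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_random(pool, k, seed=0):
    """Baseline: sample k examples uniformly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(pool), min(k, len(pool)))


def retrieve_by_scaffold(pool, query_bits, k):
    """Pick the k examples whose scaffold features are most similar to the query."""
    ranked = sorted(
        pool,
        key=lambda ex: tanimoto(ex.scaffold_bits, query_bits),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(examples, query_smiles):
    """Format the retrieved examples as few-shot demonstrations."""
    shots = "\n".join(f"Input: {ex.smiles}\nOutput: {ex.label}" for ex in examples)
    return f"{shots}\nInput: {query_smiles}\nOutput:"


# Toy data, invented for illustration only.
pool = [
    Example("c1ccccc1O", "yes", frozenset({"benzene"})),
    Example("C1CCCCC1N", "no", frozenset({"cyclohexane"})),
    Example("c1ccc2ccccc2c1", "yes", frozenset({"benzene", "naphthalene"})),
]
top = retrieve_by_scaffold(pool, frozenset({"benzene"}), k=2)
prompt = build_prompt(top, "c1ccccc1N")
```

Raising `k` corresponds to the "more examples" setting; the only difference between the two strategies is how the `k` demonstrations are chosen.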
Key Findings
GPT-4 ranks best across the eight chemistry tasks.
LLMs perform poorly on generative tasks that require precise SMILES handling.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4 (Scaffold, k=20) = 0.230 ± 0.022; Chemformer baseline = 0.938 | Chemformer (0.938) | −0.708 | USPTO-MIT | Table 11 | Table 11 |
| Accuracy | GPT-4 (random, k=8): Buchwald-Hartwig = 0.800 ± 0.008; Suzuki = 0.764 ± 0.013 | UAGNN: Buchwald-Hartwig 0.965; Suzuki 0.957 | −0.165 (Buchwald-Hartwig); −0.193 (Suzuki) | Buchwald-Hartwig, Suzuki-Miyaura HTE | Table 10 | Table 10 |
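The Delta column is the LLM score minus the baseline score. The short check below recomputes it from the values in the table; the dictionary layout and key names are only illustrative.

```python
# (LLM accuracy, baseline accuracy) pairs copied from the table above.
results = {
    "USPTO-MIT (GPT-4 vs Chemformer)": (0.230, 0.938),
    "Buchwald-Hartwig (GPT-4 vs UAGNN)": (0.800, 0.965),
    "Suzuki-Miyaura (GPT-4 vs UAGNN)": (0.764, 0.957),
}

# Delta = LLM score - baseline score, rounded to the table's precision.
deltas = {name: round(llm - base, 3) for name, (llm, base) in results.items()}

for name, delta in deltas.items():
    # Multiply by 100 to express the gap in percentage points.
    print(f"{name}: {delta:+.3f} ({delta * 100:+.1f} pp)")
```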
What To Try In 7 Days
Run GPT-4 few-shot prompts for reagent selection and quick yield triage using scaffold-based example retrieval.
Use LLMs to draft molecule descriptions or creative ideas, then filter candidates with RDKit or specialized property models.
Set up a guarded workflow: LLM proposal -> chemical validity checks (RDKit) -> human review for safety.
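The guarded workflow in the last item can be sketched as a small pipeline. This is a sketch under assumptions: `propose` stands in for whatever LLM call is used, and `is_valid` is a pluggable validity check. With RDKit installed, a common choice is `lambda s: Chem.MolFromSmiles(s) is not None`, since `MolFromSmiles` returns `None` for SMILES it cannot parse. The toy validator below only checks bracket balance so that the example runs without dependencies; it is not a real chemical validity check, and all function names are hypothetical.

```python
from typing import Callable, Iterable, List


def balanced_brackets(smiles: str) -> bool:
    """Toy stand-in validator: non-empty, with balanced () and []."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return bool(smiles) and not stack


def guarded_candidates(
    propose: Callable[[], Iterable[str]],
    is_valid: Callable[[str], bool] = balanced_brackets,
) -> List[str]:
    """Return deduplicated, validity-checked proposals queued for human review."""
    review_queue = []
    seen = set()
    for smiles in propose():
        smiles = smiles.strip()
        if smiles in seen or not is_valid(smiles):
            continue  # drop duplicates and invalid strings before review
        seen.add(smiles)
        review_queue.append(smiles)
    return review_queue


def fake_llm():
    # Hypothetical model output with one malformed string and one duplicate.
    return ["CCO", "CC(=O)O", "CC(=O", "CCO "]


queue = guarded_candidates(fake_llm)
```

Nothing in this sketch replaces the human review step: the function only narrows what reaches a chemist, and a safety screen would be a further filter passed in the same way as `is_valid`.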
Risks & Boundaries
Limitations
LLMs struggle to parse and generate exact SMILES and IUPAC names; they treat SMILES as text tokens rather than structured chemistry.
Evaluation metrics borrowed from NLP do not always reflect chemical utility (exact-match matters in chemistry).
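The gap between text-similarity metrics and exact match can be illustrated with a toy comparison. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in for an NLP-style similarity score (it is not one of the metrics used in the paper) and assumes both strings are already written in the same canonical SMILES form, so that string equality is a fair proxy for molecule identity.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, ref: str) -> float:
    """1.0 only if the predicted string is identical to the reference."""
    return 1.0 if pred == ref else 0.0


def text_similarity(pred: str, ref: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for NLP-style metrics."""
    return SequenceMatcher(None, pred, ref).ratio()


reference = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin
prediction = "CC(=O)Oc1ccccc1C(=O)N"  # last atom changed: a different molecule

em = exact_match(prediction, reference)
sim = text_similarity(prediction, reference)
print(f"exact match: {em}, text similarity: {sim:.3f}")
```

A single changed atom leaves the text similarity high while the exact-match score drops to zero, which is the mismatch the limitation describes.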
When Not To Use
For production tasks that require exact SMILES outputs (retrosynthesis, reaction product generation, name translation).
When safety-critical or legally restricted chemical outputs could be produced without strong safeguards.
Failure Modes
Hallucinated molecules or chemical facts that look plausible but are chemically invalid.
High rate of invalid SMILES in zero-shot or poorly prompted runs (e.g., 17% invalid SMILES in zero-shot reaction prediction).

