Overview
Production Readiness
0.4
Novelty Score
0.35
Cost Impact Score
0.5
Citation Count
91
Why It Matters For Business
LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.
Summary TLDR
This paper builds a practical benchmark of eight chemistry tasks (name translation, property prediction, yield prediction, reaction prediction, retrosynthesis, reagent selection, text-based molecule design, and molecule captioning) and evaluates five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama2-13B-chat, GAL-30B) in zero-shot and few-shot settings. Main findings: GPT-4 is the best generalist; LLMs do well at classification/ranking and language-style tasks but fail at SMILES-heavy generative tasks (reaction prediction, retrosynthesis, name translation). In-context learning (ICL) consistently helps, especially with scaffold-based example retrieval and more examples. The code repository is available for replication.
Problem Statement
Can off-the-shelf large language models (LLMs) solve practical chemistry tasks, and which types of chemistry problems are they suitable for? The study tests LLMs across eight tasks to map strengths, limits, and prompting strategies.
Main Contribution
A public benchmark that evaluates LLMs on eight practical chemistry tasks using common datasets and metrics.
Systematic analysis of zero-shot vs few-shot (ICL) prompts, retrieval (random vs scaffold), and example counts.
Actionable findings: GPT-4 leads overall; LLMs are competitive for classification/ranking and text tasks but poor at SMILES-to-SMILES generation; both the quality and the number of ICL examples matter.
Key Findings
GPT-4 ranks best across the eight chemistry tasks.
LLMs perform poorly on generative tasks that require precise SMILES handling.
LLMs can be competitive on classification/ranking tasks when prompted well.
In-context learning (ICL) reliably improves performance; scaffold retrieval and more examples help.
SELFIES is less effective than SMILES for current LLMs trained on general corpora.
LLMs can generate chemically valid molecules but hallucinate chemical facts and may propose harmful compounds.
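The retrieval-plus-few-shot idea behind the ICL finding can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the prompt layout, and the example data are hypothetical, and the character-bigram similarity is a deliberately simple stand-in for scaffold-based similarity, which would normally need a cheminformatics toolkit.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input SMILES, answer text); hypothetical layout


def bigram_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of character bigrams.

    ASSUMPTION: this is NOT scaffold similarity; it only keeps the sketch
    dependency-free. Swap in a scaffold-based score for real use.
    """
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def retrieve_examples(
    query: str,
    pool: List[Example],
    k: int,
    similarity: Callable[[str, str], float] = bigram_jaccard,
) -> List[Example]:
    """Return the k pool examples most similar to the query (ties keep pool order)."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex[0]), reverse=True)
    return ranked[:k]


def build_few_shot_prompt(instruction: str, query: str, examples: List[Example]) -> str:
    """Lay out the instruction, the worked examples, then the unanswered query."""
    lines = [instruction, ""]
    for smiles, answer in examples:
        lines += [f"SMILES: {smiles}", f"Answer: {answer}", ""]
    lines += [f"SMILES: {query}", "Answer:"]
    return "\n".join(lines)


# Hypothetical toy pool; labels are placeholders, not real measurements.
pool = [("CCO", "Yes"), ("c1ccccc1", "No"), ("CCN", "Yes")]
shots = retrieve_examples("CCOC", pool, k=2)
prompt = build_few_shot_prompt("Answer Yes or No for each molecule.", "CCOC", shots)
```

Raising `k` adds more examples to the prompt, and passing a different `similarity` callable changes the retrieval strategy without touching the prompt builder.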
Results
Accuracy
Text-based molecule design BLEU and Validity
Molecule captioning BLEU-4
Who Should Care
What To Try In 7 Days
Run GPT-4 few-shot prompts for reagent selection and quick yield triage using scaffold-based example retrieval.
Use LLMs to draft molecule descriptions or creative ideas, then filter candidates with RDKit or specialized property models.
Set up a guarded workflow: LLM proposal -> chemical validity checks (RDKit) -> human review for safety.
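A guarded workflow of this shape could be wired up roughly as below. It is a sketch under stated assumptions: `triage_candidates`, `Triage`, and `balanced_parentheses` are hypothetical names, and the parenthesis check is only a placeholder for a real validity check. With RDKit installed, one could instead pass `lambda s: Chem.MolFromSmiles(s) is not None` as `is_valid`, since `Chem.MolFromSmiles` returns `None` for strings it cannot parse.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Triage:
    for_review: List[str]  # passed the automatic check; still needs a human
    rejected: List[str]    # failed the automatic check


def balanced_parentheses(smiles: str) -> bool:
    """Placeholder check only: branch parentheses must be balanced.

    ASSUMPTION: this is far weaker than real chemical validation and is used
    here just to keep the sketch free of third-party dependencies.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def triage_candidates(
    candidates: Iterable[str],
    is_valid: Callable[[str], bool] = balanced_parentheses,
) -> Triage:
    """Clean LLM proposals, drop blanks and duplicates, then split by validity."""
    seen = set()
    result = Triage(for_review=[], rejected=[])
    for raw in candidates:
        smiles = raw.strip()
        if not smiles or smiles in seen:
            continue
        seen.add(smiles)
        (result.for_review if is_valid(smiles) else result.rejected).append(smiles)
    return result


# Hypothetical LLM output containing a duplicate, a broken string, and a blank.
proposals = ["CCO", "CC(=O", "CCO", " CC(C)O ", ""]
triaged = triage_candidates(proposals)
```

Nothing in `for_review` is approved automatically; the list is the queue handed to a chemist for the safety review step.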
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLMs struggle to parse and generate exact SMILES and IUPAC names; they treat SMILES as text tokens rather than structured chemistry.
- Evaluation metrics borrowed from NLP do not always reflect chemical utility (exact-match matters in chemistry).
- API token limits, query cost, and randomness limited experiment scale and hyperparameter sweeps.
- Models tested include closed commercial models; performance may differ with other models or with models pretrained on chemistry-domain data.
When Not To Use
- For production tasks that require exact SMILES outputs (retrosynthesis, reaction product generation, name translation).
- When safety-critical or legally restricted chemical outputs could be produced without strong safeguards.
- As a sole decision-maker for high-stakes yield optimization or synthesis planning without expert verification.
Failure Modes
- Hallucinated molecules or chemical facts that look plausible but are chemically invalid or incorrect.
- High rate of invalid SMILES in zero-shot or poorly prompted runs (e.g., 17% invalid SMILES in zero-shot reaction prediction).
- Overreliance on label wording in prompts (models exploit label semantics rather than chemical structure).
- Degraded performance on out-of-distribution or large SMILES strings tokenized into unhelpful subwords.
Core Entities
Models
- GPT-4
- GPT-3.5 (gpt-3.5-turbo)
- Davinci-003
- Llama2-13B-chat
- GAL-30B (Galactica)
Metrics
- Accuracy
- F1
- BLEU
- Exact Match
- Levenshtein
- FCD
- Validity
- Invalid SMILES %
- ROUGE
- METEOR
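Three of the string-level metrics listed above (Exact Match, Levenshtein, Invalid SMILES %) are simple enough to sketch directly. The function names are hypothetical and this is not the benchmark's own evaluation code. One caveat: the same molecule can be written as different SMILES strings, so a real evaluation would typically canonicalize predictions and references first; the sketch compares raw strings.

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions identical to their reference string."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


def invalid_percentage(predictions: Sequence[str], is_valid: Callable[[str], bool]) -> float:
    """Percentage of predictions rejected by the supplied validity check.

    ASSUMPTION: `is_valid` is provided by the caller (for example an
    RDKit-based parser check); no chemistry is implemented here.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    bad = sum(not is_valid(p) for p in predictions)
    return 100.0 * bad / len(predictions)
```

For example, `levenshtein("CCO", "CCN")` is 1 (one substitution), while `exact_match_rate(["CCO"], ["CCN"])` is 0.0, which illustrates why a near-miss in NLP terms still counts as a failure when the exact structure matters.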
Datasets
- BBBP
- HIV
- BACE
- Tox21
- ClinTox
- Buchwald-Hartwig
- Suzuki-Miyaura
- USPTO-MIT
- USPTO-50k
- ChEBI-20
- PubChem

