A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

Overview

Decision Snapshot: Needs Validation

The benchmark uses public datasets and repeated trials; results reliably show strengths (classification, text) and weaknesses (SMILES generation), but closed-source model access and token limits constrain full replication.

Citations: 91

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 35%

Authors

Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

This paper builds a practical benchmark of eight chemistry tasks (name translation, property prediction, yield prediction, reaction prediction, retrosynthesis, reagent selection, text-based molecule design, and molecule captioning) and evaluates five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama2-13B-chat, GAL-30B) in zero-shot and few-shot settings. Main findings: GPT-4 is the best generalist; LLMs do well at classification, ranking, and language-style tasks but fail at SMILES-heavy generative tasks (reaction prediction, retrosynthesis, name translation). In-context learning (ICL) helps consistently, especially with scaffold-based example retrieval and more examples. The code repository is available for replication.

Problem Statement

Can off-the-shelf large language models (LLMs) solve practical chemistry tasks, and which types of chemistry problems are they suitable for? The study tests LLMs across eight tasks to map strengths, limits, and prompting strategies.

Main Contribution

A public benchmark that evaluates LLMs on eight practical chemistry tasks using common datasets and metrics.

Systematic analysis of zero-shot vs few-shot (ICL) prompts, retrieval (random vs scaffold), and example counts.
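The retrieval-plus-ICL setup described above can be illustrated with a minimal sketch. It assumes each molecule (or scaffold) has already been turned into a fingerprint, represented here as a set of "on" bit indices, and ranks the example pool by Tanimoto similarity to the query. The `Example` class, the function names, and the `Input:`/`Output:` prompt layout are hypothetical and are not taken from the benchmark's code; how fingerprints are computed is left out of scope.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Example:
    """One labeled demonstration for a few-shot prompt (hypothetical schema)."""
    input: str
    output: str
    fp: FrozenSet[int]  # assumed: indices of the "on" bits of a fingerprint


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_examples(query_fp: FrozenSet[int], pool: List[Example], k: int) -> List[Example]:
    """Return the k pool examples most similar to the query fingerprint."""
    ranked = sorted(pool, key=lambda ex: tanimoto(query_fp, ex.fp), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, examples: List[Example], query_input: str) -> str:
    """Assemble a few-shot prompt: instruction, k demonstrations, then the query."""
    lines = [instruction, ""]
    for ex in examples:
        lines += [f"Input: {ex.input}", f"Output: {ex.output}", ""]
    lines += [f"Input: {query_input}", "Output:"]
    return "\n".join(lines)
```

Swapping `retrieve_examples` for a random sample of the pool gives the random-retrieval condition, and varying `k` changes the number of in-context examples.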

Key Findings

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

Practical Use: Start experiments with GPT-4 if using a general LLM for chemistry tasks; expect a measurable performance gap vs smaller LLMs.

Evidence Ref: Table 2; Section 4.2

LLMs perform poorly on generative tasks that require precise SMILES handling.

Numbers: Reaction prediction: GPT-4 Top-1 = 0.23 vs Chemformer baseline = 0.938 (Table 11).

Practical Use: Do not use vanilla LLM outputs for product/reactant generation or name translation in production; use specialized models or tool pipelines (e.g., Chemformer, RDKit) instead.

Evidence Ref: Table 11; Sections 4.1 and 5.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence Ref
Accuracy	GPT-4 (scaffold, k=20): 0.230 ± 0.022	Chemformer: 0.938	−0.708	USPTO-MIT	Table 11
Accuracy	GPT-4 (random, k=8): Buchwald-Hartwig 0.800 ± 0.008; Suzuki-Miyaura 0.764 ± 0.013	UAGNN: Buchwald-Hartwig 0.965; Suzuki-Miyaura 0.957	−0.165 (Buchwald-Hartwig); −0.193 (Suzuki-Miyaura), i.e. roughly 16 to 20 percentage points lower	Buchwald-Hartwig, Suzuki-Miyaura HTE	Table 10
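The deltas in the table are simple differences between the GPT-4 score and the baseline score. As a quick check, the sketch below recomputes them from the values reported in the table; the variable and function names are illustrative only.

```python
# (dataset, GPT-4 accuracy, baseline accuracy), copied from the Results table
rows = [
    ("USPTO-MIT", 0.230, 0.938),         # baseline: Chemformer
    ("Buchwald-Hartwig", 0.800, 0.965),  # baseline: UAGNN
    ("Suzuki-Miyaura", 0.764, 0.957),    # baseline: UAGNN
]


def delta(model: float, baseline: float) -> float:
    """Model minus baseline, rounded to three decimals."""
    return round(model - baseline, 3)


deltas = {name: delta(model, baseline) for name, model, baseline in rows}

for name, d in deltas.items():
    print(f"{name}: {d:+.3f} ({d * 100:+.1f} percentage points)")
```

The yield-prediction gaps come out at about 16.5 and 19.3 percentage points, while the reaction-prediction gap is about 70.8 points.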

What To Try In 7 Days

Run GPT-4 few-shot prompts for reagent selection and quick yield triage using scaffold-based example retrieval.

Use LLMs to draft molecule descriptions or creative ideas, then filter candidates with RDKit or specialized property models.

Set up a guarded workflow: LLM proposal -> chemical validity checks (RDKit) -> human review for safety.
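The guarded workflow above could be prototyped along the following lines. This is a sketch under stated assumptions, not the benchmark's pipeline: `triage` and `invalid_rate` are hypothetical helpers, the validity check is passed in as a function so it can be swapped, and `rdkit_is_valid` assumes RDKit is installed and treats a SMILES string as valid when `Chem.MolFromSmiles` returns a molecule rather than `None`.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_is_valid(smiles: str) -> bool:
    """Validity check via RDKit (assumes the `rdkit` package is installed)."""
    from rdkit import Chem  # imported lazily so the helpers below run without RDKit
    return Chem.MolFromSmiles(smiles) is not None


def triage(
    candidates: Iterable[str], is_valid: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split LLM proposals into (queued for human review, rejected as invalid)."""
    review: List[str] = []
    rejected: List[str] = []
    seen = set()
    for raw in candidates:
        smiles = raw.strip()
        if not smiles or smiles in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(smiles)
        (review if is_valid(smiles) else rejected).append(smiles)
    return review, rejected


def invalid_rate(review: List[str], rejected: List[str]) -> float:
    """Fraction of distinct proposals that failed the validity check."""
    total = len(review) + len(rejected)
    return len(rejected) / total if total else 0.0
```

A typical call would be `review, rejected = triage(llm_outputs, rdkit_is_valid)`, where `llm_outputs` is whatever list of strings the model returned. Passing validity only means the string parses; it says nothing about safety or synthesizability, which is why the human review step stays in the loop.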

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ChemFoundationModels/ChemLLMBench

Data URLs

https://github.com/ChemFoundationModels/ChemLLMBench (links to datasets and scripts)

Risks & Boundaries

Limitations

LLMs struggle to parse and generate exact SMILES and IUPAC names; they treat SMILES as text tokens rather than structured chemistry.

Evaluation metrics borrowed from NLP do not always reflect chemical utility (exact-match matters in chemistry).
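A small illustration of why string-level metrics can mislead on SMILES: the sketch below compares a reference string against two predictions using exact match and Levenshtein distance. The three SMILES strings are chosen for illustration and do not come from the benchmark; `levenshtein` is a plain dynamic-programming implementation written for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


reference = "CCO"  # ethanol
near_miss = "CCN"  # ethylamine: one character away, but a different molecule
reordered = "OCC"  # ethanol again, written starting from the oxygen

for name, pred in [("near_miss", near_miss), ("reordered", reordered)]:
    print(name, "exact match:", pred == reference,
          "| edit distance:", levenshtein(reference, pred))
```

Here the wrong molecule scores closer to the reference (distance 1) than the correct molecule written differently (distance 2), and exact match rejects both. Comparing canonicalized SMILES, for example via RDKit, is one common way to reduce the second problem.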

When Not To Use

For production tasks that require exact SMILES outputs (retrosynthesis, reaction product generation, name translation).

When safety-critical or legally restricted chemical outputs could be produced without strong safeguards.

Failure Modes

Hallucinated molecules or chemical facts that look plausible but are chemically invalid.

High rate of invalid SMILES in zero-shot or poorly prompted runs (e.g., 17% invalid SMILES in zero-shot reaction prediction).

Core Entities

Models

GPT-4, GPT-3.5 (gpt-3.5-turbo), Davinci-003, Llama2-13B-chat, GAL-30B (Galactica)

Metrics

Accuracy, F1, BLEU, Exact Match, Levenshtein, FCD, Validity, Invalid SMILES %, ROUGE, METEOR

Datasets

BBBP, HIV, BACE, Tox21, ClinTox, Buchwald-Hartwig, Suzuki-Miyaura, USPTO-MIT, USPTO-50k, ChEBI-20, PubChem
