A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

May 27, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark uses public datasets and repeated trials; results reliably show strengths (classification, text) and weaknesses (SMILES generation), but closed-source model access and token limits constrain full replication.

Citations: 91

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 40%

Novelty: 35%

Authors

Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Summary TLDR

This paper builds a practical benchmark of eight chemistry tasks (name translation, property prediction, yield prediction, reaction prediction, retrosynthesis, reagent selection, text-based molecule design, and molecule captioning) and evaluates five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama2-13B-chat, GAL-30B) in zero-shot and few-shot settings. Main findings: GPT-4 is the best generalist; LLMs do well on classification, ranking, and language-style tasks but fail on SMILES-heavy generative tasks (reaction prediction, retrosynthesis, name translation). In-context learning (ICL) improves consistently with scaffold-based example retrieval and with more examples. The code repository is available for replication.

Problem Statement

Can off-the-shelf large language models (LLMs) solve practical chemistry tasks, and which types of chemistry problems are they suitable for? The study tests LLMs across eight tasks to map strengths, limits, and prompting strategies.

Main Contribution

A public benchmark that evaluates LLMs on eight practical chemistry tasks using common datasets and metrics.

Systematic analysis of zero-shot vs few-shot (ICL) prompts, retrieval (random vs scaffold), and example counts.
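The few-shot (ICL) setup compared here can be pictured as: pick k worked examples, either at random or by similarity to the query, then place them ahead of the query in the prompt. The sketch below is a minimal illustration under stated assumptions, not the benchmark's code. The names `retrieve_examples` and `build_prompt`, the prompt layout, and the answer labels are hypothetical. The similarity function is a plain string-similarity stand-in from the standard library; the paper's scaffold-based retrieval compares molecular scaffolds, which would need a cheminformatics toolkit.

```python
import random
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]. This is NOT scaffold similarity."""
    return SequenceMatcher(None, a, b).ratio()


def retrieve_examples(query, pool, k, strategy="similar", seed=0):
    """Pick k (input, answer) pairs from pool, at random or by similarity."""
    if strategy == "random":
        rng = random.Random(seed)  # seeded so repeated trials are reproducible
        return rng.sample(pool, min(k, len(pool)))
    # Rank the pool by similarity of each example input to the query.
    ranked = sorted(pool, key=lambda ex: similarity(query, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(instruction, examples, query):
    """Lay out instruction, worked examples, then the unanswered query."""
    lines = [instruction, ""]
    for example_input, example_answer in examples:
        lines += [f"Input: {example_input}", f"Output: {example_answer}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


# Hypothetical toy pool: the labels are placeholders, not benchmark data.
pool = [("CCO", "label_A"), ("c1ccccc1", "label_B"),
        ("CCN", "label_A"), ("c1ccccc1O", "label_B")]
shots = retrieve_examples("c1ccccc1N", pool, k=2)
print(build_prompt("Classify the molecule.", shots, "c1ccccc1N"))
```

Changing `strategy` to `"random"` or raising `k` reproduces, in miniature, the two knobs the study varies: retrieval method and example count.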

Key Findings

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

Practical Use: Start experiments with GPT-4 if using a general LLM for chemistry tasks; expect a measurable performance gap vs smaller LLMs.

Evidence Ref: Table 2; Section 4.2

LLMs perform poorly on generative tasks that require precise SMILES handling.

Numbers: Reaction prediction: GPT-4 Top-1 = 0.23 vs Chemformer baseline = 0.938 (Table 11).

Practical Use: Do not use vanilla LLM outputs for product/reactant generation or name translation in production; use specialized models or tool pipelines (e.g., Chemformer, RDKit) instead.

Evidence Ref: Table 11; Sections 4.1 and 5.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4 (Scaffold, k=20) = 0.230 ± 0.022 | Chemformer = 0.938 | −0.708 | USPTO-MIT | Table 11 | Table 11
Accuracy | GPT-4 (random, k=8): Buchwald-Hartwig = 0.800 ± 0.008; Suzuki = 0.764 ± 0.013 | UAGNN: Buchwald-Hartwig = 0.965; Suzuki = 0.957 | ≈ −16 to −20 percentage points | Buchwald-Hartwig, Suzuki-Miyaura HTE | Table 10 | Table 10
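The Delta values are the model score minus the baseline score. The short sketch below recomputes them from the reported accuracies so the gaps can be checked. The helper names are illustrative, and `Decimal` is used only to avoid floating-point noise in the last digit.

```python
from decimal import Decimal


def delta(model_score: str, baseline_score: str) -> Decimal:
    """Model accuracy minus baseline accuracy, computed exactly."""
    return Decimal(model_score) - Decimal(baseline_score)


def to_percentage_points(d: Decimal) -> Decimal:
    """Express an accuracy difference (0-1 scale) in percentage points."""
    return d * 100


# Accuracies as reported in the results above (model, baseline).
rows = {
    "USPTO-MIT, GPT-4 vs Chemformer": ("0.230", "0.938"),
    "Buchwald-Hartwig, GPT-4 vs UAGNN": ("0.800", "0.965"),
    "Suzuki-Miyaura, GPT-4 vs UAGNN": ("0.764", "0.957"),
}

for name, (model, baseline) in rows.items():
    d = delta(model, baseline)
    print(f"{name}: delta = {d} ({to_percentage_points(d)} pp)")
```

On these figures the two yield-prediction gaps come to 16.5 and 19.3 percentage points, which is what the rounded "≈ −16 to −20" range summarizes.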

What To Try In 7 Days

Run GPT-4 few-shot prompts for reagent selection and quick yield triage using scaffold-based example retrieval.

Use LLMs to draft molecule descriptions or creative ideas, then filter candidates with RDKit or specialized property models.

Set up a guarded workflow: LLM proposal -> chemical validity checks (RDKit) -> human review for safety.
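The guarded workflow in the last item can be sketched as a small triage step: every LLM proposal is either rejected as invalid or queued for human review, and nothing is accepted automatically. This is a minimal sketch under stated assumptions. `Candidate`, `triage`, and `invalid_rate` are hypothetical names, and `toy_syntax_check` is a placeholder that only checks parenthesis balance. In practice the `is_valid` argument would wrap a real parser, for example testing whether RDKit's `Chem.MolFromSmiles` returns a molecule rather than `None`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    smiles: str
    status: str  # "rejected_invalid" or "needs_human_review"


def toy_syntax_check(smiles: str) -> bool:
    """Placeholder validator: non-empty with balanced parentheses.

    This is NOT a chemical validity check; swap in a real parser.
    """
    if not smiles:
        return False
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis with no matching open
                return False
    return depth == 0


def triage(proposals: Iterable[str],
           is_valid: Callable[[str], bool] = toy_syntax_check) -> List[Candidate]:
    """Route each proposal to rejection or human review; never auto-accept."""
    results = []
    for raw in proposals:
        smiles = raw.strip()
        status = "needs_human_review" if is_valid(smiles) else "rejected_invalid"
        results.append(Candidate(smiles, status))
    return results


def invalid_rate(candidates: List[Candidate]) -> float:
    """Fraction of proposals rejected by the validity check."""
    if not candidates:
        return 0.0
    rejected = sum(c.status == "rejected_invalid" for c in candidates)
    return rejected / len(candidates)


batch = triage(["CCO", "CC(=O", " c1ccccc1 ", ""])
for c in batch:
    print(c)
print("invalid rate:", invalid_rate(batch))
```

Tracking the invalid rate per prompt style gives a cheap signal of prompt quality before any human time is spent on review.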

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

LLMs struggle to parse and generate exact SMILES and IUPAC names; they treat SMILES as text tokens rather than structured chemistry.

Evaluation metrics borrowed from NLP do not always reflect chemical utility (exact-match matters in chemistry).
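A small worked example shows why string-similarity metrics can mislead in chemistry. The sketch below implements Levenshtein distance and compares it with exact match on two SMILES strings. The function names are illustrative, and the normalization (one minus distance over the longer length) is one common convention, assumed here rather than taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_match(prediction: str, reference: str) -> bool:
    """Plain string equality, with no canonicalization."""
    return prediction == reference


reference = "CCO"   # ethanol
prediction = "CCN"  # ethylamine: one character away, a different molecule
print(levenshtein(prediction, reference))
print(normalized_similarity(prediction, reference))
print(exact_match(prediction, reference))
```

The prediction differs by a single character and scores well on similarity, yet it is a different compound. The reverse also happens: `OCC` and `CCO` both denote ethanol but fail a plain string comparison, so exact match is usually more meaningful after canonicalizing both strings with a cheminformatics toolkit.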

When Not To Use

For production tasks that require exact SMILES outputs (retrosynthesis, reaction product generation, name translation).

When safety-critical or legally restricted chemical outputs could be produced without strong safeguards.

Failure Modes

Hallucinated molecules or chemical facts that look plausible but are chemically invalid.

High rate of invalid SMILES in zero-shot or poorly prompted runs (e.g., 17% invalid SMILES in zero-shot reaction prediction).

Core Entities

Models

GPT-4, GPT-3.5 (gpt-3.5-turbo), Davinci-003, Llama2-13B-chat, GAL-30B (Galactica)

Metrics

Accuracy, F1, BLEU, Exact Match, Levenshtein, FCD, Validity, Invalid SMILES %, ROUGE, METEOR

Datasets

BBBP, HIV, BACE, Tox21, ClinTox, Buchwald-Hartwig, Suzuki-Miyaura, USPTO-MIT, USPTO-50k, ChEBI-20, PubChem