A systematic benchmark showing where GPT-style LLMs help — and where they fail — on practical chemistry tasks

May 27, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.35

Cost Impact Score

0.5

Citation Count

91

Authors

Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

Why It Matters For Business

LLMs can speed up human-in-the-loop chemistry tasks (text descriptions, candidate generation, reagent ranking) with few-shot prompts, but they are not yet reliable drop-in replacements for specialized models or automation pipelines where exact SMILES or reaction outcomes are needed.

Summary TLDR

This paper builds a practical benchmark of eight chemistry tasks (name translation, property prediction, yield prediction, reaction prediction, retrosynthesis, reagents selection, text-based molecule design, and molecule captioning) and evaluates five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama2-13B-chat, GAL-30B) in zero-shot and few-shot settings. Main findings: GPT-4 is the best generalist; LLMs do well at classification/ranking and language-style tasks but fail at SMILES-heavy generative tasks (reaction, retrosynthesis, name translation). In-context learning (ICL) with scaffold-based example retrieval and more examples consistently helps. The repo is available for replication.

Problem Statement

Can off-the-shelf large language models (LLMs) solve practical chemistry tasks, and which types of chemistry problems are they suitable for? The study tests LLMs across eight tasks to map strengths, limits, and prompting strategies.

Main Contribution

A public benchmark that evaluates LLMs on eight practical chemistry tasks using common datasets and metrics.

Systematic analysis of zero-shot vs few-shot (ICL) prompts, retrieval (random vs scaffold), and example counts.

Actionable findings: GPT-4 leads overall; LLMs are competitive for classification/ranking and text tasks but poor at SMILES-to-SMILES generation; both the quality and the number of ICL examples matter.
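The random-versus-similarity retrieval comparison can be illustrated with a minimal sketch. It is not the benchmark's code: `char_ngrams` is a toy stand-in fingerprint chosen so the example runs without a chemistry toolkit, and all function names and the prompt layout are hypothetical. A real setup would presumably swap in a structure-aware fingerprint or scaffold representation.

```python
import random
from typing import Callable, List, Set, Tuple

Example = Tuple[str, str]  # (SMILES, label); hypothetical record layout


def char_ngrams(smiles: str, n: int = 3) -> Set[str]:
    """Toy fingerprint (assumption): the set of character n-grams in the string."""
    if len(smiles) < n:
        return {smiles}
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_similar(
    query: str,
    pool: List[Example],
    k: int,
    fingerprint: Callable[[str], Set[str]] = char_ngrams,
) -> List[Example]:
    """Return the k pool examples most similar to the query molecule."""
    q = fingerprint(query)
    ranked = sorted(pool, key=lambda ex: tanimoto(q, fingerprint(ex[0])), reverse=True)
    return ranked[:k]


def retrieve_random(pool: List[Example], k: int, seed: int = 0) -> List[Example]:
    """Baseline: draw k examples uniformly at random (seeded for repeatability)."""
    return random.Random(seed).sample(pool, min(k, len(pool)))


def build_prompt(question: str, examples: List[Example], query: str) -> str:
    """Assemble a few-shot prompt: task question, k solved examples, then the query."""
    blocks = [question]
    blocks += [f"SMILES: {s}\nAnswer: {label}" for s, label in examples]
    blocks.append(f"SMILES: {query}\nAnswer:")
    return "\n\n".join(blocks)
```

Keeping the fingerprint as a parameter lets the same retrieval and prompt code be reused when the toy n-gram features are replaced by a proper molecular fingerprint.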

Key Findings

GPT-4 ranks best across the eight chemistry tasks.

Numbers: Average rank: GPT-4 = 1.25 (Table 2).

LLMs perform poorly on generative tasks that require precise SMILES handling.

Numbers: Reaction prediction: GPT-4 Top-1 = 0.23 vs Chemformer baseline = 0.938 (Table 11).

LLMs can be competitive on classification/ranking tasks when prompted well.

Numbers: Yield prediction: GPT-4 (random, k=8) accuracy 0.800 vs UAGNN 0.965 on Buchwald-Hartwig (Table 10).

In-context learning (ICL) reliably improves performance; scaffold retrieval and more examples help.

Numbers: Property prediction (BBBP): GPT-4 accuracy with scaffold retrieval, k=8 = 0.614 vs zero-shot = 0.476 (Table 7).

SELFIES is less effective than SMILES for current LLMs trained on general corpora.

Numbers: Property prediction (BBBP) F1: SMILES 0.587 vs SELFIES 0.541 (Table 16).

LLMs can generate chemically valid molecules but hallucinate chemical facts and may propose harmful compounds.

Numbers: Text-based molecule design: validity >89% but exact match <20% (Table 14 and discussion).

Results

Reaction prediction Top-1 accuracy

Value: GPT-4 (Scaffold, k=20) = 0.230 ± 0.022

Baseline: Chemformer = 0.938

Yield prediction accuracy

Value: GPT-4 (random, k=8): Buchwald-Hartwig = 0.800 ± 0.008; Suzuki = 0.764 ± 0.013

Baseline: UAGNN: Buchwald-Hartwig = 0.965; Suzuki = 0.957

Property prediction (BBBP) accuracy and F1

Value: GPT-4 (Scaffold, k=8): Accuracy = 0.614 ± 0.016; F1 = 0.587 ± 0.018

Baseline: XGBoost Accuracy = 0.850; RF F1 = 0.881

Text-based molecule design BLEU, validity, and exact match

Value: GPT-4 (Scaffold, k=10): BLEU = 0.816 ± 0.004; Validity = 0.888 ± 0.023; Exact match = 0.174 ± 0.029

Baseline: MolT5-Large: BLEU = 0.601; Validity = 0.940; Exact match = 0.290

Molecule captioning BLEU-4

Value: GPT-4 (Scaffold, k=10) = 0.365 ± 0.008

Baseline: MolT5-Large = 0.383
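The gap between high validity and low exact match in molecule design is easier to see when both rates are computed side by side. The sketch below is an illustration, not the paper's evaluation script: the function names are hypothetical, and it assumes exact match is judged on canonical SMILES. `rdkit_canonical` relies on RDKit's `Chem.MolFromSmiles` (which returns `None` for unparsable input) and `Chem.MolToSmiles`; any other canonicalizer with the same signature can be passed in.

```python
from typing import Callable, List, Optional, Tuple

Canonicalizer = Callable[[str], Optional[str]]


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Canonical SMILES via RDKit, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def validity_and_exact_match(
    predictions: List[str],
    references: List[str],
    canonicalize: Canonicalizer,
) -> Tuple[float, float]:
    """Return (validity rate, exact-match rate) over paired predictions and references.

    A prediction is valid if it canonicalizes; it is an exact match if its
    canonical form equals the canonical form of the reference (assumption).
    """
    if not predictions or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal length")
    valid = exact = 0
    for pred, ref in zip(predictions, references):
        canon_pred = canonicalize(pred)
        if canon_pred is None:
            continue  # invalid output counts against both rates
        valid += 1
        canon_ref = canonicalize(ref)
        if canon_ref is not None and canon_pred == canon_ref:
            exact += 1
    n = len(predictions)
    return valid / n, exact / n
```

With RDKit installed, a call would look like `validity_and_exact_match(preds, refs, rdkit_canonical)`. A model can score well on the first number while scoring poorly on the second, because a parseable molecule is not necessarily the requested one.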

Who Should Care

What To Try In 7 Days

Run GPT-4 few-shot prompts for reagent selection and quick yield triage using scaffold-based example retrieval.

Use LLMs to draft molecule descriptions or creative ideas, then filter candidates with RDKit or specialized property models.

Set up a guarded workflow: LLM proposal -> chemical validity checks (RDKit) -> human review for safety.
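The guarded workflow in the last step might be wired up roughly as follows. This is a sketch under stated assumptions: `ReviewQueue`, `screen_candidates`, and the `blocklist` argument are hypothetical names, the LLM call itself is left out, and the validity check is passed in as a function (for example an RDKit-based one that returns canonical SMILES, or `None` when `Chem.MolFromSmiles` fails to parse). Nothing here replaces the human safety review; it only decides what reaches the reviewer.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, Iterable, List, Optional, Set


@dataclass
class ReviewQueue:
    """Outcome of automatic screening (hypothetical container)."""
    pending: List[str] = field(default_factory=list)   # canonical SMILES awaiting human review
    rejected: List[str] = field(default_factory=list)  # raw strings that failed a check


def screen_candidates(
    candidates: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
    blocklist: FrozenSet[str] = frozenset(),
) -> ReviewQueue:
    """Filter LLM-proposed SMILES before a chemist sees them.

    Steps: strip whitespace, reject strings that do not canonicalize,
    reject anything on the blocklist, drop duplicates, queue the rest.
    """
    queue = ReviewQueue()
    seen: Set[str] = set()
    for raw in candidates:
        smiles = raw.strip()
        canon = canonicalize(smiles) if smiles else None
        if canon is None or canon in blocklist:
            queue.rejected.append(raw)
            continue
        if canon in seen:
            continue  # same molecule proposed twice; keep one copy
        seen.add(canon)
        queue.pending.append(canon)
    return queue
```

Deduplicating on the canonical form rather than the raw string matters because an LLM can write the same molecule in several ways, and reviewers should see each structure once.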

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLMs struggle to parse and generate exact SMILES and IUPAC names; they treat SMILES as text tokens rather than structured chemistry.
  • Evaluation metrics borrowed from NLP do not always reflect chemical utility (exact-match matters in chemistry).
  • API token limits, query cost, and randomness limited experiment scale and hyperparameter sweeps.
  • Models tested include closed commercial models; performance may change with other models, including ones pretrained on domain data.

When Not To Use

  • For production tasks that require exact SMILES outputs (retrosynthesis, reaction product generation, name translation).
  • When safety-critical or legally restricted chemical outputs could be produced without strong safeguards.
  • As a sole decision-maker for high-stakes yield optimization or synthesis planning without expert verification.

Failure Modes

  • Hallucinated molecules or chemical facts that look plausible but are chemically invalid.
  • High rate of invalid SMILES in zero-shot or poorly prompted runs (e.g., 17% invalid SMILES in zero-shot reaction prediction).
  • Overreliance on label wording in prompts (models exploit label semantics rather than chemical structure).
  • Degraded performance on out-of-distribution inputs and on long SMILES strings, which are tokenized into unhelpful subwords.

Core Entities

Models

  • GPT-4
  • GPT-3.5 (gpt-3.5-turbo)
  • Davinci-003
  • Llama2-13B-chat
  • GAL-30B (Galactica)

Metrics

  • Accuracy
  • F1
  • BLEU
  • Exact Match
  • Levenshtein
  • FCD
  • Validity
  • Invalid SMILES %
  • ROUGE
  • METEOR
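Of the metrics listed above, Levenshtein distance is the simplest to reproduce by hand. The sketch below is the standard dynamic-programming edit distance; applying it character by character to SMILES strings is an assumption made here for illustration, since the summary does not say how strings were tokenized before scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

A lower distance means the predicted string is closer to the reference, but a small edit to a SMILES string can still describe a different molecule, which is one reason exact match is reported alongside it.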

Datasets

  • BBBP
  • HIV
  • BACE
  • Tox21
  • ClinTox
  • Buchwald-Hartwig
  • Suzuki-Miyaura
  • USPTO-MIT
  • USPTO-50k
  • ChEBI-20
  • PubChem