ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

February 10, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The work combines a large curated instruction dataset, a purpose-built benchmark, and a tuned 7B model; code and weights are released, but independent replication and downstream safety checks are needed before production deployment.

Citations: 40

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/8

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, Dongzhan Zhou, Shufei Zhang, Mao Su, Han-Sen Zhong, Yuqiang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Who Should Care

Summary TLDR

The authors build ChemLLM, a chemistry-focused LLM based on InternLM2-Base-7B and tuned on a new instruction dataset (ChemData) plus general corpora. ChemData contains ~7M instruction Q&A pairs covering molecules, reactions, and domain tasks. They also release ChemBench, a set of 4,100 multiple-choice questions over nine chemistry tasks, designed to reduce output-style bias in evaluation. ChemLLM is trained with a two-stage pipeline (general → mixed chemical+general) using LoRA and DeepSpeed ZeRO++. The authors report that ChemLLM outperforms open models of similar size, exceeds GPT-3.5, and matches or exceeds GPT-4 on most ChemBench tasks. They also report general-benchmark scores: MMLU 65.6, C-Eval 67.2, GSM8K 67.2, C-MHChem 76.4.

Problem Statement

General LLMs lack chemistry-specific knowledge and struggle with structured chemical data (databases, SMILES). The field also lacks a fair benchmark that tests chemistry skills while avoiding output-style scoring bias. The paper aims to build a chemistry-tuned LLM, a large instruction dataset, and a multiple-choice benchmark to measure chemical competency.

Main Contribution

ChemData: a template-based instruction tuning dataset (authors report ~7M Q&A) converting structured chemical data into dialogue-style instructions.

ChemBench: a 4,100-item multiple-choice benchmark covering nine core molecule and reaction tasks to reduce evaluation bias from generative output style.

Key Findings

ChemData size and scope

Numbers: 7M instruction Q&A (authors' dataset summary)

Practical Use: You can fine-tune or evaluate on ~7M chemistry-style instruction examples to teach an LLM molecular and reaction tasks.

Evidence Ref: Fig. 2, main text ChemData section
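To make the template-based conversion concrete, here is a minimal sketch of turning one structured record into dialogue-style instruction pairs. The template wording, task names, and the `record_to_dialogue` helper are all hypothetical illustrations; the actual ChemData templates are not reproduced in this summary.

```python
import random

# Hypothetical templates (question, answer) per task; not the authors' actual templates.
TEMPLATES = {
    "name_to_smiles": [
        ("What is the SMILES string of {name}?", "The SMILES of {name} is {smiles}."),
        ("Convert the compound name {name} to SMILES.", "{smiles}"),
    ],
    "smiles_to_name": [
        ("Which compound does the SMILES {smiles} represent?", "It represents {name}."),
    ],
}


def record_to_dialogue(record, task, rng):
    """Turn one structured record into a single-turn instruction pair."""
    question_tpl, answer_tpl = rng.choice(TEMPLATES[task])
    return {
        "task": task,
        "instruction": question_tpl.format(**record),
        "response": answer_tpl.format(**record),
    }


rng = random.Random(0)  # fixed seed so the sampled template is reproducible
record = {"name": "ethanol", "smiles": "CCO"}
pairs = [record_to_dialogue(record, task, rng) for task in TEMPLATES]
for pair in pairs:
    print(pair["instruction"], "->", pair["response"])
```

Sampling among several phrasings per task is one simple way to vary the wording, though it does not remove the template bias noted under Limitations.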

ChemBench composition

Numbers: 4,100 multiple-choice questions across nine tasks

Practical Use: Use this MCQ set to measure factual chemical knowledge while minimizing reward for writing style.

Evidence Ref: Fig. 2, main text ChemBench section
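A multiple-choice benchmark can be scored by extracting the chosen option letter and comparing it with the key, which is what makes it insensitive to writing style. The sketch below is an assumption-laden illustration, not the ChemBench or OpenCompass scoring code: the regular expressions, the item fields (`task`, `answer`), and the task names are hypothetical.

```python
import re
from collections import defaultdict

# Heuristic patterns (assumptions): an explicit "answer is X" statement,
# or an output that starts with a bare option letter such as "B." or "(C)".
STATED = re.compile(r"[Aa]nswer\s*(?:is|:)\s*\(?([A-D])\b")
LEADING = re.compile(r"^\s*\(?([A-D])\)?\s*(?:[.:)]|$)")


def extract_choice(model_output):
    """Return the option letter A-D found in the output, or None if none is found."""
    match = STATED.search(model_output) or LEADING.search(model_output)
    return match.group(1) if match else None


def score(items, outputs):
    """Compute per-task accuracy from answer keys and raw model outputs."""
    per_task = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for item, output in zip(items, outputs):
        tally = per_task[item["task"]]
        tally[0] += extract_choice(output) == item["answer"]
        tally[1] += 1
    return {task: correct / total for task, (correct, total) in per_task.items()}


# Hypothetical items and outputs for illustration only.
items = [
    {"task": "name_conversion", "answer": "B"},
    {"task": "name_conversion", "answer": "C"},
    {"task": "property_prediction", "answer": "A"},
]
outputs = ["B. CCO", "The answer is D", "I think the answer is (A)."]
accuracy = score(items, outputs)
print(accuracy)
```

Outputs with no recognizable letter count as wrong here; a real harness may handle refusals or malformed answers differently.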

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ChemData size | 7M instruction Q&A | n/a | n/a | ChemData | Authors' dataset summary | Fig. 2
ChemBench size | 4,100 multiple-choice questions | n/a | n/a | ChemBench | Main text and Fig. 2 | Fig. 2

What To Try In 7 Days

Run ChemBench via OpenCompass to evaluate your in-house models quickly

Fine-tune a 7B open model with a small slice of ChemData templates for molecule tasks

Test ChemLLM translations and literature-extraction examples on your domain texts

Agent Features

Architectures
domain-specific LLM, autoregressive transformer

Optimization Features

Infra Optimization
2 machines × 8 A100 GPUs (authors' cluster)
Model Optimization
LoRA, mixed-precision (bfloat16)
System Optimization
SLURM distributed training, per-GPU batch size 8 (total 128)
Training Optimization
DeepSpeed ZeRO++ (parameter slicing/offload), AdamW optimizer, linear LR decay with warmup, NEFTune embedding noise (alpha=5)
Inference Optimization
KV cache, FlashAttention-2
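Two of the settings listed above can be checked or illustrated with a few lines of code. The sketch below first reproduces the total batch size from the reported hardware figures, assuming no gradient accumulation, and then shows the NEFTune idea of adding uniform noise scaled by alpha / sqrt(L × d) to token embeddings. It uses plain Python lists for clarity; the `neftune` helper is illustrative and is not the authors' training code.

```python
import math
import random

# Figures from the list above; gradient accumulation of 1 is an assumption.
MACHINES, GPUS_PER_MACHINE, PER_GPU_BATCH = 2, 8, 8
global_batch = MACHINES * GPUS_PER_MACHINE * PER_GPU_BATCH
print(global_batch)  # 128, matching the reported total


def neftune(embeddings, alpha=5.0, rng=random):
    """Add NEFTune-style noise to one sequence of token embeddings.

    embeddings: list of L token vectors, each a list of d floats.
    Each element receives Uniform(-1, 1) noise scaled by alpha / sqrt(L * d).
    """
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


# Toy example: 4 tokens with 4-dimensional embeddings, so scale = 5 / 4 = 1.25.
clean = [[0.0] * 4 for _ in range(4)]
noisy = neftune(clean, alpha=5.0, rng=random.Random(0))
print(max(abs(x) for row in noisy for x in row) <= 1.25)
```

The noise is applied only during training; at inference the embeddings are used unchanged.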

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

ChemBench focuses on multiple-choice tasks and may not capture free-form generation issues or lab-safety risks

SMILES and IUPAC conversions can be brittle; template-generated dialog may carry template bias

When Not To Use

Do not use as a sole decision-maker for lab protocols or experimental execution without expert verification

Avoid trusting raw molecular designs or synthesis paths without chemical validation and simulation

Failure Modes

Incorrect SMILES/IUPAC generation or syntax errors

Overconfident but wrong property predictions or reaction outcomes
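A cheap first guard against the SMILES syntax failure mode is a structural check on generated strings before any chemistry is attempted. The sketch below is a hypothetical heuristic that only checks balanced parentheses, balanced bracket atoms, and paired ring-closure labels; it does not validate valences, aromaticity, or atom symbols. A cheminformatics parser is still needed for real validation (for example, RDKit's `Chem.MolFromSmiles` returns `None` when parsing fails).

```python
DIGITS = "0123456789"


def smiles_syntax_issues(smiles):
    """Return a list of cheap syntactic problems; an empty list means none were found."""
    issues = []
    depth = 0            # number of currently open '(' branches
    in_bracket = False   # True while inside a '[...]' bracket atom
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside brackets are isotopes, H counts, or charges, not ring labels.
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                issues.append("nested '['")
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            issues.append("']' without '['")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                issues.append("')' without '('")
            else:
                depth -= 1
        elif ch == "%":
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and all(c in DIGITS for c in label):
                open_rings ^= {int(label)}  # toggle: opens or closes the ring
                i += 2
            else:
                issues.append("'%' not followed by two digits")
        elif ch in DIGITS:
            open_rings ^= {int(ch)}
        i += 1
    if in_bracket:
        issues.append("unclosed '['")
    if depth:
        issues.append("unclosed '('")
    if open_rings:
        labels = ", ".join(str(n) for n in sorted(open_rings))
        issues.append("unclosed ring closure(s): " + labels)
    return issues


for candidate in ["c1ccccc1", "CC(=O)O", "C1CCCC", "CC(C"]:
    print(candidate, smiles_syntax_issues(candidate))
```

An empty result only means no structural problem was detected, so strings that pass should still go through a full parser and expert review.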

Core Entities

Models

ChemLLM, InternLM2-Base-7B, InternLM2-Chat-7B, LLaMA2, Mistral, ChatGLM3, Qwen, GPT-3.5, GPT-4

Metrics

Accuracy, cross-entropy loss, training steps/epochs

Datasets

ChemData, Multi-Corpus, FireFly, OpenOrca, UltraChat, ChemBench, MMLU, C-Eval, GSM8K, C-MHChem

Benchmarks

ChemBench, MMLU, C-Eval, GSM8K, C-MHChem