ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

Overview

Decision Snapshot: Ready For Pilot

The work combines a large curated instruction dataset, a purpose-built benchmark, and a tuned 7B model; code and weights are released, but independent replication and downstream safety checks are needed before production deployment.

Citations: 40

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/8

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 70%

Authors

Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, Dongzhan Zhou, Shufei Zhang, Mao Su, Han-Sen Zhong, Yuqiang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Who Should Care

ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The authors build ChemLLM, a chemistry-focused LLM based on InternLM2-Base-7B and tuned with a new instruction dataset (ChemData) plus general corpora. ChemData contains ~7M instruction Q&A pairs across molecules, reactions, and domain tasks. They also release ChemBench: 4,100 multiple-choice questions over nine chemistry tasks, designed to reduce output-style bias. ChemLLM is trained with a two-stage pipeline (general → mixed chemical+general) using LoRA and DeepSpeed ZeRO++. The authors report that ChemLLM outperforms similar-size open models, exceeds GPT-3.5, and matches or exceeds GPT-4 on most ChemBench tasks. They also report general-benchmark scores: MMLU 65.6, C-Eval 67.2, GSM8K 67.2, C-MHChem 76.4.

Problem Statement

General LLMs lack chemistry-specific knowledge and struggle with structured chemical data (databases, SMILES). The field also lacks a fair benchmark that tests chemistry skills while avoiding output-style scoring bias. The paper aims to build a chemistry-tuned LLM, a large instruction dataset, and a multiple-choice benchmark to measure chemical competency.

Main Contribution

ChemData: a template-based instruction tuning dataset (authors report ~7M Q&A) converting structured chemical data into dialogue-style instructions.

ChemBench: a 4,100-item multiple-choice benchmark covering nine core molecule and reaction tasks to reduce evaluation bias from generative output style.

Key Findings

ChemData size and scope

Numbers: 7M instruction Q&A (authors' dataset summary)

Practical Use: You can fine-tune or evaluate models on the 7M chemistry-style instruction examples to teach an LLM molecular and reaction tasks.

Evidence Ref: Fig. 2, main text ChemData section
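To make the template-based conversion concrete, here is a minimal sketch of how one structured record might be turned into a dialogue-style instruction pair. The template wording, the `TEMPLATES` table, the task name, and the `record_to_dialogue` helper are all hypothetical illustrations, not the authors' actual templates or code.

```python
import random

# Hypothetical question templates; the paper's real templates are not reproduced here.
TEMPLATES = {
    "name_conversion": [
        "What is the IUPAC name of the molecule with SMILES {smiles}?",
        "Convert the SMILES string {smiles} to its IUPAC name.",
    ],
}


def record_to_dialogue(record, task, rng):
    """Turn one structured record into a single-turn instruction/answer pair."""
    # Sampling among several phrasings is one simple way to vary the wording.
    question = rng.choice(TEMPLATES[task]).format(smiles=record["smiles"])
    return {"instruction": question, "output": record["iupac_name"]}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
record = {"smiles": "CCO", "iupac_name": "ethanol"}  # illustrative record
example = record_to_dialogue(record, "name_conversion", rng)
print(example)
```

Using several phrasings per task is a common way to soften the template bias noted under Limitations, though it does not remove it.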

ChemBench composition

Numbers: 4,100 multiple-choice questions across nine tasks

Practical Use: Use this MCQ set to measure factual chemical knowledge while minimizing reward for writing style.

Evidence Ref: Fig. 2, main text ChemBench section
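Multiple-choice scoring reduces to comparing an extracted option letter with the gold answer, which is why writing style has little influence on the score. The sketch below is a simplified illustration of that idea; `extract_choice` and `mcq_accuracy` are hypothetical helpers, and this is not the scoring code used by ChemBench or OpenCompass.

```python
import re


def extract_choice(response, options="ABCD"):
    """Return the first standalone uppercase option letter, or None.

    Crude on purpose: a sentence starting with the article "A" would be
    misread as option A, so real evaluators need stricter parsing.
    """
    match = re.search(rf"\b([{options}])\b", response.strip())
    return match.group(1) if match else None


def mcq_accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Illustrative model outputs and gold labels (not taken from ChemBench).
responses = ["The answer is B.", "C", "(D) ethanol", "not sure"]
gold = ["B", "C", "A", "D"]
print(mcq_accuracy(responses, gold))
```

Responses with no recognizable option letter count as wrong here, which is one reasonable convention among several.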

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ChemData size	7M instruction Q&A	—	—	ChemData	Authors' dataset summary	Fig.2
ChemBench size	4,100 multiple-choice questions	—	—	ChemBench	Main text and Fig.2	Fig.2

What To Try In 7 Days

Run ChemBench via OpenCompass to evaluate your in-house models quickly

Fine-tune a 7B open model with a small slice of ChemData templates for molecule tasks

Test ChemLLM translations and literature-extraction examples on your domain texts

Agent Features

Architectures

domain-specific LLM, autoregressive transformer

Optimization Features

Infra Optimization

2 machines × 8 A100 GPUs (authors' cluster)

Model Optimization

LoRA, mixed-precision (bfloat16)

System Optimization

SLURM distributed training, per-GPU batch size 8 (total 128)
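The reported total batch size follows from the cluster shape and the per-GPU batch size. A short worked check, assuming no gradient accumulation (an assumption that is consistent with the stated total of 128):

```python
machines = 2           # from the Infra Optimization entry
gpus_per_machine = 8   # A100 GPUs per machine
per_gpu_batch = 8      # per-GPU batch size
grad_accum_steps = 1   # assumption: not stated, implied by the total of 128

global_batch = machines * gpus_per_machine * per_gpu_batch * grad_accum_steps
print(global_batch)  # 128
```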

Training Optimization

DeepSpeed ZeRO++ (parameter slicing/offload)

AdamW optimizer, linear LR decay with warmup

NEFTune embedding noise (alpha=5)
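Two of the training settings above are easy to illustrate in plain Python: a linear warmup followed by linear decay, and NEFTune-style noise, which adds uniform noise in [-1, 1] scaled by alpha / sqrt(L * d) to the token embeddings (L is the sequence length, d the embedding dimension). This is a sketch only; the peak learning rate, step counts, and helper names are hypothetical, and the authors' actual training code is not shown here.

```python
import math
import random


def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate rising linearly to peak_lr, then decaying linearly to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


def neftune_noise(embeddings, alpha, rng):
    """Add uniform noise scaled by alpha / sqrt(seq_len * dim) to each entry."""
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


# Hypothetical schedule values, chosen only for illustration.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4))

# Toy 4-token, 4-dimensional embedding matrix of zeros; alpha=5 as listed above.
rng = random.Random(0)
noisy = neftune_noise([[0.0] * 4 for _ in range(4)], alpha=5, rng=rng)
```

In practice the noise is applied only during training, not at inference time.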

Inference Optimization

KV cache, FlashAttention-2

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://huggingface.co/AI4Chem

Data URLs

https://huggingface.co/AI4Chem

Risks & Boundaries

Limitations

ChemBench focuses on multiple-choice tasks and may not capture free-form generation issues or lab-safety risks

SMILES and IUPAC conversions can be brittle; template-generated dialog may carry template bias

When Not To Use

Do not use as a sole decision-maker for lab protocols or experimental execution without expert verification

Avoid trusting raw molecular designs or synthesis paths without chemical validation and simulation

Failure Modes

Incorrect SMILES/IUPAC generation or syntax errors

Overconfident but wrong property predictions or reaction outcomes
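A cheap first filter for the SMILES syntax failure mode is to check that brackets, branches, and ring-closure labels are balanced before passing output downstream. The function below is a hypothetical, deliberately crude sketch: it does not check valence, aromaticity, or chemical plausibility, so a full cheminformatics toolkit and expert review are still needed.

```python
def smiles_syntax_issues(smiles):
    """Return a list of structural problems found in a SMILES string."""
    issues = []
    depth = 0            # number of currently open '(' branches
    in_bracket = False   # True while inside a [...] atom
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside brackets are isotopes, H counts, or charges.
            if ch == "]":
                in_bracket = False
            elif ch == "[":
                issues.append("nested '['")
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            issues.append("unmatched ']'")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                issues.append("unmatched ')'")
            else:
                depth -= 1
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {label}
            else:
                issues.append("malformed '%' ring label")
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first use opens, second closes
        i += 1
    if in_bracket:
        issues.append("unclosed '['")
    if depth > 0:
        issues.append("unclosed '('")
    if open_rings:
        issues.append("unclosed ring label(s): " + ", ".join(sorted(open_rings)))
    return issues


for s in ("c1ccccc1", "CC(=O)O", "C1CCCC", "CC(C"):
    print(s, smiles_syntax_issues(s))
```

A string that passes this check can still be chemically invalid, so it is best treated as a quick rejection step rather than a validator.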

Core Entities

Models

ChemLLM, InternLM2-Base-7B, InternLM2-Chat-7B, LLaMA2, Mistral, ChatGLM3, Qwen, GPT-3.5, GPT-4

Metrics

Accuracy, cross-entropy loss, training steps/epochs

Datasets

ChemData, Multi-Corpus, FireFly, OpenOrca, UltraChat, ChemBench, MMLU, C-Eval, GSM8K, C-MHChem

Benchmarks

ChemBench, MMLU, C-Eval, GSM8K, C-MHChem
