PharmaGPT: 13B–70B domain LLMs that outperform general models on pharmacy and chemistry tests

Overview

Decision Snapshot: Needs Validation

The paper provides concrete training recipes, token counts, and benchmark results but relies on proprietary data and does not release code or datasets, so engineering reproduction requires internal resources or similar corpora.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Zhongkai Ye, Lidong Pei, Changyang Tu

Why It Matters For Business

Focused domain models give near–GPT-4 quality on bio-pharma tasks with fewer resources, enabling faster, cheaper deployment for search, translation, tutoring, and R&D assistants; validate before clinical use.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

PharmaGPT is a family of domain-specific multilingual language models (3B, 13B, and 70B parameters) trained on a large, curated biomedical and chemistry corpus (153B tokens in stage 1, 43B in stage 2). The authors add a 55,296-token tokenizer, instruction finetuning, and RLHF (50k expert comparisons). The models are evaluated on professional exams (NAPLEX, Chinese Pharmacist Examination) and MMLU. PharmaGPT 0.7 scores 66–76 on NAPLEX sections and roughly 70–80% on Chinese Pharmacist Examination categories, outperforms GPT-3.5, and matches or beats GPT-4 on some biomedical topics. The paper documents dataset curation, the training recipe, and evaluation but does not release code or data.

Problem Statement

General-purpose LLMs lack the depth and precise terminology needed for bio-pharmaceutical and chemistry tasks. Practitioners need smaller, focused models trained on curated domain corpora to improve accuracy on professional exams, translation, and domain QA.

Main Contribution

Build and evaluate the PharmaGPT family (3B trained from scratch; 13B and 70B post-trained from the LLaMA series).

Assemble a large domain corpus (153B tokens in stage 1, 43B in stage 2) drawn mainly from biomedical text, patents, papers, exams, and supervised instruction data.

Key Findings

PharmaGPT 0.7 scores 66–76 on NAPLEX sections, outperforming earlier PharmaGPT versions and GPT-3.5-turbo.

Numbers: NAPLEX I/II/III = 66 / 68 / 76 (PharmaGPT 0.7) [Table 4]

Practical Use: A domain-trained LLM with far fewer parameters can provide exam-level knowledge useful for retrieval, tutoring, and test automation; validate before clinical use.

Evidence Ref: Table 4, Fig 5

PharmaGPT 0.7 achieves better biomedical translation BLEU scores than GPT-3.5, Claude 3, and Google on the tested set.

Numbers: BLEU paragraph/sentence/word = 30 / 18 / 10 vs GPT-3.5 27 / 15 / 8 [Fig 7]

Practical Use: Use PharmaGPT for higher-quality domain translations (papers, reports), then have experts post-edit the output.

Evidence Ref: Figure 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
NAPLEX (PharmaGPT 0.7)	I 66; II 68; III 76	PharmaGPT 0.5 = I 57; II 59; III 58	I +9; II +9; III +18	NAPLEX sections	Table 4; Fig 5	Table 4
Chinese Pharmacist Exam (PharmaGPT 0.7)	overall categories ≈ 70–80%	GPT-3.5 and in places GPT-4 (lower on some categories)	PharmaGPT outperforms GPT-3.5 and exceeds GPT-4 in some categories (reported)	Chinese pharmacist categories	Figure 6; Section 4.2	Figure 6
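
The NAPLEX deltas in the table can be re-derived from the reported section scores. The following is a minimal sketch that uses only the numbers quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# NAPLEX section scores as quoted in the Results table (Table 4 of the paper).
pharmagpt_07 = {"I": 66, "II": 68, "III": 76}
pharmagpt_05 = {"I": 57, "II": 59, "III": 58}


def section_deltas(new, baseline):
    """Return per-section point differences (new minus baseline)."""
    return {section: new[section] - baseline[section] for section in new}


deltas = section_deltas(pharmagpt_07, pharmagpt_05)
for section, delta in deltas.items():
    print(f"NAPLEX {section}: {delta:+d} points")
```

The result reproduces the Delta column: +9, +9, and +18 points for sections I, II, and III.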

What To Try In 7 Days

Run a small continued-pretraining pass on your domain documents (the equivalent of 10–50B tokens), starting from a LLaMA checkpoint.

Add domain-specific tokens via SentencePiece and expand vocabulary for jargon-heavy languages.

Finetune an existing LLaMA/Alpaca-style model on a few thousand in-domain instruction pairs and sample outputs for review by experts.
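
When finetuning on instruction pairs, this summary attributes to the paper a weighted loss in which user-instruction tokens are zeroed out, so only response tokens drive the gradient. The exact weighting scheme is not described here, so the sketch below is an assumption: it takes per-token losses that a training framework would already have computed, gives prompt tokens weight 0, and averages over the response tokens. The function name and the `response_weight` parameter are hypothetical.

```python
def masked_weighted_loss(token_losses, is_response, response_weight=1.0):
    """Weighted mean of per-token losses, ignoring user-instruction tokens.

    token_losses: per-token loss values for one training sequence.
    is_response:  True where the token belongs to the model's answer,
                  False where it belongs to the user instruction.
    response_weight: hypothetical weight for response tokens (assumption).
    """
    if len(token_losses) != len(is_response):
        raise ValueError("token_losses and is_response must have equal length")
    weights = [response_weight if flag else 0.0 for flag in is_response]
    total_weight = sum(weights)
    if total_weight == 0:
        # Sequence with no response tokens contributes nothing.
        return 0.0
    weighted_sum = sum(loss * w for loss, w in zip(token_losses, weights))
    return weighted_sum / total_weight


# Two instruction tokens followed by two response tokens.
example_losses = [2.0, 1.0, 0.5, 1.5]
example_mask = [False, False, True, True]
example_loss = masked_weighted_loss(example_losses, example_mask)
print(example_loss)
```

In the example only the last two tokens count, so the loss is the mean of 0.5 and 1.5. In a real framework the same effect is usually obtained by setting the label of masked positions to the framework's ignore value.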

Optimization Features

Token Efficiency

BPE/SentencePiece tokenizer optimized for Chinese and domain terms
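
Before extending a tokenizer, it can help to list frequent domain terms that the existing vocabulary lacks. The sketch below is a hypothetical pre-selection step using only the standard library; it is not the paper's procedure. The regular expression, the `min_count` threshold, and the `base_vocab` set are all assumptions chosen for illustration, and the resulting list would still need to be passed to an actual tokenizer-training tool.

```python
import re
from collections import Counter

# Assumption: a simple pattern for Latin-script terms such as "pd-1".
TERM_PATTERN = re.compile(r"[a-z][a-z0-9-]+")


def candidate_terms(docs, base_vocab, min_count=2):
    """Return frequent terms missing from base_vocab, most frequent first."""
    counts = Counter()
    for doc in docs:
        counts.update(TERM_PATTERN.findall(doc.lower()))
    return [
        term
        for term, count in counts.most_common()
        if count >= min_count and term not in base_vocab
    ]


docs = [
    "Pembrolizumab binds PD-1.",
    "Dose of pembrolizumab and the PD-1 pathway.",
    "The dose was reduced.",
]
base_vocab = {"the", "dose", "of", "and", "was"}  # hypothetical existing vocabulary
new_terms = candidate_terms(docs, base_vocab)
print(new_terms)
```

Chinese text would need a different segmentation step, since it is not whitespace-delimited; this sketch only covers the Latin-script case.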

Infra Optimization

Tensor parallelism (TP) of 8 and pipeline parallelism (PP) of up to 16, as noted in the training table
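
To get a feel for what these settings imply for hardware, the sketch below assumes the usual 3D-parallel layout in which one model replica spans TP × PP GPUs and the total GPU count is TP × PP × DP. The data-parallel degree is not given in this summary, so `DP` is a placeholder value.

```python
def gpus_per_replica(tensor_parallel, pipeline_parallel):
    """GPUs needed to hold one full copy of the model."""
    return tensor_parallel * pipeline_parallel


def world_size(tensor_parallel, pipeline_parallel, data_parallel):
    """Total GPUs when the model copy is replicated data_parallel times."""
    return gpus_per_replica(tensor_parallel, pipeline_parallel) * data_parallel


TP = 8   # tensor-parallel degree quoted in the summary
PP = 16  # upper bound on pipeline-parallel degree quoted in the summary
DP = 4   # hypothetical data-parallel degree (not stated in the summary)

print(gpus_per_replica(TP, PP))
print(world_size(TP, PP, DP))
```

With TP = 8 and PP = 16, a single replica occupies 128 GPUs; the placeholder DP = 4 would bring the total to 512.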

Model Optimization

post-training from LLaMA for 13B/70B

vocabulary expansion to handle domain terms

System Optimization

data deduplication and privacy-focused redaction in preprocessing
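
The summary does not say how deduplication or redaction was implemented. As a rough illustration only, the sketch below removes exact duplicates after whitespace and case normalization and masks e-mail addresses with a simple regular expression. Both the normalization rule and the pattern are assumptions; a production pipeline would typically add near-duplicate detection and broader PII coverage.

```python
import hashlib
import re

# Assumption: a deliberately simple e-mail pattern for illustration.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace e-mail addresses with a placeholder token."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def dedupe(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


raw_docs = [
    "Contact a.b@example.com for data.",
    "contact  A.B@example.com for data.",
    "Aspirin inhibits COX-1.",
]
cleaned = [redact(doc) for doc in dedupe(raw_docs)]
print(cleaned)
```

The second document differs from the first only in case and spacing, so it is dropped before redaction.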

Training Optimization

two-stage continued pretraining (153B + 43B tokens)

instruction finetuning with weighted loss and zeroed user-instruction tokens

RLHF with PPO and a dedicated reward model
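
A dedicated reward model is usually trained on pairwise comparisons before PPO. One common objective is the Bradley-Terry style loss, -log sigmoid(r_chosen - r_rejected). Whether the paper uses exactly this objective is not stated in this summary, so the sketch below should be read as an assumption; the function name and the scalar-reward interface are illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Computed as softplus(-margin) in a numerically stable form so that
    large margins in either direction do not overflow.
    """
    x = -(reward_chosen - reward_rejected)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss is small when the preferred answer scores higher, large otherwise.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)
loss_tie = pairwise_preference_loss(1.0, 1.0)
print(loss_correct_order, loss_wrong_order, loss_tie)
```

A tie gives a loss of log 2, and the loss shrinks toward zero as the reward model separates the expert-preferred answer from the rejected one.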

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Proprietary dataset and no public code/data in paper limit reproducibility.

Potential biases from domain sources and language focus (mainly Chinese and English).

When Not To Use

As an authoritative clinical decision tool without external validation and human oversight.

In languages or subdomains not well covered by the training corpus.

Failure Modes

Hallucinations on novel chemical or clinical scenarios despite RAG mitigation.

Overconfidence on borderline or low-evidence topics.

Core Entities

Models

PharmaGPT-3B

PharmaGPT-13B

PharmaGPT-70B

PharmaGPT versions 0.1/0.3/0.5/0.7

Metrics

Exam percent scores (%, NAPLEX/Chinese exam)

BLEU (translation)

Accuracy

Datasets

Proprietary bio-pharma corpus (stage 1: 153B tokens; stage 2: 43B tokens)

Instruction finetuning data (several hundred thousand prompts)

RLHF preference dataset (50,000 expert comparisons)

Benchmarks

NAPLEX (North American Pharmacist Licensure Examination)

Chinese Pharmacist Examination

MMLU

Biomedical translation BLEU test

Context Entities

Datasets

CommonCrawl-derived web/news/patent/paper corpora (as used in stage 1)

Specialized sources: patents, conference proceedings, exam banks, medRxiv/bioRxiv
