PharmaGPT: 13B–70B domain LLMs that outperform general models on pharmacy and chemistry tests

June 26, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides concrete training recipes, token counts, and benchmark results but relies on proprietary data and does not release code or datasets, so engineering reproduction requires internal resources or similar corpora.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Zhongkai Ye, Lidong Pei, Changyang Tu

Why It Matters For Business

Focused domain models deliver near-GPT-4 quality on bio-pharma tasks with fewer resources, enabling faster, cheaper deployment for search, translation, tutoring, and R&D assistants. Validate before clinical use.

Summary TLDR

PharmaGPT is a set of domain-specific multilingual language models (3B, 13B, and 70B parameters) trained on a large, curated biomedical and chemistry corpus (stage 1: 153B tokens; stage 2: 43B tokens). The authors add a 55,296-token tokenizer, instruction finetuning, and RLHF (50k expert comparisons). On professional exams (NAPLEX: 66–76 per section; Chinese Pharmacist Examination: roughly 70–80%) and MMLU, PharmaGPT 0.7 outperforms GPT-3.5 and matches or beats GPT-4 on some biomedical topics. The paper documents dataset curation, the training recipe, and evaluation but does not release code or data.

Problem Statement

General-purpose LLMs lack the depth and precise terminology needed for bio-pharmaceutical and chemistry tasks. Practitioners need smaller, focused models trained on curated domain corpora to improve accuracy on professional exams, translation, and domain QA.

Main Contribution

Build and evaluate the PharmaGPT family (3B trained from scratch; 13B and 70B post-trained from the LLaMA series).

Assemble a large domain corpus (stage 1: 153B tokens; stage 2: 43B tokens) concentrated on biomedical text, patents, papers, exams, and supervised instruction data.

Key Findings

PharmaGPT 0.7 scores 66–76 on NAPLEX sections, outperforming earlier PharmaGPT versions and GPT-3.5-turbo.

Numbers: NAPLEX I/II/III = 66 / 68 / 76 (PharmaGPT 0.7) [Table 4]

Practical Use: A domain-trained LLM with far fewer params can give exam-level knowledge useful for retrieval, tutoring, and test automation; validate before clinical use.

Evidence Ref: Table 4, Fig 5

PharmaGPT 0.7 achieves higher biomedical translation BLEU scores than GPT-3.5, Claude 3, and Google on the tested set.

Numbers: BLEU paragraph/sentence/word = 30 / 18 / 10 vs GPT-3.5 27 / 15 / 8 [Fig 7]

Practical Use: Use PharmaGPT for higher-quality domain translations (papers, reports), then post-edit with experts.

Evidence Ref: Figure 7
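
For readers unfamiliar with the metric behind the translation numbers, here is a minimal sketch of sentence-level BLEU. It assumes whitespace tokenization, a single reference, and no smoothing; the function names are illustrative, and this is not the paper's evaluation script. The score is on a 0–1 scale, so multiply by 100 to compare with the figures quoted above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens (0.0 to 1.0)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


score = bleu("the drug inhibits the enzyme strongly",
             "the drug inhibits the enzyme")
```

Whitespace tokenization does not work for Chinese text, which would need character-level or segmenter-based tokens before scoring.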

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| NAPLEX (PharmaGPT 0.7) | I 66; II 68; III 76 | PharmaGPT 0.5 = I 57; II 59; III 58 | I +9; II +9; III +18 | NAPLEX sections | Table 4; Fig 5 | Table 4 |
| Chinese Pharmacist Exam (PharmaGPT 0.7) | overall categories ≈ 70–80% | GPT-3.5 and in places GPT-4 (lower on some categories) | PharmaGPT outperforms GPT-3.5 and exceeds GPT-4 in some categories (reported) | Chinese pharmacist categories | Figure 6; Section 4.2 | Figure 6 |
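
The NAPLEX deltas are per-section differences between PharmaGPT 0.7 and PharmaGPT 0.5. A short check of that arithmetic, using only the scores reported above (the variable names are illustrative):

```python
# NAPLEX section scores as reported in the Results row above.
pharmagpt_07 = {"I": 66, "II": 68, "III": 76}
pharmagpt_05 = {"I": 57, "II": 59, "III": 58}

# Delta = newer version minus the baseline version, per section.
deltas = {section: pharmagpt_07[section] - pharmagpt_05[section]
          for section in pharmagpt_07}

for section, delta in deltas.items():
    print(f"NAPLEX {section}: {delta:+d}")
```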

What To Try In 7 Days

Run a small continued-pretraining pass on your domain docs (roughly 10–50B tokens) starting from a LLaMA checkpoint.

Add domain-specific tokens via SentencePiece and expand vocabulary for jargon-heavy languages.

Finetune an existing LLaMA/Alpaca-style model on a few thousand in-domain instruction pairs and sample outputs for review by experts.
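
For the vocabulary-expansion step, the core bookkeeping is to append new domain tokens after the base vocabulary so that existing token IDs (and their pretrained embeddings) stay valid. The sketch below is a toy illustration in plain Python with a hypothetical `expand_vocab` helper; it does not use the SentencePiece API. A real pipeline would train a domain tokenizer, merge its pieces into the base model's tokenizer, and resize the embedding matrix by the number of added tokens.

```python
def expand_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Return a copy of base_vocab with unseen tokens appended at fresh IDs.

    Hypothetical helper for illustration; existing IDs are never changed.
    """
    expanded = dict(base_vocab)
    next_id = max(expanded.values(), default=-1) + 1
    for token in new_tokens:
        if token not in expanded:  # skip tokens the base vocabulary already has
            expanded[token] = next_id
            next_id += 1
    return expanded


# Toy base vocabulary and candidate domain terms (made-up examples).
base = {"<unk>": 0, "the": 1, "drug": 2}
candidates = ["pharmacokinetics", "drug", "抑制剂"]

expanded = expand_vocab(base, candidates)
# Number of new embedding rows the model would need.
num_new_rows = len(expanded) - len(base)
```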

Optimization Features

Token Efficiency
BPE/SentencePiece tokenizer optimized for Chinese and domain terms
Infra Optimization
tensor parallelism TP=8, pipeline PP up to 16 noted in training table
Model Optimization
post-training from LLaMA for 13B/70B; vocabulary expansion to handle domain terms
System Optimization
data deduplication and privacy-focused redaction in preprocessing
Training Optimization
two-stage continued pretraining (153B + 43B tokens); instruction finetuning with weighted loss and zeroed user-instruction tokens; RLHF with PPO and a dedicated reward model
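
To illustrate what "weighted loss and zeroed user-instruction tokens" can mean in practice, here is a minimal sketch that averages the negative log-likelihood over response tokens only, with optional per-token weights. The function name, the mask layout, and the weighting scheme are assumptions for illustration; the summary does not specify how the paper sets its weights.

```python
def masked_weighted_nll(token_logprobs, is_response, weights=None):
    """Weighted mean negative log-likelihood over response tokens.

    token_logprobs: log-probability the model assigned to each target token.
    is_response:    True for assistant-response tokens, False for user-instruction
                    tokens, whose loss contribution is zeroed out.
    weights:        optional per-token weights (assumed scheme; defaults to 1.0).
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)
    weighted_sum = 0.0
    weight_total = 0.0
    for logprob, response, weight in zip(token_logprobs, is_response, weights):
        effective = weight if response else 0.0  # instruction tokens get weight 0
        weighted_sum += -logprob * effective
        weight_total += effective
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two instruction tokens followed by two response tokens (toy values).
logprobs = [-2.0, -1.0, -0.5, -0.5]
mask = [False, False, True, True]
loss = masked_weighted_nll(logprobs, mask)
```

Only the two response tokens contribute, so the poorly predicted instruction tokens do not affect the gradient signal.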

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Proprietary dataset and no public code/data in paper limit reproducibility.

Potential biases from domain sources and language focus (mainly Chinese and English).

When Not To Use

As an authoritative clinical decision tool without external validation and human oversight.

In languages or subdomains not well covered by the training corpus.

Failure Modes

Hallucinations on novel chemical or clinical scenarios despite RAG mitigation.

Overconfidence on borderline or low-evidence topics.

Core Entities

Models

PharmaGPT-3B
PharmaGPT-13B
PharmaGPT-70B
PharmaGPT versions 0.1/0.3/0.5/0.7

Metrics

Exam percent scores (%, NAPLEX/Chinese exam)
BLEU (translation)
Accuracy

Datasets

Proprietary bio-pharma corpus (stage 1: 153B tokens; stage 2: 43B tokens)
Instruction finetuning data (several hundred thousand prompts)
RLHF preference dataset (50,000 expert comparisons)
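
Expert comparisons of this kind are commonly used to train a reward model with a pairwise (Bradley-Terry style) objective: the reward for the preferred answer should exceed the reward for the rejected one. The sketch below shows that standard loss as an assumption about the general technique; the summary does not state the exact objective the authors used, and the function name is illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) in a stable way."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the reward model ranks the expert-preferred answer higher.
loss_tied = pairwise_preference_loss(0.0, 0.0)
loss_correct = pairwise_preference_loss(2.0, -1.0)
loss_wrong = pairwise_preference_loss(-1.0, 2.0)
```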

Benchmarks

NAPLEX (North American Pharmacist Licensure Examination)
Chinese Pharmacist Examination
MMLU
Biomedical translation BLEU test

Context Entities

Datasets

CommonCrawl-derived web/news/patent/paper corpora (as used in stage 1)
Specialized sources: patents, conference proceedings, exam banks, medRxiv/bioRxiv