PharmaGPT: 13B–70B domain LLMs that outperform general models on pharmacy and chemistry tests

June 26, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Zhongkai Ye, Lidong Pei, Changyang Tu

Why It Matters For Business

Focused domain models give near–GPT-4 quality on bio-pharma tasks with fewer resources, enabling faster, cheaper deployment for search, translation, tutoring, and R&D assistants; validate before clinical use.

Summary TLDR

PharmaGPT is a family of domain-specific multilingual language models (3B, 13B, and 70B parameters) trained on a large, curated biomedical and chemistry corpus (153B tokens in stage 1, 43B in stage 2). The authors extend the tokenizer to 55,296 tokens and apply instruction finetuning and RLHF (50k expert comparisons). PharmaGPT 0.7 scores 66–76 on NAPLEX sections, roughly 70–80% on the Chinese Pharmacist Exam, and roughly 80–90% on many MMLU tasks; it outperforms GPT-3.5 and matches or beats GPT-4 on some biomedical topics. The paper documents dataset curation, the training recipe, and evaluation, but does not release code or data.

Problem Statement

General-purpose LLMs lack the depth and precise terminology needed for bio-pharmaceutical and chemistry tasks. Practitioners need smaller, focused models trained on curated domain corpora to improve accuracy on professional exams, translation, and domain QA.

Main Contribution

Build and evaluate the PharmaGPT family (3B trained from scratch; 13B and 70B post-trained from LLaMA-series models).

Assemble a large domain corpus (stage 1: 153B tokens; stage 2: 43B tokens) drawn mainly from biomedical text, patents, papers, exams, and supervised instruction data.

Extend the tokenizer to 55,296 tokens (+23,296) to better handle Chinese and domain terms.

Apply instruction finetuning and RLHF, using a 50k expert-preference dataset and PPO to align outputs.

Benchmark comprehensively on NAPLEX, the Chinese Pharmacist Exam, MMLU, and biomedical translation (BLEU).

Key Findings

PharmaGPT 0.7 scores 66–76 on NAPLEX sections, outperforming earlier PharmaGPT versions and GPT-3.5-turbo.

Numbers: NAPLEX I/II/III = 66 / 68 / 76 (PharmaGPT 0.7) [Table 4]

PharmaGPT 0.7 achieves higher biomedical translation BLEU scores than GPT-3.5, Claude 3, and Google on the tested set.

Numbers: BLEU paragraph/sentence/word = 30 / 18 / 10 vs. GPT-3.5 27 / 15 / 8 [Fig 7]

MMLU and specialized tests place PharmaGPT mostly in the 80–90% range and close to or above GPT-4 on some biomedical topics.

Numbers: MMLU general / biomedical tasks ≈ 80–90% (reported range)

Training used large, staged domain pretraining and tokenizer expansion.

Numbers: pretraining tokens = 153B (stage 1) + 43B (stage 2); vocab = 55,296 (+23,296)

Human preferences and RLHF were used at scale to align the model.

Numbers: RLHF dataset = 50,000 expert-ranked instruction responses; PPO used for optimization
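
This summary does not state which objective the reward model was trained with. A common choice for learning from ranked comparisons is the pairwise (Bradley-Terry) loss, so the sketch below assumes that formulation; the function names and the toy reward values are hypothetical and do not come from the paper.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)), computed without overflow."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never sees a large positive value.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs) -> float:
    """Average the pairwise loss over (chosen_reward, rejected_reward) tuples."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Toy reward-model scores for three expert comparisons (hypothetical values).
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
print(round(mean_preference_loss(example_pairs), 4))
```

The loss shrinks as the reward model scores the expert-preferred response further above the rejected one, which is the signal a PPO stage would then optimize against.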

Results

NAPLEX (PharmaGPT 0.7)

Value: I 66; II 68; III 76

Baseline: PharmaGPT 0.5 = I 57; II 59; III 58

Chinese Pharmacist Exam (PharmaGPT 0.7)

Value: ≈ 70–80% across categories

Baseline: GPT-3.5 scores lower; GPT-4 scores lower in some categories

Biomedical translation (BLEU)

Value: Paragraph 30; Sentence 18; Word 10 (PharmaGPT 0.7)

Baseline: GPT-3.5: 27/15/8; Claude 3: 26/16/9; Google: 27/16/9

MMLU

Value: ≈ 80–90% on many tasks; strong in biomedical topics

Baseline: GPT-3.5 lower; GPT-4 comparable on many general tasks
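
For readers unfamiliar with the BLEU scores reported for the translation task, here is a minimal single-reference sketch of the metric: clipped n-gram precision up to 4-grams, combined with a brevity penalty and scaled to 0–100. The summary does not say which BLEU implementation, tokenization, or smoothing the authors used, so whitespace tokenization and no smoothing are assumptions here, and the example sentences are invented.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU on a 0-100 scale (whitespace tokens assumed)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


# Invented example: the candidate drops the final word of the reference.
print(round(bleu("the drug inhibits the enzyme",
                 "the drug inhibits the enzyme strongly"), 2))
```

Because the score is zero whenever any n-gram order has no match, unsmoothed BLEU is harsh on very short segments, which is one plausible reason word-level scores sit below paragraph-level ones.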

Who Should Care

What To Try In 7 Days

Run a small continued-pretraining pass on your domain documents (the equivalent of roughly 10–50B tokens), starting from a LLaMA checkpoint.

Add domain-specific tokens via SentencePiece and expand vocabulary for jargon-heavy languages.

Finetune an existing LLaMA/Alpaca-style model on a few thousand in-domain instruction pairs, then sample outputs for expert review.
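
As a starting point for the instruction-finetuning step, the sketch below turns raw (instruction, response) pairs into JSONL records using the widely used Alpaca no-input prompt template. The record field names (`prompt`, `completion`) and the sample pairs are assumptions for illustration; adapt them to whatever your training framework expects.

```python
import json

# Alpaca-style template for instructions without a separate input field.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_training_record(instruction: str, response: str) -> dict:
    """Build one prompt/completion record (field names are hypothetical)."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction.strip()),
        "completion": response.strip(),
    }


def to_jsonl(pairs) -> str:
    """Serialize (instruction, response) pairs, one JSON object per line."""
    return "\n".join(
        json.dumps(to_training_record(instruction, response), ensure_ascii=False)
        for instruction, response in pairs
    )


# Placeholder pairs; real data would be written or vetted by domain experts.
sample_pairs = [
    ("Define the term 'half-life' for a pharmacy student.", "<expert-written answer>"),
    ("Translate this label text into Chinese: ...", "<expert-written translation>"),
]
print(to_jsonl(sample_pairs))
```

Keeping `ensure_ascii=False` preserves Chinese text verbatim in the file, which makes expert review of the data easier.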

Optimization Features

Token Efficiency

  • BPE/SentencePiece tokenizer optimized for Chinese and domain terms
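
The reported vocabulary of 55,296 tokens with 23,296 additions implies a 32,000-token base vocabulary, which matches the original LLaMA tokenizer. The toy sketch below illustrates why adding domain terms improves token efficiency. It uses greedy longest-match segmentation rather than real BPE or SentencePiece, and the tiny vocabularies are invented, so treat it as an illustration of the effect only.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation; characters not in vocab become single tokens."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Arithmetic from the reported figures: final size minus added tokens.
BASE_VOCAB_SIZE = 55_296 - 23_296

# Invented toy vocabularies (not the paper's tokenizer contents).
base_vocab = {"acet", "amino", "phen", "para", "ol"}
domain_terms = {"acetaminophen", "paracetamol"}

before = greedy_tokenize("acetaminophen", base_vocab)
after = greedy_tokenize("acetaminophen", base_vocab | domain_terms)
print(BASE_VOCAB_SIZE, before, after)
```

In this toy case the drug name shrinks from three tokens to one. With a real model, each added token also needs a new embedding row that is learned during continued pretraining.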

Infra Optimization

  • Tensor parallelism (TP=8) and pipeline parallelism (PP up to 16), as noted in the training table
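
To make the parallelism figures concrete: in the usual 3D-parallel layout, one model replica occupies TP × PP GPUs, and any remaining GPUs form data-parallel replicas. The sketch below works through that arithmetic. The cluster size of 256 GPUs is a hypothetical value, since this summary does not report the hardware used.

```python
def gpus_per_replica(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs needed to hold one copy of the model."""
    return tensor_parallel * pipeline_parallel


def data_parallel_replicas(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas that fit in a cluster of world_size GPUs."""
    group = gpus_per_replica(tensor_parallel, pipeline_parallel)
    if world_size % group != 0:
        raise ValueError(f"world_size {world_size} is not a multiple of TP*PP = {group}")
    return world_size // group


# TP=8 and PP=16 come from the summary; 256 GPUs is an assumed cluster size.
print(gpus_per_replica(8, 16))             # 128
print(data_parallel_replicas(256, 8, 16))  # 2
```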

Model Optimization

  • post-training from LLaMA for 13B/70B
  • vocabulary expansion to handle domain terms

System Optimization

  • data deduplication and privacy-focused redaction in preprocessing
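
The summary does not describe how deduplication or redaction was implemented. As one simple stand-in, the sketch below removes exact duplicates by hashing whitespace- and case-normalized text, and masks email addresses with a regular expression. Both choices, along with the `[EMAIL]` placeholder, are assumptions; production pipelines typically add near-duplicate detection and broader PII patterns.

```python
import hashlib
import re

# Simplified email pattern; real redaction would cover more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace email addresses with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


# Invented sample documents.
docs = [
    "Aspirin  inhibits COX.",
    "aspirin inhibits cox.",
    "Contact jane.doe@example.com for the trial protocol.",
]
cleaned = [redact_emails(doc) for doc in deduplicate(docs)]
print(cleaned)
```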

Training Optimization

  • two-stage continued pretraining (153B + 43B tokens)
  • instruction finetuning with a weighted loss in which user-instruction tokens receive zero weight
  • RLHF with PPO and a dedicated reward model
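
The zero-weighting of user-instruction tokens mentioned above can be sketched as a weighted average of per-token losses, where only response tokens contribute. The per-token loss values, the role labels, and the function names are hypothetical; the paper's exact weighting scheme is not given in this summary.

```python
def build_loss_weights(roles, assistant_weight: float = 1.0):
    """Weight assistant tokens, zero out everything else (e.g. user instructions)."""
    return [assistant_weight if role == "assistant" else 0.0 for role in roles]


def masked_token_loss(token_losses, weights) -> float:
    """Weighted mean of per-token negative log-likelihoods."""
    if len(token_losses) != len(weights):
        raise ValueError("token_losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no supervised tokens in this sequence
    return sum(loss * w for loss, w in zip(token_losses, weights)) / total_weight


# Toy sequence: two instruction tokens followed by two response tokens.
roles = ["user", "user", "assistant", "assistant"]
token_losses = [2.0, 4.0, 1.0, 3.0]  # hypothetical per-token NLL values

weights = build_loss_weights(roles)
print(masked_token_loss(token_losses, weights))  # 2.0
```

Only the two response tokens count, so the loss is (1.0 + 3.0) / 2 rather than the unmasked mean of 2.5. The model is therefore trained to produce answers, not to reproduce prompts.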

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Proprietary dataset and no public code/data in paper limit reproducibility.
  • Potential biases from domain sources and language focus (mainly Chinese and English).
  • Evaluation uses exams and benchmarks but lacks clinical prospective validation.
  • Claims of outperforming GPT-4 are limited to some biomedical categories and specific tests.

When Not To Use

  • As an authoritative clinical decision tool without external validation and human oversight.
  • In languages or subdomains not well covered by the training corpus.
  • Where full reproducibility and open-source release are required.

Failure Modes

  • Hallucinations on novel chemical or clinical scenarios, even with retrieval-augmented generation (RAG) as mitigation.
  • Overconfidence on borderline or low-evidence topics.
  • Privacy leakage if redaction misses sensitive items in proprietary corpora.

Core Entities

Models

  • PharmaGPT-3B
  • PharmaGPT-13B
  • PharmaGPT-70B
  • PharmaGPT versions 0.1/0.3/0.5/0.7

Metrics

  • Exam scores in percent (NAPLEX, Chinese Pharmacist Examination)
  • BLEU (translation)
  • Accuracy

Datasets

  • Proprietary bio-pharma corpus (stage1 153B tokens, stage2 43B tokens)
  • Instruction finetuning data (several hundred thousand prompts)
  • RLHF preference dataset (50,000 expert comparisons)

Benchmarks

  • NAPLEX (North American Pharmacist Licensure Examination)
  • Chinese Pharmacist Examination
  • MMLU
  • Biomedical translation BLEU test

Context Entities

Datasets

  • CommonCrawl-derived web/news/patent/paper corpora (as used in stage1)
  • Specialized sources: patents, conference proceedings, exam banks, medRxiv/bioRxiv