Overview
Production Readiness
0.55
Novelty Score
0.55
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.
Summary TLDR
This paper introduces Lingshu, a multimodal foundation model for medicine built from a large, curated mix of open-source medical data, general-domain data, and 1.3M synthetic medical samples. The team trains 7B and 32B variants with a four-stage recipe (shallow alignment, deep alignment, instruction tuning, and optional RL with verifiable rewards). They also release MedEvalKit, a unified benchmark suite of 152K+ samples covering multimodal QA, text QA, and report generation. Lingshu-32B achieves an average score of 66.6 across seven multimodal benchmarks, and ablations show that medical text and curated captions are especially valuable. RL with verifiable rewards yields only marginal, mixed gains in this initial study.
Problem Statement
General-purpose multimodal LLMs struggle in medicine because medical vision-text distributions differ from web data, distilled training often misses non-imaging medical knowledge, noisy distillation increases hallucination risk, and existing models lack tailored medical reasoning and unified, reproducible evaluation.
Main Contribution
A large, careful data curation pipeline that collects open medical multimodal/text data plus general-domain sources and synthesizes high-quality captions, VQA, OCR and chain-of-thought (CoT) samples.
Lingshu foundation models (7B and 32B) trained with a four-stage recipe: medical shallow alignment, medical deep alignment, medical instruction tuning, and optional medical-oriented RL (GRPO/RLVR).
MedEvalKit: a unified evaluation framework that consolidates major multimodal/text medical benchmarks (152,066 samples, 121,622 images) and uses rule-based + LLM-as-judge scoring.
Comprehensive experiments showing that Lingshu outperforms prior open-source MLLMs on most evaluated medical QA and report-generation metrics, plus ablations that quantify which data types matter most.
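The "rule-based + LLM-as-judge" scoring mentioned for MedEvalKit can be pictured with a minimal Python sketch. This is not MedEvalKit's actual code: the function names, the answer-extraction patterns, and the `judge` callable are all hypothetical, chosen only to show how cheap rules can be tried first and a judge model consulted only when the rules cannot decide.

```python
import re
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def extract_choice(response: str) -> Optional[str]:
    """Return an option letter (A-E) if the response clearly states one."""
    text = response.strip()
    bare = re.fullmatch(r"\(?([A-E])\)?\.?", text)
    if bare:
        return bare.group(1)
    stated = re.search(r"[Aa]nswer\s+is\s*:?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    return stated.group(1) if stated else None


def score(
    response: str,
    gold: str,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Score one sample: exact match, then option letter, then judge fallback."""
    if normalize(response) == normalize(gold):
        return 1.0
    gold_letter = gold.strip().upper()
    if len(gold_letter) == 1 and gold_letter in "ABCDE":
        choice = extract_choice(response)
        if choice is not None:
            return 1.0 if choice == gold_letter else 0.0
    if judge is not None:
        # Hypothetical hook: a wrapper around an LLM that returns True/False.
        return 1.0 if judge(response, gold) else 0.0
    return 0.0
```

Passing `judge=None` keeps evaluation fully deterministic, which is useful for checking how many samples the rules alone can settle before paying for judge calls.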
Key Findings
Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.
Top multimodal performance: Lingshu-32B average 66.6 across seven medical multimodal benchmarks.
Smaller model competitiveness: Lingshu-7B averages 61.8 on the multimodal benchmarks, 4.5 points above the best open-source baseline under 10B parameters.
Medical-text data is critical: removing 173K medical text samples causes substantial drops across tasks.
RL with verifiable rewards gave mixed, small changes: marginal gains on some benchmarks and declines on others.
Data scaling effect: accuracy rose from roughly 52% to 62% as training data increased to full scale.
Results
Multimodal QA average (Lingshu-32B): 66.6
Multimodal QA average (Lingshu-7B): 61.8
Text-only QA average (Lingshu-32B)
Report generation composite (RadCliQ⁻¹, Lingshu-32B)
Effect of RL (Lingshu-RL-7B vs Lingshu-7B)
Who Should Care
What To Try In 7 Days
Run MedEvalKit on your own or candidate models to get standardized medical benchmark baselines quickly.
Start with shallow visual alignment: freeze the LLM and fine-tune the vision encoder and projector on clean medical captions.
Prioritize collecting and cleaning medical textual instruction and CoT samples: in the ablations, removing a text set of only 173K samples caused substantial drops across tasks.
Agent Features
Tool Use
- GRPO
- GPT-4o for data synthesis
- BiomedCLIP for modality classification
- vLLM for evaluation acceleration
Frameworks
- MedEvalKit
Architectures
- Qwen2.5-VL backbone (vision encoder + LLM + MLP projector)
Collaboration
- Human-in-the-loop for doctor preference annotation and verification
Optimization Features
Token Efficiency
- SFT
Model Optimization
- Unfreeze LLM only in later stages (stage-dependent freezing)
- Vision encoder + projector fine-tuned first
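The stage-dependent freezing above can be written down as a small plan, sketched here in plain Python. The component names follow the Qwen2.5-VL layout listed under Architectures. Treating deep alignment and instruction tuning as training all three components is an assumption, since the summary only says the LLM is unfrozen in later stages; the optional RL stage is left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]
    data_packing: bool = False  # the summary reports packing only for instruction tuning


STAGES: List[Stage] = [
    # Stage 1: LLM stays frozen; only the visual side adapts to medical captions.
    Stage("shallow_alignment", frozenset({"vision_encoder", "projector"})),
    # Stages 2-3: assumed to train every component (not confirmed by the summary).
    Stage("deep_alignment", frozenset(COMPONENTS)),
    Stage("instruction_tuning", frozenset(COMPONENTS), data_packing=True),
]


def frozen_components(stage: Stage) -> List[str]:
    """Components whose parameters should have gradients disabled in this stage."""
    return sorted(set(COMPONENTS) - stage.trainable)


for stage in STAGES:
    print(stage.name, "frozen:", frozen_components(stage))
```

In a real training loop, each stage's frozen list would drive which parameter groups have gradients disabled before the optimizer is built.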
System Optimization
- Chunk-based image deduplication to speed preprocessing
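One way to picture chunk-based deduplication is the Python sketch below. It is an illustration under stated assumptions, not the authors' pipeline: exact SHA-256 hashing of the raw bytes stands in for whatever image signature they use (a perceptual hash would also catch near-duplicates), and the `chunk_size` parameter and function names are hypothetical.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def chunks(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def dedupe_images(images: Iterable[Tuple[str, bytes]], chunk_size: int = 1000) -> List[str]:
    """Return the IDs of images whose content has not been seen before.

    Images are consumed chunk by chunk, so only one chunk of raw bytes is
    held in memory at a time; the set of digests persists across chunks.
    """
    seen = set()
    kept = []
    for chunk in chunks(images, chunk_size):
        for image_id, data in chunk:
            digest = hashlib.sha256(data).hexdigest()  # assumption: exact-match hash
            if digest not in seen:
                seen.add(digest)
                kept.append(image_id)
    return kept
```

Because `images` can be a generator that reads files lazily, the same function works on corpora far larger than memory.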
Training Optimization
- Multi-stage shallow-to-deep training recipe
- Data packing used only during instruction tuning
- AdamW optimizer with cosine LR scheduler and warmup
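The cosine schedule with warmup named above follows a standard shape, sketched here as a plain Python function. The linear warmup form, the `min_lr` floor, and every hyperparameter value are assumptions for illustration; the summary does not state the settings actually used.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Learning rate with linear warmup followed by cosine decay to `min_lr`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only: 100 steps, 10 of them warmup, peak LR 1e-4.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, total_steps=100, warmup_steps=10, peak_lr=1e-4))
```

The value returned for each step would be assigned to the optimizer's learning rate before that step's update.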
Inference Optimization
- vLLM for high-throughput evaluation
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Open-source multimodal medical data still noisy: low-resolution images, annotation errors, uneven modality coverage.
- Performance still lags top proprietary systems on some clinical reasoning tasks.
- Reinforcement learning with verifiable rewards is preliminary and sensitive to reward design and data selection.
- Generalization to WSI, 3D imaging, genomics, and full clinical workflows remains to be proven.
When Not To Use
- Do not use Lingshu as an autonomous diagnostic system without expert oversight and validation.
- Avoid deploying in high-stakes, legally regulated clinical decisions without certified evaluation.
- Not suitable where access to private, proprietary patient data is required without further adaptation.
Failure Modes
- Hallucinations from synthetic or model-distilled training signals.
- Overconfidence on out-of-distribution images or rare pathologies.
- Unreliable benchmark results when training and evaluation data overlap or are contaminated.
- Reward misspecification in RL causing optimization toward format rather than correct reasoning.
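The reward-misspecification risk is easier to see with a concrete verifiable reward. The Python sketch below separates a format term from a correctness term and then applies GRPO's group-relative normalization, which scores each sampled response against the mean and standard deviation of its own group. The `<answer>` tag convention, the `format_weight` value, and the function names are assumptions, not the paper's reward design.

```python
import re
from statistics import mean, pstdev
from typing import List

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response contains an <answer>...</answer> span, else 0.0."""
    return 1.0 if ANSWER_RE.search(response) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    match = ANSWER_RE.search(response)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str, format_weight: float = 0.1) -> float:
    """Weighted sum; a large format_weight rewards layout over correctness."""
    return (format_weight * format_reward(response)
            + (1.0 - format_weight) * accuracy_reward(response, gold))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason data selection matters for this kind of training.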
Core Entities
Models
- Lingshu-7B
- Lingshu-32B
- Lingshu-RL
- Qwen2.5-VL-Instruct (backbone)
- GPT-4.1
- Claude Sonnet 4
- Gemini-2.5-Flash
Metrics
- Accuracy
- ROUGE-L
- CIDEr
- SembScore
- RaTEScore
- RadCliQ-v1
Datasets
- ROCOv2
- LLaVA-Med
- PubMedVision
- MIMIC-CXR
- PMC-OA
- Quilt-LLaVA
- MedICaT
- MedPix-2.0
Benchmarks
- VQA-RAD
- SLAKE
- PathVQA
- PMC-VQA
- OmniMedVQA
- MMMU
- MedXpertQA
- MIMIC-CXR
- IU-Xray
- CheXpert Plus
Context Entities
Models
- InternVL2.5
- InternVL3
- MedGemma
- HuatuoGPT-V
- BioMediX2
Metrics
- ROUGE-L
- CIDEr
- RadCliQ
- ReXrank
Datasets
- PathVQA
- VQA-Med-2019
- ROCO
- ROCOv2
- OmniMedVQA
Benchmarks
- MedQA-USMLE
- PubMedQA
- MedMCQA
- MMLU (medical subset)
- SuperGPQA

