Lingshu: a medical multimodal foundation model trained on curated medical+general data with MedEvalKit evaluation

June 8, 2025 · 10 min

Overview

Decision Snapshot: Needs Validation

Well-documented staged training and ablations give clear guidance for practitioners, but final medical-grade deployment requires more expert validation and higher-quality, audited datasets.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 55%

Novelty: 55%

Authors

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Why It Matters For Business

A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.

Summary TLDR

This paper introduces Lingshu, a multimodal foundation model for medicine built from a large, curated mix of open-source medical data, general-domain data, and 1.3M synthetic medical samples. The team trains 7B and 32B variants with a four-stage recipe (shallow alignment, deep alignment, instruction tuning, and optional RL with verifiable rewards). They also release MedEvalKit, a unified benchmark suite of 152K+ samples covering multimodal QA, text QA, and report generation. Lingshu-32B averages 66.6 across seven medical multimodal benchmarks. Ablations show that medical text and curated captions are especially valuable, while RL with verifiable rewards gives only marginal, mixed gains in this initial study.

Problem Statement

General-purpose multimodal LLMs struggle in medicine because medical vision-text distributions differ from web data, distilled training often misses non-imaging medical knowledge, noisy distillation increases hallucination risk, and existing models lack tailored medical reasoning and unified, reproducible evaluation.

Main Contribution

A large, careful data curation pipeline that collects open medical multimodal/text data plus general-domain sources and synthesizes high-quality captions, VQA, OCR and chain-of-thought (CoT) samples.

Lingshu foundation models (7B and 32B) trained with a four-stage recipe: medical shallow alignment, medical deep alignment, medical instruction tuning, and optional medical-oriented RL (GRPO/RLVR).
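The staged recipe can be pictured as a small table of per-stage freezing flags. The sketch below is illustrative only: the first stage follows the summary (vision encoder and projector tuned first, LLM frozen) and data packing is enabled only for instruction tuning, but exactly which later stage unfreezes the LLM is an assumption, and the `Stage` class and flag names are hypothetical. The optional RL stage is left out because this summary does not say which modules it trains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical per-stage training flags (names are illustrative)."""
    name: str
    train_vision_encoder: bool
    train_projector: bool
    train_llm: bool
    use_data_packing: bool


# Stage 1 flags follow the summary: vision encoder + projector first, LLM frozen.
# ASSUMPTION: the LLM is unfrozen from deep alignment onward ("later stages").
STAGES = [
    Stage("medical_shallow_alignment", True, True, False, False),
    Stage("medical_deep_alignment", True, True, True, False),
    # Data packing is used only during instruction tuning, per the summary.
    Stage("medical_instruction_tuning", True, True, True, True),
]


def trainable_modules(stage: Stage) -> list[str]:
    """Return the names of the modules that receive gradients in this stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "projector": stage.train_projector,
        "llm": stage.train_llm,
    }
    return [name for name, enabled in flags.items() if enabled]


for stage in STAGES:
    print(f"{stage.name}: train {trainable_modules(stage)}, "
          f"packing={stage.use_data_packing}")
```

In a real training loop these flags would typically be applied by toggling gradient tracking on the corresponding parameter groups before each stage begins.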

Key Findings

Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.

Numbers: 3.75M open + 1.30M synthetic (§2.3)

Practical Use: Collect both real and synthetic medical data early; synthetic captions/CoTs materially increase modality and reasoning coverage.

Evidence Ref: §2.3, Dataset Summary
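As a quick worked check of the mix, the two figures above imply the total and the synthetic share computed below. This covers only the medical samples quoted in this finding; the general-domain data mentioned in the summary is not included because no count is given for it here.

```python
# Figures quoted in the finding above, in millions of samples.
open_source_m = 3.75
synthetic_m = 1.30

total_m = open_source_m + synthetic_m
synthetic_share = synthetic_m / total_m

print(f"total medical samples: {total_m:.2f}M")   # 5.05M
print(f"synthetic share: {synthetic_share:.1%}")  # 25.7%
```

So roughly one in four medical training samples is synthetic under these numbers.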

Top multimodal performance: Lingshu-32B average 66.6 across seven medical multimodal benchmarks.

Numbers: Avg. 66.6 (Lingshu-32B) on 7 benchmarks (Table 6)

Practical Use: A specialized training recipe plus curated medical data can close much of the gap to proprietary systems for multimodal medical QA on evaluated benchmarks.

Evidence Ref: Table 6, §5.2

Results

Metric: Multimodal QA average (Lingshu-32B)
Value: 66.6 average across 7 multimodal benchmarks
Baseline: Best open-source + proprietary baselines
Delta: not stated
Split / Dataset: Table 6 (7 multimodal tasks)
Evidence: Table 6: Lingshu-32B avg 66.6 across MMMU-Med, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MedXpertQA
Evidence Ref: Table 6

Metric: Multimodal QA average (Lingshu-7B)
Value: 61.8 average across 7 multimodal benchmarks
Baseline: Best open-source <10B baseline (internals)
Delta: +4.5 vs best open-source <10B
Split / Dataset: Table 6
Evidence: Table 6: Lingshu-7B avg 61.8, +4.5 over next best in <10B category
Evidence Ref: Table 6
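The reported delta lets a reader back out two numbers that are not stated directly. The short calculation below derives them from the figures in the results above; these are implied values, not numbers taken from the paper.

```python
# Averages and delta as reported in the results above.
lingshu_32b_avg = 66.6
lingshu_7b_avg = 61.8
delta_vs_next_best_under_10b = 4.5

# Implied average of the next-best open-source <10B model.
implied_next_best = round(lingshu_7b_avg - delta_vs_next_best_under_10b, 1)

# Gain from scaling Lingshu from 7B to 32B on the same seven benchmarks.
gap_32b_vs_7b = round(lingshu_32b_avg - lingshu_7b_avg, 1)

print(implied_next_best)  # 57.3
print(gap_32b_vs_7b)      # 4.8
```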

What To Try In 7 Days

Run MedEvalKit on your own or candidate models to get standardized medical benchmark baselines quickly.

Start with shallow visual alignment: freeze the LLM and fine-tune the vision encoder and projector on clean medical captions.

Prioritize collecting and cleaning medical text instruction and CoT samples; in the ablations, a small text set (173K) gave outsized gains.
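For the first item, a standardized baseline comes down to a per-benchmark accuracy plus an unweighted average across benchmarks, which is how a figure like "average across seven benchmarks" is usually formed. The sketch below shows that bookkeeping on toy data. It is not MedEvalKit's API: the function names, the case-insensitive exact-match rule, and the toy benchmark names are all assumptions for illustration.

```python
from statistics import mean


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Case-insensitive exact-match accuracy, in percent (assumed scoring rule)."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


def macro_average(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean over benchmarks, so small benchmarks count equally."""
    return mean(per_benchmark.values())


# Toy (predictions, references) pairs; hypothetical, not real benchmark data.
toy_runs = {
    "benchmark_a": (["A", "b", "C", "D"], ["A", "B", "C", "A"]),
    "benchmark_b": (["yes", "no"], ["yes", "yes"]),
}

scores = {name: accuracy(preds, refs) for name, (preds, refs) in toy_runs.items()}
print(scores)                 # {'benchmark_a': 75.0, 'benchmark_b': 50.0}
print(macro_average(scores))  # 62.5
```

A macro average is used here rather than pooling all samples, since pooling would let the largest benchmark dominate the headline number.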

Agent Features

Tool Use
GRPO
GPT-4o for data synthesis
BiomedCLIP for modality classification
vLLM for evaluation acceleration
Frameworks
MedEvalKit
Architectures
Qwen2.5-VL backbone (vision encoder + LLM + MLP projector)
Collaboration
Human-in-the-loop for doctor preference annotation and verification
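GRPO, listed under Tool Use above, scores several sampled answers to the same prompt and normalizes each reward against the group, so no separate value model is needed. The sketch below pairs that normalization with a simple verifiable reward (1 if the answer matches the reference, else 0). The exact-match reward rule, the use of the population standard deviation, and the `eps` term are assumptions; the summary does not specify how Lingshu-RL computes rewards.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, reference: str) -> float:
    """Assumed rule: 1.0 on a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to one multiple-choice question.
sampled_answers = ["B", "A", "c", "b"]
rewards = [verifiable_reward(a, "B") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

print(rewards)     # [1.0, 0.0, 0.0, 1.0]
print(advantages)  # roughly [1, -1, -1, 1]
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason gains from this kind of RL can be small on questions a model already answers consistently.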

Optimization Features

Token Efficiency
SFT
Model Optimization
Unfreeze LLM only in later stages (stage-dependent freezing)
Vision encoder + projector fine-tuned first
System Optimization
Chunk-based image deduplication to speed preprocessing
Training Optimization
Multi-stage shallow-to-deep training recipe
Data packing used only during instruction tuning
AdamW optimizer with cosine LR scheduler and warmup
Inference Optimization
vLLM for high-throughput evaluation
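The cosine learning-rate schedule with warmup named under Training Optimization can be written as a single function of the step index. This is a generic sketch of that schedule shape, not the paper's configuration: linear warmup is assumed, and the step counts and peak rate used in the example are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to make the shape visible.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at(step, total_steps=100, warmup_steps=10, peak_lr=1e-3))
```

With these example settings the rate climbs to the peak over the first 10 steps, is at half the peak midway through the decay phase (step 55), and reaches zero at step 100.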

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Open-source multimodal medical data is still noisy: low-resolution images, annotation errors, and uneven modality coverage.

Performance still lags top proprietary systems on some clinical reasoning tasks.

When Not To Use

Do not use Lingshu as an autonomous diagnostic system without expert oversight and validation.

Avoid deploying in high-stakes, legally regulated clinical decisions without certified evaluation.

Failure Modes

Hallucinations from synthetic or model-distilled training signals.

Overconfidence on out-of-distribution images or rare pathologies.

Core Entities

Models

Lingshu-7B, Lingshu-32B, Lingshu-RL, Qwen2.5-VL-Instruct (backbone), GPT-4.1, Claude Sonnet 4, Gemini-2.5-Flash

Metrics

Accuracy, ROUGE-L, CIDEr, SembScore, RaTEScore, RadCliQ-v1

Datasets

ROCOv2, LLaVA-Med, PubMedVision, MIMIC-CXR, PMC-OA, Quilt-LLaVA, MedICaT, MedPix-2.0

Benchmarks

VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MMMU, MedXpertQA, MIMIC-CXR, IU-Xray, CheXpert Plus

Context Entities

Models

InternVL2.5, InternVL3, MedGemma, HuatuoGPT-V, BioMediX2

Metrics

ROUGE-L, CIDEr, RadCliQ, ReXrank

Datasets

PathVQA, VQA-Med-2019, ROCO, ROCOv2, OmniMedVQA

Benchmarks

MedQA-USMLE, PubMedQA, MedMCQA, MMLU (medical subset), SuperGPQA