Lingshu: a medical multimodal foundation model trained on curated medical+general data with MedEvalKit evaluation

June 8, 2025 · 10 min

Overview

Production Readiness

0.55

Novelty Score

0.55

Cost Impact Score

0.5

Citation Count

4

Authors

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Links

Abstract / PDF

Why It Matters For Business

A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.

Summary TLDR

This paper introduces Lingshu, a multimodal foundation model for medicine built from a large, curated mix of open-source medical data, general-domain data, and 1.3M synthetic medical samples. The team trains 7B and 32B variants with a four-stage recipe (shallow alignment, deep alignment, instruction tuning, optional RL with verifiable rewards). They release MedEvalKit, a unified benchmark suite of 152K+ samples across multimodal QA, text QA, and report generation. Lingshu-32B achieves an average score of 66.6 across seven multimodal benchmarks; ablations show that medical text and curated captions are especially valuable. RL with verifiable rewards yields only marginal, mixed gains in this initial study.

Problem Statement

General-purpose multimodal LLMs struggle in medicine because medical vision-text distributions differ from web data, distilled training often misses non-imaging medical knowledge, noisy distillation increases hallucination risk, and existing models lack tailored medical reasoning and unified, reproducible evaluation.

Main Contribution

A large, careful data curation pipeline that collects open medical multimodal/text data plus general-domain sources and synthesizes high-quality captions, VQA, OCR and chain-of-thought (CoT) samples.

Lingshu foundation models (7B and 32B) trained with a four-stage recipe: medical shallow alignment, medical deep alignment, medical instruction tuning, and optional medical-oriented RL (GRPO/RLVR).

MedEvalKit: a unified evaluation framework that consolidates major multimodal/text medical benchmarks (152,066 samples, 121,622 images) and uses rule-based + LLM-as-judge scoring.

Comprehensive experiments showing that Lingshu outperforms prior open-source MLLMs on most evaluated medical QA and report-generation metrics, and ablations that quantify which data types matter most.
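To make the "rule-based + LLM-as-judge" scoring idea concrete, here is a minimal sketch of what the rule-based half might look like for multiple-choice questions. It is not MedEvalKit's actual code: the function names `extract_choice` and `rule_based_score`, the regular expressions, and the "return `None` to defer to a judge" convention are all assumptions made for illustration.

```python
import re
from typing import Optional


def extract_choice(response: str, options: str = "ABCDE") -> Optional[str]:
    """Heuristically pull one option letter out of a free-form response."""
    # Prefer explicit patterns such as "Answer: B" or "the answer is (C)".
    # Only the word "answer" is matched case-insensitively; the option
    # letter itself must be uppercase to avoid matching the article "a".
    explicit = re.search(
        rf"(?i:answer)\s*(?:is|:)?\s*\(?([{options}])(?![A-Za-z])", response
    )
    if explicit:
        return explicit.group(1)
    # Fallback: accept a standalone letter only if it is unambiguous.
    letters = set(re.findall(rf"\b([{options}])\b", response))
    return letters.pop() if len(letters) == 1 else None


def rule_based_score(response: str, gold: str) -> Optional[bool]:
    """True/False when rules can decide; None means 'hand off to a judge model'."""
    predicted = extract_choice(response)
    if predicted is None:
        return None  # hypothetical convention: undecidable cases go to an LLM judge
    return predicted == gold.strip().upper()
```

Returning `None` instead of `False` for unparseable responses keeps the cheap deterministic path separate from the cases where a judge model would be consulted.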

Key Findings

Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.

Numbers: 3.75M open + 1.30M synthetic (§2.3)

Top multimodal performance: Lingshu-32B average 66.6 across seven medical multimodal benchmarks.

Numbers: Avg. 66.6 (Lingshu-32B) on 7 benchmarks (Table 6)

Smaller model competitiveness: Lingshu-7B avg 61.8 on multimodal benchmarks; +4.5 over best open-source <10B baseline.

Numbers: 61.8 avg; +4.5 vs best <10B (Table 6)

Medical-text data is critical: removing 173K medical text samples causes substantial drops across tasks.

Numbers: Removing the 173K medical text samples → notable score drops on 5/7 tasks (Table 10)

RL with verifiable rewards gave mixed, small changes: marginal gains on some benchmarks and declines on others.

Numbers: Lingshu-RL-7B: +0.7% on MMMU-Med and PMC-VQA; −1.6% on OMVQA (Table 9)

Data scaling effect: accuracy rose from roughly 52% to 62% as training data increased to full scale.

Numbers: Aggregate accuracy from ~52% → ~62% as data approaches 100% (Figure 10)

Results

Multimodal QA average (Lingshu-32B)

Value: 66.6 average across 7 multimodal benchmarks

Baseline: Best open-source + proprietary baselines

Multimodal QA average (Lingshu-7B)

Value: 61.8 average across 7 multimodal benchmarks

Baseline: Best open-source <10B baseline (internals)

Text-only QA average (Lingshu-32B)

Value: 61.8 average across medical text benchmarks

Baseline: InternVL3-38B and other large open models

Report generation composite (RadCliQ⁻¹, Lingshu-32B)

Value: 130.4 (scaled, higher is better after reciprocal)

Baseline: Open-source baselines in Table 8

Effect of RL (Lingshu-RL-7B vs Lingshu-7B)

Value: Average performance roughly unchanged; small per-dataset shifts

Baseline: Lingshu-7B
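For readers unfamiliar with the RL setup behind these numbers, the sketch below shows the two ingredients in their simplest form: a verifiable (rule-checkable) reward and the group-relative advantage that GRPO computes over several sampled answers to the same prompt. The exact-match reward, the use of the population standard deviation, and the `eps` value are assumptions for illustration; the paper's actual reward design is not described in this summary.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(predicted: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason data selection matters for this style of training.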

Who Should Care

What To Try In 7 Days

Run MedEvalKit on your own or candidate models to get standardized medical benchmark baselines quickly.

Start with shallow visual alignment: freeze the LLM and fine-tune the vision encoder and projector on clean medical captions.

Prioritize collecting and cleaning medical textual instruction and CoT samples: in the ablations, a small text set (173K samples) gave outsized gains.

Agent Features

Tool Use

  • GRPO
  • GPT-4o for data synthesis
  • BiomedCLIP for modality classification
  • vLLM for evaluation acceleration

Frameworks

  • MedEvalKit

Architectures

  • Qwen2.5-VL backbone (vision encoder + LLM + MLP projector)

Collaboration

  • Human-in-the-loop for doctor preference annotation and verification

Optimization Features

Token Efficiency

  • SFT

Model Optimization

  • Unfreeze LLM only in later stages (stage-dependent freezing)
  • Vision encoder + projector fine-tuned first
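The stage-dependent freezing above can be written down as a small lookup table. This is a sketch under assumptions: the summary only says the vision encoder and projector are tuned first and the LLM is unfrozen "in later stages", so treating every stage after shallow alignment as fully trainable is a guess, and the stage and module names are hypothetical labels.

```python
from typing import Dict, FrozenSet

MODULES = ("vision_encoder", "projector", "llm")

# Assumed schedule: only the first stage keeps the LLM frozen.
TRAINABLE_BY_STAGE: Dict[str, FrozenSet[str]] = {
    "shallow_alignment": frozenset({"vision_encoder", "projector"}),
    "deep_alignment": frozenset(MODULES),
    "instruction_tuning": frozenset(MODULES),
}


def requires_grad_flags(stage: str) -> Dict[str, bool]:
    """Map each top-level module to whether it should receive gradients."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}
```

In a PyTorch training script, each flag would typically be applied by calling `requires_grad_(flag)` on the corresponding submodule before building the optimizer.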

System Optimization

  • Chunk-based image deduplication to speed preprocessing
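The summary does not say how the chunking works, so the following is only one plausible reading: stream images through in fixed-size chunks so that only one chunk of raw bytes is held at a time, and keep a global set of hashes to drop repeats. SHA-256 over raw bytes is a stand-in that catches exact duplicates only; the function names and `chunk_size` default are hypothetical.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator, List, Set, Tuple


def chunked(items: Iterable[Tuple[str, bytes]], size: int) -> Iterator[List[Tuple[str, bytes]]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def dedup_images(images: Iterable[Tuple[str, bytes]], chunk_size: int = 1024) -> List[str]:
    """Return the IDs of images whose content has not been seen before."""
    seen: Set[str] = set()
    kept: List[str] = []
    for chunk in chunked(images, chunk_size):
        # Each chunk is an independent unit of work; hashing could be
        # parallelized per chunk, with only the `seen` set shared.
        for image_id, data in chunk:
            digest = hashlib.sha256(data).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(image_id)
    return kept
```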

Training Optimization

  • Multi-stage shallow-to-deep training recipe
  • Data packing used only during instruction tuning
  • AdamW optimizer with cosine LR scheduler and warmup
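As a reference for the last item, a cosine schedule with linear warmup can be expressed in a few lines. This is a generic sketch rather than the paper's configuration: the warmup length, peak learning rate, and `min_lr` floor are all placeholder parameters, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    warmup_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A function like this can be sampled once per step and written into the optimizer's parameter groups, or adapted into a multiplier for a lambda-style scheduler.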

Inference Optimization

  • vLLM for high-throughput evaluation

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Open-source multimodal medical data is still noisy: low-resolution images, annotation errors, uneven modality coverage.
  • Performance still lags top proprietary systems on some clinical reasoning tasks.
  • Reinforcement learning with verifiable rewards is preliminary and sensitive to reward design and data selection.
  • Generalization to WSI, 3D imaging, genomics, and full clinical workflows remains to be proven.

When Not To Use

  • Do not use Lingshu as an autonomous diagnostic system without expert oversight and validation.
  • Avoid deploying in high-stakes, legally regulated clinical decisions without certified evaluation.
  • Not suitable, without further adaptation, for settings that depend on private or proprietary patient data.

Failure Modes

  • Hallucinations from synthetic or model-distilled training signals.
  • Overconfidence on out-of-distribution images or rare pathologies.
  • Poor performance when training/eval data overlap or contamination is present.
  • Reward misspecification in RL causing optimization toward format rather than correct reasoning.

Core Entities

Models

  • Lingshu-7B
  • Lingshu-32B
  • Lingshu-RL
  • Qwen2.5-VL-Instruct (backbone)
  • GPT-4.1
  • Claude Sonnet 4
  • Gemini-2.5-Flash

Metrics

  • Accuracy
  • ROUGE-L
  • CIDEr
  • SembScore
  • RaTEScore
  • RadCliQ-v1

Datasets

  • ROCOv2
  • LLaVA-Med
  • PubMedVision
  • MIMIC-CXR
  • PMC-OA
  • Quilt-LLaVA
  • MedICaT
  • MedPix-2.0

Benchmarks

  • VQA-RAD
  • SLAKE
  • PathVQA
  • PMC-VQA
  • OmniMedVQA
  • MMMU
  • MedXpertQA
  • MIMIC-CXR
  • IU-Xray
  • CheXpert Plus

Context Entities

Models

  • InternVL2.5
  • InternVL3
  • MedGemma
  • HuatuoGPT-V
  • BioMediX2

Metrics

  • ROUGE-L
  • CIDEr
  • RadCliQ
  • ReXrank

Datasets

  • PathVQA
  • VQA-Med-2019
  • ROCO
  • ROCOv2
  • OmniMedVQA

Benchmarks

  • MedQA-USMLE
  • PubMedQA
  • MedMCQA
  • MMLU (medical subset)
  • SuperGPQA