Lingshu: a medical multimodal foundation model trained on curated medical+general data with MedEvalKit evaluation

Overview

Decision Snapshot: Needs Validation

Well-documented staged training and ablations give clear guidance for practitioners, but final medical-grade deployment requires more expert validation and higher-quality, audited datasets.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 55%

Novelty: 55%

Authors

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Why It Matters For Business

A carefully curated multimodal medical dataset plus staged tuning produces practical, near-proprietary medical QA and reporting performance while enabling smaller, cheaper models for deployment.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist, Founder

Summary TLDR

This paper introduces Lingshu, a multimodal foundation model for medicine built from a large, curated mix of open-source medical data, general-domain data, and 1.3M synthetic medical samples. The team trains 7B and 32B variants with a four-stage recipe (shallow alignment, deep alignment, instruction tuning, and optional RL with verifiable rewards). They also release MedEvalKit, a unified benchmark suite of more than 152K samples spanning multimodal QA, text QA, and report generation. Lingshu-32B averages 66.6 across seven multimodal benchmarks, and ablations show that medical text and curated captions are especially valuable. RL with verifiable rewards yields only marginal, mixed gains in this initial study.

Problem Statement

General-purpose multimodal LLMs struggle in medicine for four reasons: medical vision-text distributions differ from web data; distillation-based training often misses non-imaging medical knowledge; noisy distillation increases hallucination risk; and existing models lack both tailored medical reasoning and a unified, reproducible evaluation.

Main Contribution

A large, careful data curation pipeline that collects open medical multimodal/text data plus general-domain sources and synthesizes high-quality captions, VQA, OCR and chain-of-thought (CoT) samples.

Lingshu foundation models (7B and 32B) trained with a four-stage recipe: medical shallow alignment, medical deep alignment, medical instruction tuning, and optional medical-oriented RL (GRPO/RLVR).
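The optional RL stage named above (GRPO with verifiable rewards) can be illustrated with a minimal sketch. It assumes an exact-match reward on a multiple-choice answer and the common GRPO-style normalization of each reward against its own sampling group. The function names, the use of the population standard deviation, and the sample answers are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, gold: str) -> float:
    """Rule-checkable reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one question whose gold answer is "B".
samples = ["B", "b ", "C", "A"]
rewards = [verifiable_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive positive advantages and incorrect ones negative. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.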

Key Findings

Training data scale and mix: 3.75M open-source medical samples + 1.30M synthetic medical samples.

Numbers: 3.75M open + 1.30M synthetic (§2.3)

Practical Use: Collect both real and synthetic medical data early; synthetic captions/CoTs materially increase modality and reasoning coverage.

Evidence Ref: §2.3, Dataset Summary

Top multimodal performance: Lingshu-32B average 66.6 across seven medical multimodal benchmarks.

Numbers: Avg. 66.6 (Lingshu-32B) on 7 benchmarks (Table 6)

Practical Use: A specialized training recipe plus curated medical data can close much of the gap to proprietary systems for multimodal medical QA on evaluated benchmarks.

Evidence Ref: Table 6, §5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Multimodal QA average (Lingshu-32B)	66.6 average across 7 multimodal benchmarks	Best open-source + proprietary baselines	—	Table 6 (7 multimodal tasks)	Table 6: Lingshu-32B avg 66.6 across MMMU-Med, VQA-RAD, SLAKE, PathVQA, PMC-VQA, OMVQA, MedXQA	Table 6
Multimodal QA average (Lingshu-7B)	61.8 average across 7 multimodal benchmarks	Best open-source <10B baseline (internals)	+4.5 vs best open-source <10B	Table 6	Table 6: Lingshu-7B avg 61.8, +4.5 over next best in <10B category	Table 6
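The averages and deltas in the table can be reproduced with simple arithmetic. The sketch below assumes an unweighted mean over the seven benchmarks, since the summary does not say whether any weighting is applied. The per-benchmark values are placeholders chosen for illustration, not the paper's reported scores. The 57.3 baseline is only what the reported 61.8 average and +4.5 delta imply.

```python
def macro_average(scores: dict) -> float:
    """Unweighted mean over benchmarks (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


def delta_vs_baseline(model_avg: float, baseline_avg: float) -> float:
    """Difference in average score, rounded to one decimal as in the table."""
    return round(model_avg - baseline_avg, 1)


# Placeholder scores for the seven benchmarks named in Table 6 (NOT real results).
placeholder_scores = {
    "MMMU-Med": 50.0, "VQA-RAD": 60.0, "SLAKE": 80.0, "PathVQA": 70.0,
    "PMC-VQA": 55.0, "OMVQA": 85.0, "MedXQA": 20.0,
}
avg = macro_average(placeholder_scores)

# Reported Lingshu-7B average of 61.8 with a +4.5 delta implies a next-best
# sub-10B average of about 57.3.
delta = delta_vs_baseline(61.8, 57.3)
```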

What To Try In 7 Days

Run MedEvalKit on your own or candidate models to get standardized medical benchmark baselines quickly.

Start with shallow visual alignment: freeze LLM and fine-tune vision encoder on clean medical captions.

Prioritize collecting and cleaning medical textual instruction and CoT samples: small text sets (173K samples) gave outsized gains in ablations.

Agent Features

Tool Use

GRPO

GPT-4o for data synthesis

BiomedCLIP for modality classification

vLLM for evaluation acceleration

Frameworks

MedEvalKit

Architectures

Qwen2.5-VL backbone (vision encoder + LLM + MLP projector)

Collaboration

Human-in-the-loop for doctor preference annotation and verification

Optimization Features

Token Efficiency

SFT

Model Optimization

Unfreeze LLM only in later stages (stage-dependent freezing)

Vision encoder + projector fine-tuned first
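Stage-dependent freezing can be written down as a small plan that maps each training stage to the components it updates. The sketch below takes the first stage from the summary (LLM frozen; vision encoder and projector trained). Treating every component as trainable in the later stages is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Components of the Qwen2.5-VL-style backbone named in the summary.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset


STAGES = [
    # From the summary: the LLM stays frozen during shallow alignment.
    Stage("shallow_alignment", frozenset({"vision_encoder", "projector"})),
    # Assumption: later stages unfreeze everything.
    Stage("deep_alignment", frozenset(COMPONENTS)),
    Stage("instruction_tuning", frozenset(COMPONENTS)),
]


def freeze_plan(stage: Stage) -> dict:
    """Return {component: is_trainable} for one stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


for stage in STAGES:
    print(stage.name, freeze_plan(stage))
```

In a PyTorch training loop, a plan like this would typically be applied by setting `requires_grad` on the parameters of each submodule before building the optimizer.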

System Optimization

Chunk-based image deduplication to speed preprocessing
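One way to read "chunk-based" deduplication is to deduplicate each chunk independently (which can run in parallel) and then merge the surviving entries. The sketch below follows that reading. It uses exact SHA-256 content hashes purely for illustration; the summary does not specify the hashing method, and a real pipeline may use perceptual hashes to catch near-duplicates. All names here are hypothetical.

```python
import hashlib


def dedup_chunk(chunk):
    """Return {digest: image_id}, keeping the first occurrence within a chunk."""
    first_seen = {}
    for image_id, raw_bytes in chunk:
        digest = hashlib.sha256(raw_bytes).hexdigest()
        first_seen.setdefault(digest, image_id)
    return first_seen


def dedup_in_chunks(images, chunk_size):
    """Deduplicate per chunk, then merge chunk results in order."""
    merged = {}
    for start in range(0, len(images), chunk_size):
        chunk_result = dedup_chunk(images[start:start + chunk_size])
        for digest, image_id in chunk_result.items():
            merged.setdefault(digest, image_id)
    return list(merged.values())


# Toy byte strings stand in for image files.
images = [("a", b"x"), ("b", b"y"), ("c", b"x"), ("d", b"z"), ("e", b"y")]
kept = dedup_in_chunks(images, chunk_size=2)
```

The merge step is what catches duplicates that fall into different chunks, such as images "a" and "c" in the toy input.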

Training Optimization

Multi-stage shallow-to-deep training recipe

Data packing used only during instruction tuning

AdamW optimizer with cosine LR scheduler and warmup
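A cosine learning-rate schedule with warmup, as listed above, can be sketched as a pure function of the step index. The linear warmup shape, the decay floor, and all numeric values below are assumptions for illustration; the summary does not give the paper's hyperparameters.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical run: 100 steps, 10 warmup steps, peak learning rate 1e-4.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-4) for s in range(101)]
```

The rate rises during the first ten steps, peaks, and then falls smoothly toward the floor by the final step.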

Inference Optimization

vLLM for high-throughput evaluation

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Open-source multimodal medical data still noisy: low-resolution images, annotation errors, uneven modality coverage.

Performance still lags top proprietary systems on some clinical reasoning tasks.

When Not To Use

Do not use Lingshu as an autonomous diagnostic system without expert oversight and validation.

Avoid deploying in high-stakes, legally regulated clinical decisions without certified evaluation.

Failure Modes

Hallucinations from synthetic or model-distilled training signals.

Overconfidence on out-of-distribution images or rare pathologies.

Core Entities

Models

Lingshu-7B, Lingshu-32B, Lingshu-RL, Qwen2.5-VL-Instruct (backbone), GPT-4.1, Claude Sonnet 4, Gemini-2.5-Flash

Metrics

Accuracy, ROUGE-L, CIDEr, SembScore, RaTEScore, RadCliQ-v1
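Of the report-generation metrics listed, ROUGE-L is simple enough to sketch directly: it scores a candidate by its longest common subsequence (LCS) with the reference. The version below assumes lowercase whitespace tokenization and the F1 form of the score; MedEvalKit's own implementation may tokenize or weight differently. The example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented report snippets: LCS is 3 tokens, precision 1.0, recall 0.6.
score = rouge_l_f1("no pleural effusion", "no pleural effusion or pneumothorax")
```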

Datasets

ROCOv2, LLaVA-Med, PubMedVision, MIMIC-CXR, PMC-OA, Quilt-LLaVA, MedICaT, MedPix-2.0

Benchmarks

VQA-RAD, SLAKE, PathVQA, PMC-VQA, OmniMedVQA, MMMU, MedXpertQA, MIMIC-CXR, IU-Xray, CheXpert Plus

Context Entities

Models

InternVL2.5, InternVL3, MedGemma, HuatuoGPT-V, BioMediX2

Metrics

ROUGE-L, CIDEr, RadCliQ, ReXrank

Datasets

PathVQA, VQA-Med-2019, ROCO, ROCOv2, OmniMedVQA

Benchmarks

MedQA-USMLE, PubMedQA, MedMCQA, MMLU (medical subset), SuperGPQA
