MAIRA-2, a multimodal chest X‑ray model that generates grounded findings, and RadFact, an LLM-based sentence-level evaluator

Overview

Decision Snapshot: Needs Validation

MAIRA-2 and RadFact are practically useful for draft-generation and model evaluation; however, sentence-level factual errors and missed findings mean human review is required before clinical use.

Citations: 13

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 40%

Novelty: 60%

Authors

Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland

Why It Matters For Business

MAIRA-2 can produce editable, spatially grounded draft radiology findings, and RadFact provides an LLM-based evaluator for them. Together they reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

MAIRA-2 is a chest X‑ray multimodal system that (1) generates the full Findings text and (optionally) a bounding box for each finding (grounded reporting), (2) integrates richer report context (lateral view, prior studies, Indication, Technique, Comparison), and (3) introduces RadFact, an LLM-based sentence-level entailment evaluator for factuality and grounding. MAIRA-2 sets a new state-of-the-art on MIMIC-CXR findings generation and demonstrates feasible grounded reporting, but RadFact and clinician review show sizable remaining error rates and missed findings, so outputs are currently useful as draft reports requiring human review.

Problem Statement

Automatically drafted radiology Findings must be both clinically correct and verifiable. Existing systems hallucinate, omit findings, or lack ways to localise and verify each reported observation. The paper aims to (a) add grounding (per-sentence bounding boxes) to report generation and (b) create an evaluation method that checks factuality at the sentence level without hard-coded finding classes.

Main Contribution

Define grounded radiology reporting: each finding sentence can include bounding box(es) locating that finding on the image.

Introduce RadFact: an LLM-based sentence-level entailment framework to score factual precision and recall and to evaluate grounding.
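The precision/recall idea behind sentence-level entailment scoring can be sketched as follows. This is a minimal illustration, not the RadFact implementation: the `judge` callable, the `toy_judge` stand-in (exact string matching instead of an LLM call), and the handling of empty reports are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (hypothesis sentence, premise sentences) -> entailed?
EntailmentJudge = Callable[[str, List[str]], bool]


def logical_precision_recall(
    generated: List[str], reference: List[str], judge: EntailmentJudge
) -> Tuple[float, float]:
    """Score a generated report against a reference, sentence by sentence."""

    def entailed_fraction(hypotheses: List[str], premises: List[str]) -> float:
        # Assumption: an empty hypothesis list scores 0.0.
        if not hypotheses:
            return 0.0
        return sum(judge(h, premises) for h in hypotheses) / len(hypotheses)

    # Precision: share of generated sentences supported by the reference.
    precision = entailed_fraction(generated, reference)
    # Recall: share of reference sentences supported by the generated report.
    recall = entailed_fraction(reference, generated)
    return precision, recall


def toy_judge(hypothesis: str, premises: List[str]) -> bool:
    # Placeholder for an LLM entailment call: case-insensitive exact match only.
    return hypothesis.strip().lower() in {p.strip().lower() for p in premises}


generated = ["No pleural effusion.", "Cardiomegaly is present."]
reference = ["No pleural effusion.", "No pneumothorax.", "Lungs are clear."]
precision, recall = logical_precision_recall(generated, reference, toy_judge)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.33
```

In a real setup the judge would be an LLM prompted to decide entailment, which is what lets the score work without a fixed list of finding classes.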

Key Findings

MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.

Numbers: ROUGE-L 38.4; BLEU-4 23.1; RadGraph-F1 34.6 (Table D.1)

Practical Use: Expect better word-level matches and improved clinical label scores when using MAIRA-2 outputs as a draft compared to prior models.

Evidence Ref: Results, Table D.1

RadFact shows MAIRA-2 still makes many sentence-level factual errors on MIMIC-CXR.

Numbers: RadFact logical precision 52.9%, logical recall 48.2% on MIMIC-CXR (Table D.1)

Practical Use: Treat outputs as first drafts that need expert verification; do not use for unsupervised clinical sign-off.

Evidence Ref: Results, Table D.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE-L (MIMIC-CXR findings generation)	38.4 [37.9, 39.0] (median, 95% CI)	MAIRA-1 ROUGE-L 28.9	+9.5 absolute	MIMIC-CXR test	Table D.1	Table D.1
RadFact logical precision / recall (MIMIC-CXR)	52.9% / 48.2% (median)	—	—	MIMIC-CXR test	Table D.1	Table D.1

What To Try In 7 Days

Run RadFact on your report-generation models to get sentence-level factuality signals.

If you have priors/lateral images, add them to model prompts and measure changes using ablations.

Collect a small grounded subset (100–500 studies) and test bounding-box-assisted review to estimate reviewer time savings.
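One way to summarise such a reviewer-time pilot is a bootstrap confidence interval on the median per-study saving. The sketch below assumes paired timings (the same studies reviewed with and without boxes); the function name and the numbers in `savings_seconds` are hypothetical placeholders, not measurements.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_median_ci(
    values: List[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (median, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.median(values), lower, upper


# Hypothetical per-study savings in seconds: unassisted time minus box-assisted time.
savings_seconds = [12, 5, -3, 20, 8, 15, 1, 9, 11, 4]
median, lower, upper = bootstrap_median_ci(savings_seconds)
print(f"median saving {median}s, 95% CI [{lower}, {upper}]")
```

If the interval includes zero, the pilot is too small or too noisy to support a claim of time savings.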

Optimization Features

Token Efficiency

discretised box coordinate tokens (separate x/y tokens, N=100)
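As a rough illustration of discretised coordinates, the sketch below maps a normalised box onto N=100 bins per axis and emits separate x and y tokens, so the vocabulary would grow by 2 × 100 = 200 tokens. The `<x12>` / `<y25>` token spelling, the corner ordering, and the bin-centre decoding are assumptions for the example, not the format used by MAIRA-2.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in [0, 1]


def box_to_tokens(box: Box, num_bins: int = 100) -> List[str]:
    """Quantise a normalised box into four coordinate tokens."""

    def bin_index(value: float) -> int:
        clipped = min(max(value, 0.0), 1.0)
        # A value of exactly 1.0 falls into the last bin.
        return min(int(clipped * num_bins), num_bins - 1)

    x_min, y_min, x_max, y_max = box
    return [
        f"<x{bin_index(x_min)}>",
        f"<y{bin_index(y_min)}>",
        f"<x{bin_index(x_max)}>",
        f"<y{bin_index(y_max)}>",
    ]


def tokens_to_box(tokens: List[str], num_bins: int = 100) -> Box:
    """Decode tokens back to coordinates at the centre of each bin."""
    x0, y0, x1, y1 = ((int(t[2:-1]) + 0.5) / num_bins for t in tokens)
    return (x0, y0, x1, y1)


tokens = box_to_tokens((0.125, 0.25, 0.5, 1.0))
print(tokens)  # ['<x12>', '<y25>', '<x50>', '<y99>']
```

Each box costs four tokens regardless of image resolution, at the price of a quantisation error of at most half a bin width per coordinate.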

System Optimization

linear RoPE scaling to extend LLM context for multiple images
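Linear RoPE scaling (also called position interpolation) divides token positions by a constant factor, so that a longer sequence maps back into the position range seen during pre-training. The sketch below shows only that arithmetic; the `base`, `head_dim`, and `scaling_factor` values are illustrative assumptions, since the summary does not state the ones used.

```python
from typing import List


def rope_angles(
    position: int, head_dim: int, base: float = 10000.0, scaling_factor: float = 1.0
) -> List[float]:
    """Rotation angles for one token position, one per pair of head dimensions."""
    # Linear scaling: compress the position index before computing angles.
    scaled_position = position / scaling_factor
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]


# With a factor of 2, position 8192 gets the same angles as unscaled position 4096.
assert rope_angles(8192, head_dim=8, scaling_factor=2.0) == rope_angles(4096, head_dim=8)
```

The extra room matters here because several images (frontal, lateral, prior) each contribute many tokens to the prompt.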

Training Optimization

multitask training on grounded and non-grounded examples

frozen image encoder with trained adapter and LLM fine-tuning

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/microsoft/RadFact

Risks & Boundaries

Limitations

RadFact uses strict entailment and penalises partial descriptions; it does not weight clinical severity.

Bounding boxes are often imprecise (low spatial precision), limiting standalone automated use.
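To quantify how imprecise a predicted box is on your own data, two generic overlap measures are a reasonable starting point: intersection-over-union, and the fraction of the predicted box that lies inside the reference box. The sketch below is a standard geometric calculation and is not claimed to be the spatial criterion RadFact applies; the function names are chosen for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_area(box: Box) -> float:
    x_min, y_min, x_max, y_max = box
    # Degenerate or inverted boxes count as zero area.
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def intersection_area(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def iou(a: Box, b: Box) -> float:
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def covered_fraction(predicted: Box, reference: Box) -> float:
    """Share of the predicted box's area that falls inside the reference box."""
    area = box_area(predicted)
    return intersection_area(predicted, reference) / area if area > 0 else 0.0


predicted, reference = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
print(round(iou(predicted, reference), 3), covered_fraction(predicted, reference))  # 0.143 0.25
```

A low covered fraction with a reasonable IoU usually points to oversized boxes, which is worth distinguishing from boxes placed in the wrong region.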

When Not To Use

Do not use MAIRA-2 outputs for autonomous clinical sign-off.

Avoid deploying without prior/lateral/technique inputs when those inputs matter for decisions.

Failure Modes

Hallucinated temporal comparisons when prior is missing or not provided.

Spurious lateral mentions when lateral view or technique is absent.

Core Entities

Models

MAIRA-2, MAIRA-1, Rad-DINO-MAIRA-2, Vicuna-7B-v1.5, Llama3-70B-Instruct, RadFact-Llama3

Metrics

RadFact (logical precision/recall, grounding/spatial), RadGraph-F1, CheXbert / CheXpert F1, RadCliQ, ROUGE-L, BLEU-4

Datasets

MIMIC-CXR, PadChest, USMix, IU-Xray, GR-Bench, PadChest-GR, MS-CXR

Benchmarks

MIMIC-CXR findings generation, Grounded reporting (GR-Bench, PadChest-GR)

Context Entities

Models

GPT-4 (used for some preprocessing and sentence extraction), Meta-Llama3-70B-Instruct (RadFact backbone)

Datasets

PadChest-GR (concurrently developed grounded dataset), USMix (private dataset used for GR-Bench)
