MAIRA-2: a multimodal chest X‑ray model that generates grounded findings and RadFact, an LLM-based sentence-level evaluator

June 6, 2024

8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.35

Citation Count

13

Authors

Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland

Why It Matters For Business

MAIRA-2 produces editable draft radiology findings that are grounded to image regions, and RadFact provides an LLM-based way to evaluate them. Together they can reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.

Summary TLDR

MAIRA-2 is a chest X‑ray multimodal system that (1) generates the full Findings text and (optionally) a bounding box for each finding (grounded reporting), (2) integrates richer report context (lateral view, prior studies, Indication, Technique, Comparison), and (3) introduces RadFact, an LLM-based sentence-level entailment evaluator for factuality and grounding. MAIRA-2 sets a new state-of-the-art on MIMIC-CXR findings generation and demonstrates feasible grounded reporting, but RadFact and clinician review show sizable remaining error rates and missed findings, so outputs are currently useful as draft reports requiring human review.

Problem Statement

Automatically drafted radiology Findings must be both clinically correct and verifiable. Existing systems hallucinate, omit findings, or lack a way to localise and verify each reported observation. The paper aims to (a) add grounding (per-sentence bounding boxes) to report generation and (b) create an evaluation method that checks factuality at the sentence level without hard-coded finding classes.

Main Contribution

Define grounded radiology reporting: each finding sentence can include bounding box(es) locating that finding on the image.

Introduce RadFact: an LLM-based sentence-level entailment framework to score factual precision and recall and to evaluate grounding.

Develop MAIRA-2: a chest X‑ray specialised multimodal model (Vicuna 7B LLM + Rad-DINO-MAIRA-2 image encoder) that generates grounded and non-grounded reports.

Show MAIRA-2 is state-of-the-art on public findings benchmarks (MIMIC-CXR) and provide extensive ablations on the value of priors, lateral views, and report sections.

Release RadFact code and the grounded-report annotation protocol to support future work.
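
The sentence-level precision and recall that RadFact reports can be illustrated with a minimal sketch. It assumes that logical precision is the fraction of generated sentences entailed by the reference report, and logical recall is the fraction of reference sentences entailed by the generated report. The `entails` callable is a hypothetical stand-in for the LLM judge, and `exact_match_judge` is a toy replacement used only so the example runs; neither is part of the released RadFact code. Returning 0.0 for an empty report is also an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (premise sentences, hypothesis sentence) -> entailed?
EntailmentJudge = Callable[[List[str], str], bool]


def logical_precision_recall(
    generated: List[str],
    reference: List[str],
    entails: EntailmentJudge,
) -> Tuple[float, float]:
    """Score one report pair in both directions with the same judge."""

    def fraction_entailed(hypotheses: List[str], premises: List[str]) -> float:
        if not hypotheses:
            return 0.0  # assumption: an empty side scores zero
        hits = sum(1 for h in hypotheses if entails(premises, h))
        return hits / len(hypotheses)

    # Precision: are the generated sentences supported by the reference?
    precision = fraction_entailed(generated, reference)
    # Recall: are the reference sentences covered by the generated report?
    recall = fraction_entailed(reference, generated)
    return precision, recall


def exact_match_judge(premises: List[str], hypothesis: str) -> bool:
    """Toy judge for illustration only; a real judge would be an LLM call."""
    return hypothesis in premises
```

Because the judge works on free-text sentences, this style of scoring does not need a fixed list of finding classes, which matches the motivation given in the Problem Statement.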

Key Findings

MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.

Numbers: ROUGE-L 38.4; BLEU-4 23.1; RadGraph-F1 34.6 (Table D.1)

RadFact shows MAIRA-2 still makes many sentence-level factual errors on MIMIC-CXR.

Numbers: RadFact logical precision 52.9%, logical recall 48.2% on MIMIC-CXR (Table D.1)

MAIRA-2 can produce grounded reports with good grounding coverage but modest box accuracy.

Numbers: GR-Bench grounding precision 68.8%, grounding recall 90.6%; spatial precision 33.5% (Table D.4)

MAIRA-2 generalises reasonably to an unseen dataset (IU-Xray) with higher sentence-level entailment scores than on MIMIC.

Numbers: RadFact logical precision 71.4%, logical recall 67.6% on IU-Xray (Table D.3)

Human expert review finds most generated sentences acceptable but missed findings remain the dominant error.

Numbers: 20-case review: 123/135 sentences (91%) acceptable; 14/20 reports needed fewer than two corrections; 15 of 25 edits were

Including prior-study and Comparison text reduces hallucinated temporal comparisons and improves clinical metrics.

Numbers: MAIRA-2 % Comparison mentions 85.6%; "Infer: No Prior" drops to 72.9% (Table D.9)

Providing lateral view and Technique reduces spurious lateral mentions and improves detection of lateral-dependent pathologies.

Numbers: MAIRA-2 % Lateral mentions 39.6%; "Infer: No Lat" 13.2%; pleural effusion F1 drops from 71.4 to 64.7 without the lateral view (text)

Results

ROUGE-L (MIMIC-CXR findings generation)

Value: 38.4 [37.9, 39.0] (median, 95% CI)

Baseline: MAIRA-1 ROUGE-L 28.9

RadFact logical precision / recall (MIMIC-CXR)

Value: 52.9% / 48.2% (median)

Grounding precision / recall (GR-Bench)

Value: 68.8% / 90.6% (median)

RadFact spatial precision (GR-Bench)

Value: 33.5% (median)

RadFact logical precision / recall (IU-Xray held-out)

Value: 71.4% / 67.6% (median)
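
The results above are reported as a median, in one case with a 95% confidence interval. This summary does not say how those intervals were computed. One common approach, shown here purely as an assumption, is a percentile bootstrap over per-study scores: resample the test set with replacement, recompute the mean score each time, and report the median and the 2.5th and 97.5th percentiles of the resampled means. The function name, the default of 500 resamples, and the seed are all illustrative choices.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_median_ci(
    scores: Sequence[float],
    n_resamples: int = 500,   # illustrative default, not taken from the source
    alpha: float = 0.05,      # 0.05 gives a 95% interval
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (median, lower, upper) of the bootstrap distribution of the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Each replicate: draw n scores with replacement and average them.
    replicate_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Percentile interval read directly off the sorted replicates.
    lower = replicate_means[int((alpha / 2) * (n_resamples - 1))]
    upper = replicate_means[int((1 - alpha / 2) * (n_resamples - 1))]
    return statistics.median(replicate_means), lower, upper
```

Applied to per-report metric values from two model variants, intervals like these give a rough sense of whether an ablation difference is larger than resampling noise.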

Who Should Care

What To Try In 7 Days

Run RadFact on your report-generation models to get sentence-level factuality signals.

If you have priors/lateral images, add them to model prompts and measure changes using ablations.

Collect a small grounded subset (100–500 studies) and test bounding-box-assisted review to estimate reviewer time savings.

Optimization Features

Token Efficiency

  • discretised box coordinate tokens (separate x/y tokens, N=100)
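
A minimal sketch of how box coordinates might be discretised into separate x and y tokens with N=100 bins. Only the bin count and the x/y separation come from the summary above. The token spelling (`<x25>`, `<y50>`), the corner order, and decoding to bin centres are assumptions made for illustration.

```python
from typing import List, Tuple

N_BINS = 100  # number of bins per axis, as stated in the summary

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def coord_to_bin(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalised coordinate in [0, 1] to an integer bin in [0, n_bins - 1]."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * n_bins), n_bins - 1)


def box_to_tokens(box: Box, width: float, height: float) -> List[str]:
    """Encode a pixel-space box as four tokens (hypothetical token names)."""
    x_min, y_min, x_max, y_max = box
    return [
        f"<x{coord_to_bin(x_min / width)}>",
        f"<y{coord_to_bin(y_min / height)}>",
        f"<x{coord_to_bin(x_max / width)}>",
        f"<y{coord_to_bin(y_max / height)}>",
    ]


def tokens_to_box(tokens: List[str], width: float, height: float,
                  n_bins: int = N_BINS) -> Box:
    """Decode tokens back to pixel coordinates using bin centres."""
    bins = [int(t[2:-1]) for t in tokens]  # strip "<x" or "<y" and ">"
    sizes = [width, height, width, height]
    x_min, y_min, x_max, y_max = (
        (b + 0.5) / n_bins * s for b, s in zip(bins, sizes)
    )
    return (x_min, y_min, x_max, y_max)
```

With 100 bins per axis, each box costs four tokens and the round trip loses at most one bin width of precision, which is a small cost next to the low spatial precision reported for the boxes themselves.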

System Optimization

  • linear RoPE scaling to extend LLM context for multiple images
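
Linear RoPE scaling (also known as position interpolation) divides each position index by a constant factor before computing the rotary angles, so that a longer sequence maps into the position range seen during pre-training. The sketch below shows the idea in plain Python. The scale value, the base of 10000, and the pairing of adjacent dimensions are assumptions; the summary does not state the settings MAIRA-2 uses.

```python
import math
from typing import List


def rope_angles(position: int, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> List[float]:
    """Rotary angles for one position; scale > 1 compresses positions linearly."""
    scaled_position = position / scale
    return [
        scaled_position * base ** (-2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def apply_rope(vec: List[float], position: int,
               base: float = 10000.0, scale: float = 1.0) -> List[float]:
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the position's angles."""
    out: List[float] = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

With `scale=2.0`, position 8 receives the same angles as position 4 without scaling, so a context twice as long fits into the original position range. That extra room is what lets a prompt hold tokens from several images (for example frontal, lateral, and prior) alongside the report text.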

Training Optimization

  • multitask training on grounded and non-grounded examples
  • frozen image encoder with trained adapter and LLM fine-tuning

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • RadFact uses strict entailment and penalises partial descriptions; it does not weight clinical severity.
  • Bounding boxes are often imprecise (low spatial precision), limiting standalone automated use.
  • Grounded datasets are incomplete: GR-Bench lacks priors and PadChest-GR lacks report sections, limiting some analyses.
  • Qualitative review was conducted mainly by a single radiologist, which limits generalisability.
  • USMix contains private data; full reproduction needs public grounded datasets or private access.

When Not To Use

  • Do not use MAIRA-2 outputs for autonomous clinical sign-off.
  • Avoid deploying without prior/lateral/technique inputs when those inputs matter for decisions.
  • Not suitable where exact box accuracy is required for intervention planning.

Failure Modes

  • Hallucinated temporal comparisons when prior is missing or not provided.
  • Spurious lateral mentions when lateral view or technique is absent.
  • Missed subtle findings (small effusion, small fractures, early consolidation).
  • Boxes associated with incorrect sentences (correct box, wrong finding) or boxes too large.

Core Entities

Models

  • MAIRA-2
  • MAIRA-1
  • Rad-DINO-MAIRA-2
  • Vicuna-7B-v1.5
  • Llama3-70B-Instruct
  • RadFact-Llama3

Metrics

  • RadFact (logical precision/recall, grounding/spatial)
  • RadGraph-F1
  • CheXbert / CheXpert F1
  • RadCliQ
  • ROUGE-L
  • BLEU-4

Datasets

  • MIMIC-CXR
  • PadChest
  • USMix
  • IU-Xray
  • GR-Bench
  • PadChest-GR
  • MS-CXR

Benchmarks

  • MIMIC-CXR findings generation
  • Grounded reporting (GR-Bench, PadChest-GR)

Context Entities

Models

  • GPT-4 (used for some preprocessing and sentence extraction)
  • Meta-Llama3-70B-Instruct (RadFact backbone)

Datasets

  • PadChest-GR (concurrently developed grounded dataset)
  • USMix (private dataset used for GR-Bench)