MAIRA-2: a multimodal chest X‑ray model that generates grounded findings and RadFact, an LLM-based sentence-level evaluator

June 6, 2024 (8 min read)

Overview

Decision Snapshot: Needs Validation

MAIRA-2 and RadFact are practically useful for draft generation and model evaluation; however, sentence-level factual errors and missed findings mean human review is required before clinical use.

Citations: 13

Evidence Strength: 0.75

Confidence: 0.90

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 40%

Novelty: 60%

Authors

Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland

Why It Matters For Business

MAIRA-2 produces editable, locally grounded draft radiology findings, and RadFact provides an LLM-based evaluator for them. Together they can reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.

Summary TLDR

MAIRA-2 is a chest X‑ray multimodal system that (1) generates the full Findings text and (optionally) a bounding box for each finding (grounded reporting), (2) integrates richer report context (lateral view, prior studies, Indication, Technique, Comparison), and (3) introduces RadFact, an LLM-based sentence-level entailment evaluator for factuality and grounding. MAIRA-2 sets a new state-of-the-art on MIMIC-CXR findings generation and demonstrates feasible grounded reporting, but RadFact and clinician review show sizable remaining error rates and missed findings, so outputs are currently useful as draft reports requiring human review.

Problem Statement

Automatically drafted radiology Findings must be both clinically correct and verifiable. Existing systems hallucinate, omit findings, or lack ways to localise and verify each reported observation. The paper aims to (a) add grounding (per-sentence bounding boxes) to report generation and (b) create an evaluation method that checks factuality at the sentence level without hard-coded finding classes.

Main Contribution

Define grounded radiology reporting: each finding sentence can include bounding box(es) locating that finding on the image.

Introduce RadFact: an LLM-based sentence-level entailment framework to score factual precision and recall and to evaluate grounding.
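
The precision/recall idea behind a sentence-level entailment evaluator can be sketched as follows. This is a minimal illustration, not RadFact's implementation: `is_entailed` is a hypothetical stand-in for the LLM entailment call (here a naive exact-match check), and all function names are assumptions made for this example.

```python
from typing import Callable, List, Tuple


# Hypothetical stand-in for an LLM entailment judge: returns True when
# `hypothesis` is supported by `premises`. A real evaluator would query an LLM;
# this toy version only checks case-insensitive exact matches.
def is_entailed(hypothesis: str, premises: List[str]) -> bool:
    normalised = {p.strip().lower() for p in premises}
    return hypothesis.strip().lower() in normalised


def logical_precision_recall(
    generated: List[str],
    reference: List[str],
    judge: Callable[[str, List[str]], bool] = is_entailed,
) -> Tuple[float, float]:
    """Precision: share of generated sentences entailed by the reference report.
    Recall: share of reference sentences entailed by the generated report."""
    precision = (
        sum(judge(s, reference) for s in generated) / len(generated)
        if generated else 0.0
    )
    recall = (
        sum(judge(s, generated) for s in reference) / len(reference)
        if reference else 0.0
    )
    return precision, recall


generated = ["No pleural effusion.", "Cardiomegaly is present."]
reference = ["No pleural effusion.", "Left lower lobe opacity.", "No pneumothorax."]
precision, recall = logical_precision_recall(generated, reference)
```

In this toy case one of two generated sentences is supported by the reference, and one of three reference sentences is covered by the generated report. Swapping `judge` for a model-backed function is where an actual evaluator would differ.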

Key Findings

MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.

Numbers: ROUGE-L 38.4; BLEU-4 23.1; RadGraph-F1 34.6 (Table D.1)

Practical Use: Expect better word-level matches and improved clinical label scores when using MAIRA-2 outputs as a draft compared to prior models.

Evidence Ref: Results, Table D.1

RadFact shows MAIRA-2 still makes many sentence-level factual errors on MIMIC-CXR.

Numbers: RadFact logical precision 52.9%, logical recall 48.2% on MIMIC-CXR (Table D.1)

Practical Use: Treat outputs as first drafts that need expert verification; do not use for unsupervised clinical sign-off.

Evidence Ref: Results, Table D.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ROUGE-L (MIMIC-CXR findings generation) | 38.4 [37.9, 39.0] (median, 95% CI) | MAIRA-1 ROUGE-L 28.9 | +9.5 absolute | MIMIC-CXR test | Table D.1 | Table D.1 |
| RadFact logical precision / recall (MIMIC-CXR) | 52.9% / 48.2% (median) | | | MIMIC-CXR test | Table D.1 | Table D.1 |
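
The "(median, 95% CI)" notation in the results is consistent with bootstrap resampling over test studies. The sketch below shows one common way to obtain such numbers from per-study scores. Whether the paper computes its intervals exactly this way is an assumption here, and the resample count, seed, and toy scores are illustrative only.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_median_ci(
    scores: Sequence[float],
    n_resamples: int = 500,  # assumed value, not taken from the paper
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (median, lower, upper) of the mean score over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample studies with replacement and record the resample's mean score.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.median(means), lower, upper


# Toy per-study metric values, purely illustrative.
scores = [0.2, 0.4, 0.35, 0.5, 0.3, 0.45, 0.25, 0.4]
median, lower, upper = bootstrap_median_ci(scores)
```

Reporting the interval alongside the median makes it easier to judge whether a delta between two models exceeds resampling noise.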

What To Try In 7 Days

Run RadFact on your report-generation models to get sentence-level factuality signals.

If you have priors/lateral images, add them to model prompts and measure changes using ablations.

Collect a small grounded subset (100–500 studies) and test bounding-box-assisted review to estimate reviewer time savings.

Optimization Features

Token Efficiency: discretised box coordinate tokens (separate x/y tokens, N=100)
System Optimization: linear RoPE scaling to extend LLM context for multiple images
Training Optimization: multitask training on grounded and non-grounded examples; frozen image encoder with trained adapter and LLM fine-tuning
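
To make the discretised box coordinate tokens concrete, here is a minimal sketch that quantises a normalised bounding box into separate x and y tokens with N=100 bins per axis. The `<xNN>` / `<yNN>` token spelling, the corner ordering, and the function names are hypothetical; the summary only states that x and y use separate tokens with N=100.

```python
from typing import List, Tuple

N_BINS = 100  # N=100 as stated above; one token per bin and per axis


def box_to_tokens(
    box: Tuple[float, float, float, float], n_bins: int = N_BINS
) -> List[str]:
    """Map a normalised (x_min, y_min, x_max, y_max) box, each value in [0, 1],
    to four discrete tokens. The token spelling is a hypothetical choice."""

    def to_bin(value: float) -> int:
        clipped = min(max(value, 0.0), 1.0)
        # Clamp so that value == 1.0 falls in the last bin rather than bin n_bins.
        return min(int(clipped * n_bins), n_bins - 1)

    x_min, y_min, x_max, y_max = box
    return [
        f"<x{to_bin(x_min)}>", f"<y{to_bin(y_min)}>",
        f"<x{to_bin(x_max)}>", f"<y{to_bin(y_max)}>",
    ]


def tokens_to_box(tokens: List[str], n_bins: int = N_BINS) -> Tuple[float, ...]:
    """Decode tokens back to bin-centre coordinates (lossy by up to half a bin)."""
    return tuple((int(t[2:-1]) + 0.5) / n_bins for t in tokens)


original = (0.125, 0.25, 0.5, 1.0)
tokens = box_to_tokens(original)
decoded = tokens_to_box(tokens)
```

With 100 bins per axis, the round trip moves each coordinate by at most 0.005 in normalised units, which bounds the localisation error introduced by tokenisation alone.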

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

RadFact uses strict entailment and penalises partial descriptions; it does not weight clinical severity.

Bounding boxes are often imprecise (low spatial precision), limiting standalone automated use.
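
One simple way to quantify how loose a predicted box is: measure the fraction of its pixels that fall inside any reference box. The sketch below is an illustration under that assumption and is not necessarily the spatial criterion RadFact applies; the `Box` convention (integer pixels, exclusive maximum) and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# (x_min, y_min, x_max, y_max) in integer pixels, with the max edge exclusive.
Box = Tuple[int, int, int, int]


def box_pixels(boxes: Iterable[Box]) -> Set[Tuple[int, int]]:
    """Rasterise boxes into the set of pixel coordinates they cover (their union)."""
    return {
        (x, y)
        for x_min, y_min, x_max, y_max in boxes
        for x in range(x_min, x_max)
        for y in range(y_min, y_max)
    }


def pixel_precision(predicted: Iterable[Box], reference: Iterable[Box]) -> float:
    """Fraction of predicted-box pixels that also lie inside a reference box."""
    pred = box_pixels(predicted)
    ref = box_pixels(reference)
    return len(pred & ref) / len(pred) if pred else 0.0


# A 10x10 predicted box shifted half its width away from the reference box.
score = pixel_precision(predicted=[(0, 0, 10, 10)], reference=[(5, 0, 15, 10)])
```

A low score for a sentence that is otherwise factually correct signals that the box needs manual adjustment, even though the finding itself is right. Set-based rasterisation is only practical for small images; a mask array would be the usual choice at full resolution.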

When Not To Use

Do not use MAIRA-2 outputs for autonomous clinical sign-off.

Avoid deploying without prior/lateral/technique inputs when those inputs matter for decisions.

Failure Modes

Hallucinated temporal comparisons when prior is missing or not provided.

Spurious lateral mentions when lateral view or technique is absent.

Core Entities

Models

MAIRA-2, MAIRA-1, Rad-DINO-MAIRA-2, Vicuna-7B-v1.5, Llama3-70B-Instruct, RadFact-Llama3

Metrics

RadFact (logical precision/recall, grounding/spatial); RadGraph-F1; CheXbert / CheXpert F1; RadCliQ; ROUGE-L; BLEU-4

Datasets

MIMIC-CXR, PadChest, USMix, IU-Xray, GR-Bench, PadChest-GR, MS-CXR

Benchmarks

MIMIC-CXR findings generation; Grounded reporting (GR-Bench, PadChest-GR)

Context Entities

Models

GPT-4 (used for some preprocessing and sentence extraction); Meta-Llama3-70B-Instruct (RadFact backbone)

Datasets

PadChest-GR (concurrently developed grounded dataset); USMix (private dataset used for GR-Bench)