Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.35
Citation Count
13
Why It Matters For Business
MAIRA-2 can produce editable draft radiology findings, each optionally localised on the image, and its companion LLM-based evaluator (RadFact) scores their factuality. Together they can reduce reviewer effort and support rapid prototyping of clinical draft-assist tools, but human oversight remains mandatory.
Summary TLDR
MAIRA-2 is a chest X‑ray multimodal system that (1) generates the full Findings text and (optionally) a bounding box for each finding (grounded reporting), (2) integrates richer report context (lateral view, prior studies, Indication, Technique, Comparison), and (3) introduces RadFact, an LLM-based sentence-level entailment evaluator for factuality and grounding. MAIRA-2 sets a new state-of-the-art on MIMIC-CXR findings generation and demonstrates feasible grounded reporting, but RadFact and clinician review show sizable remaining error rates and missed findings, so outputs are currently useful as draft reports requiring human review.
Problem Statement
Automatically drafted radiology Findings must be both clinically correct and verifiable. Existing systems hallucinate, omit findings, or lack ways to localise and verify each reported observation. The paper aims to (a) add grounding (per-sentence bounding boxes) to report generation and (b) create an evaluation method that checks factuality at the sentence level without hard-coded finding classes.
Main Contribution
Define grounded radiology reporting: each finding sentence can include bounding box(es) locating that finding on the image.
Introduce RadFact: an LLM-based sentence-level entailment framework to score factual precision and recall and to evaluate grounding.
Develop MAIRA-2: a chest X‑ray specialised multimodal model (Vicuna 7B LLM + Rad-DINO-MAIRA-2 image encoder) that generates grounded and non-grounded reports.
Show MAIRA-2 is state-of-the-art on public findings benchmarks (MIMIC-CXR) and provide extensive ablations on the value of priors, lateral views, and report sections.
Release RadFact code and the grounded-report annotation protocol to support future work.
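The sentence-level scoring idea behind RadFact can be sketched in a few lines. This is a minimal illustration, not the released RadFact code: it assumes logical precision is the share of generated sentences entailed by the reference report and logical recall is the share of reference sentences entailed by the generated report. The `EntailmentJudge` signature and the `exact_match_judge` stand-in are hypothetical; in practice the judge would be an LLM call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (hypothesis sentence, premise sentences) -> entailed?
EntailmentJudge = Callable[[str, Sequence[str]], bool]


def entailed_fraction(
    hypotheses: Sequence[str], premises: Sequence[str], judge: EntailmentJudge
) -> float:
    """Share of hypothesis sentences that the judge marks as entailed by the premises."""
    if not hypotheses:
        return 0.0  # assumption: an empty report scores 0 rather than raising
    hits = sum(1 for sentence in hypotheses if judge(sentence, premises))
    return hits / len(hypotheses)


def logical_precision_recall(
    generated: Sequence[str], reference: Sequence[str], judge: EntailmentJudge
) -> Tuple[float, float]:
    """Precision checks generated against reference; recall checks the reverse."""
    precision = entailed_fraction(generated, reference, judge)
    recall = entailed_fraction(reference, generated, judge)
    return precision, recall


def exact_match_judge(hypothesis: str, premises: Sequence[str]) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive exact sentence match only."""
    return hypothesis.strip().lower() in {p.strip().lower() for p in premises}


generated = ["No pleural effusion.", "Cardiomegaly is present."]
reference = ["No pleural effusion.", "The lungs are clear.", "No pneumothorax."]
precision, recall = logical_precision_recall(generated, reference, exact_match_judge)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping the judge as a plain callable makes it easy to swap the toy matcher for a real entailment model without touching the scoring logic.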
Key Findings
MAIRA-2 achieves strong lexical and clinical gains on MIMIC-CXR compared to earlier systems.
RadFact shows MAIRA-2 still makes many sentence-level factual errors on MIMIC-CXR.
MAIRA-2 can produce grounded reports with good grounding coverage but modest box accuracy.
MAIRA-2 generalises reasonably to an unseen dataset (IU-Xray), with higher sentence-level entailment scores than on MIMIC-CXR.
Human expert review finds most generated sentences acceptable, but missed findings remain the dominant error.
Including prior-study and Comparison text reduces hallucinated temporal comparisons and improves clinical metrics.
Providing lateral view and Technique reduces spurious lateral mentions and improves detection of lateral-dependent pathologies.
Results
ROUGE-L (MIMIC-CXR findings generation)
RadFact logical precision / recall (MIMIC-CXR)
Grounding precision / recall (GR-Bench)
RadFact spatial precision (GR-Bench)
RadFact logical precision / recall (IU-Xray held-out)
Who Should Care
What To Try In 7 Days
Run RadFact on your report-generation models to get sentence-level factuality signals.
If you have priors/lateral images, add them to model prompts and measure changes using ablations.
Collect a small grounded subset (100–500 studies) and test bounding-box-assisted review to estimate reviewer time savings.
Optimization Features
Token Efficiency
- discretised box coordinate tokens (separate x/y tokens, N=100)
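A minimal sketch of discretised box tokens, assuming coordinates are normalised by image size and split into N=100 bins per axis. The token spelling (`<x12>`, `<y25>`) and the decode-to-bin-centre rule are illustrative assumptions, not the model's actual vocabulary.

```python
from typing import List, Sequence, Tuple

N_BINS = 100  # N=100 bins per axis, as stated above


def quantise(value: float, n_bins: int = N_BINS) -> int:
    """Map a normalised coordinate in [0, 1] to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * n_bins), n_bins - 1)


def box_to_tokens(
    box: Tuple[float, float, float, float], width: float, height: float
) -> List[str]:
    """Encode a pixel box (x_min, y_min, x_max, y_max) as separate x and y tokens."""
    x_min, y_min, x_max, y_max = box
    return [
        f"<x{quantise(x_min / width)}>",
        f"<y{quantise(y_min / height)}>",
        f"<x{quantise(x_max / width)}>",
        f"<y{quantise(y_max / height)}>",
    ]


def tokens_to_box(
    tokens: Sequence[str], width: float, height: float, n_bins: int = N_BINS
) -> Tuple[float, ...]:
    """Decode tokens back to pixel coordinates at the centre of each bin."""
    sizes = (width, height, width, height)
    return tuple(
        (int(token[2:-1]) + 0.5) / n_bins * size for token, size in zip(tokens, sizes)
    )


tokens = box_to_tokens((128, 256, 512, 768), width=1024, height=1024)
print(tokens)
print(tokens_to_box(tokens, width=1024, height=1024))
```

With 100 bins per axis, each box costs four tokens, and the round trip is accurate to within one bin width (about 1% of the image side).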
System Optimization
- linear RoPE scaling to extend LLM context for multiple images
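Linear RoPE scaling (position interpolation) can be illustrated as follows: positions are divided by a scale factor before the rotary angles are computed, so a longer sequence maps back into the position range seen during pre-training. This is a generic sketch; the base of 10000, the scale factor of 2.0, and the adjacent-pair rotation layout are assumptions for illustration and are not taken from the MAIRA-2 configuration.

```python
import math
from typing import List, Sequence


def rope_angles(
    position: float, head_dim: int, base: float = 10000.0, scale: float = 1.0
) -> List[float]:
    """Rotary angles for one position; `scale` > 1 compresses positions linearly."""
    scaled_position = position / scale
    return [scaled_position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(
    vector: Sequence[float], position: float, base: float = 10000.0, scale: float = 1.0
) -> List[float]:
    """Rotate adjacent pairs of dimensions by the position-dependent angles."""
    rotated: List[float] = []
    for i, angle in enumerate(rope_angles(position, len(vector), base, scale)):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated += [
            x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle),
        ]
    return rotated


# With scale=2.0, position 8192 receives the same angles as position 4096 unscaled.
print(rope_angles(8192, head_dim=8, scale=2.0) == rope_angles(4096, head_dim=8))
```

The trade-off is resolution: neighbouring tokens end up closer together in angle space, which is why this kind of scaling is usually paired with fine-tuning.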
Training Optimization
- multitask training on grounded and non-grounded examples
- frozen image encoder with trained adapter and LLM fine-tuning
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- RadFact uses strict entailment and penalises partial descriptions; it does not weight clinical severity.
- Bounding boxes are often imprecise (low spatial precision), limiting standalone automated use.
- Grounded datasets are incomplete: GR-Bench lacks priors and PadChest-GR lacks report sections, limiting some analyses.
- Qualitative review was conducted mainly by a single radiologist, reducing generalisability.
- USMix contains private data; full reproduction needs public grounded datasets or private access.
When Not To Use
- Do not use MAIRA-2 outputs for autonomous clinical sign-off.
- Avoid deploying without prior/lateral/technique inputs when those inputs matter for decisions.
- Not suitable where exact box accuracy is required for intervention planning.
Failure Modes
- Hallucinated temporal comparisons when no prior study is provided.
- Spurious lateral mentions when lateral view or technique is absent.
- Missed subtle findings (small effusion, small fractures, early consolidation).
- Boxes attached to the wrong sentence (correct box, wrong finding) or boxes that are too large.
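Oversized boxes can be flagged with a simple geometric check during review. The sketch below computes intersection-over-union and the fraction of a predicted box that falls inside a reference box; a low covered fraction suggests the prediction is too large or misplaced. This is an illustrative helper, not RadFact's spatial entailment criterion, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if they do not overlap)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def covered_fraction(predicted: Box, reference: Box) -> float:
    """Share of the predicted box's area that lies inside the reference box."""
    predicted_area = area(predicted)
    return intersection_area(predicted, reference) / predicted_area if predicted_area > 0 else 0.0


predicted = (0.0, 0.0, 10.0, 10.0)
reference = (0.0, 0.0, 5.0, 5.0)
print(iou(predicted, reference), covered_fraction(predicted, reference))
```

Here the predicted box is four times the size of the reference, so only a quarter of it is covered even though it fully contains the finding.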
Core Entities
Models
- MAIRA-2
- MAIRA-1
- Rad-DINO-MAIRA-2
- Vicuna-7B-v1.5
- Llama3-70B-Instruct
- RadFact-Llama3
Metrics
- RadFact (logical precision/recall, grounding/spatial)
- RadGraph-F1
- CheXbert / CheXpert F1
- RadCliQ
- ROUGE-L
- BLEU-4
Datasets
- MIMIC-CXR
- PadChest
- USMix
- IU-Xray
- GR-Bench
- PadChest-GR
- MS-CXR
Benchmarks
- MIMIC-CXR findings generation
- Grounded reporting (GR-Bench, PadChest-GR)
Context Entities
Models
- GPT-4 (used for some preprocessing and sentence extraction)
- Meta-Llama3-70B-Instruct (RadFact backbone)
Datasets
- PadChest-GR (concurrently developed grounded dataset)
- USMix (private dataset used for GR-Bench)

