Atomic fact-checking for medical RAG LLMs boosts factuality and traceability

May 30, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Juraj Vladika, Annika Domres, Mai Nguyen, Rebecca Moser, Jana Nano, Felix Busch, Lisa C. Adams, Keno K. Bressem, Denise Bernhardt, Stephanie E. Combs, Kai J. Borm, Florian Matthes, Jan C. Peeken

Links

Abstract / PDF

Why It Matters For Business

Per-claim fact checking makes medical LLM outputs more trustworthy and traceable, reducing clinical risk and improving compliance with guideline-driven standards while enabling on-prem deployment with smaller models.

Summary TLDR

The authors introduce an "atomic fact-checking" pipeline that breaks long LLM answers into small verifiable facts, retrieves guideline passages for each fact, labels each fact TRUE or FALSE, rewrites the false ones, and repeats the check up to three times. Evaluated on three custom medical Q&A sets and the AMEGA benchmark, the pipeline improved final answer quality (+20% on validation, +12% on test, +40% on tumor-board cases), made answers more explainable by linking each fact to guideline chunks, and reduced false-positive rewrites with each loop. Code and datasets are public.

Problem Statement

LLMs can give plausible but incorrect medical statements (hallucinations) and lack fact-level traceability. Standard RAG grounds whole answers in documents but cannot verify and correct each individual claim. This paper seeks a practical, post-hoc method to detect, correct, and trace each atomic fact in long-form medical answers without model retraining.

Main Contribution

An end-to-end atomic fact-checking pipeline for RAG systems that: splits answers into atomic facts, retrieves evidence per fact, labels veracity, rewrites false facts, and loops corrections up to three times.
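The steps listed above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the four callables (`split`, `retrieve`, `verify`, `rewrite`) are hypothetical stand-ins for the LLM and retrieval calls, and looping per fact (rather than per answer) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckedFact:
    text: str            # final wording of the fact (possibly rewritten)
    evidence: List[str]  # guideline chunks retrieved for this fact
    is_true: bool        # last veracity label


def fact_check_answer(
    answer: str,
    split: Callable[[str], List[str]],          # hypothetical: answer -> atomic facts
    retrieve: Callable[[str], List[str]],       # hypothetical: fact -> evidence chunks
    verify: Callable[[str, List[str]], bool],   # hypothetical: TRUE/FALSE label
    rewrite: Callable[[str, List[str]], str],   # hypothetical: corrected fact
    max_loops: int = 3,                         # "up to three times" in the summary
) -> List[CheckedFact]:
    checked = []
    for fact in split(answer):
        evidence = retrieve(fact)
        ok = verify(fact, evidence)
        loops = 0
        # Rewrite and re-check a false fact until it passes or the budget is spent.
        while not ok and loops < max_loops:
            fact = rewrite(fact, evidence)
            evidence = retrieve(fact)
            ok = verify(fact, evidence)
            loops += 1
        checked.append(CheckedFact(fact, evidence, ok))
    return checked
```

Keeping each fact paired with its retrieved evidence is what gives the fact-level traceability described in this section; facts still labeled false after the last loop stay visible through `is_true`.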

A medical guideline vector knowledge base (S-PubMedBERT embeddings + ChromaDB) and a concrete prompting strategy (four-shot in-context examples) tuned for veracity detection and rewriting.

Extensive evaluation: multi-reader human annotation on three novel medical Q&A sets plus the automated AMEGA benchmark; comparison to Self-Refine across multiple open and closed LLMs.

Key Findings

Atomic fact-checking improved final answer quality in the hardest (tumor-board) set by 40%.

Numbers: Overall improved answers: validation 20%, test 12%, tumor-board 40%

Detecting and correcting facts through looping reduced false positive rewritten facts to near zero after three iterations.

Numbers: Average false-positive rate fell from 2% to 1% to 0% across three iterations

Fact-checking significantly improved AMEGA auto-evaluation scores across most LLMs.

Numbers: AMEGA: p<0.001 for 8/11 models, p<0.01 for 2/11, p<0.05 for 1/11

Smaller open-source models benefit more from atomic decomposition than from large single-pass self-correction.

Numbers: Fact-checking gave larger AMEGA gains for smaller models (e.g., Llama 3.2 3B +4.3) vs. the Self-Refine baseline

Per-fact retrieval using Chain-of-Thought prompts ranked the correct guideline chunk first in 75% of cases and within the top three in 91.9%.

Numbers: CoT top-1 75%, top-3 91.9%
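Top-1 and top-3 figures like these can be computed from ranked retrieval results. The following is a minimal sketch using plain cosine similarity over toy vectors; the function names and the chunk IDs are illustrative assumptions, and real embeddings would come from the embedding model rather than hand-written lists.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_chunks(query_vec: Sequence[float],
                chunk_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return chunk IDs sorted from most to least similar to the query."""
    return sorted(chunk_vecs,
                  key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                  reverse=True)


def top_k_accuracy(rankings: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of queries whose gold chunk appears in the first k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy example: three chunks, two fact queries, hand-made 2-D vectors.
chunks = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]}
queries = [[1.0, 0.1], [0.2, 1.0]]
gold_ids = ["c1", "c3"]  # annotated correct chunk for each query
rankings = [rank_chunks(q, chunks) for q in queries]
```

Here the first query ranks its gold chunk first while the second ranks it second, so `top_k_accuracy(rankings, gold_ids, 1)` is lower than `top_k_accuracy(rankings, gold_ids, 2)`.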

Results

Accuracy

Value: 87.3%

Accuracy

Value: 73.7%

Accuracy

Value: 91%

Overall answer quality improved

Value: Validation 20%; Test 12%; Tumor board 40%

Baseline: Initial RAG answers

AMEGA score change (examples)

Value: GPT-4o +1.1, GPT-4o-mini +1.0, MedGemma 27B +2.4, Llama 3.2 3B +4.3

Baseline: Initial RAG on AMEGA

Accuracy

Value: Top-1 75%; Top-3 91.9%

Baseline: Cosine similarity with text embedding

What To Try In 7 Days

Build a small guideline vector DB (PDF→text→chunks→embeddings) and serve with ChromaDB.

Run one LLM (even a small open-source model) to split sample answers into atomic facts and retrieve evidence per fact.

Implement a single-loop fact-veracity check and rewrite step; measure improvement on 20–50 representative clinical Q&A cases.
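For the first step, the text-to-chunks stage can be as simple as a sliding window over words. This is a sketch under stated assumptions: the summary does not give the chunk size or overlap used, so `chunk_words=200` and `overlap=40` are placeholder values, and PDF text extraction and embedding are left out.

```python
from typing import List


def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (sizes are placeholder assumptions)."""
    if chunk_words <= 0 or not 0 <= overlap < chunk_words:
        raise ValueError("need chunk_words > 0 and 0 <= overlap < chunk_words")
    words = text.split()
    step = chunk_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each returned chunk would then be embedded and stored with a stable ID, so that a fact's verdict can point back to the exact guideline passage it was checked against.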

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Small curated evaluation: 175 total Q&As limit broad generalization across specialties.
  • Pipeline relies on multiple LLM generations; errors can propagate between steps.
  • Token usage ~10× higher than single-pass RAG, increasing cost and latency.
  • Knowledge base limited to provided oncology guidelines; not real-time or web-updated.

When Not To Use

  • When latency or API cost is tightly constrained (token use is ~10× higher).
  • For tasks requiring up-to-the-minute web knowledge not present in local guidelines.
  • When you cannot curate or trust an authoritative guideline corpus for the domain.

Failure Modes

  • Retrieval of less relevant chunks causes rewritten answers to worsen (observed in 4–8% of cases).
  • LLM errors in atomic splitting or rewriting can introduce new hallucinations.
  • Over-reliance on a limited guideline set misses domain variants or local practice differences.

Core Entities

Models

  • GPT-4o (gpt-4o-2024-11-20)
  • GPT-4o-mini
  • MedGemma 27B
  • Gemma 3 27B
  • Llama 3 70B
  • OpenBioLLM 70B
  • Qwen 3 32B
  • Qwen 3 Medical 32B
  • Llama 3.2 3B
  • Mistral 24B

Metrics

  • sensitivity
  • specificity
  • precision (PPV)
  • F1-score
  • Accuracy
  • overall answer quality (improved/equal/worse)
  • hallucination detection rate
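The classification metrics in this list all derive from one confusion matrix over fact-level labels. The sketch below shows the standard formulas; which class counts as "positive" (for example, a fact flagged as false) is an assumption to fix before comparing numbers, and the function name is illustrative.

```python
from typing import Dict


def veracity_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary metrics from confusion-matrix counts."""
    def ratio(num: float, den: float) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),           # recall on the positive class
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),             # positive predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

Because most facts in a good answer are true, accuracy alone can look high even when few false facts are caught, which is why sensitivity and precision are listed alongside it.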

Datasets

  • Validation Q&A (50, prostate)
  • Test Q&A (60, 30 prostate + 30 breast)
  • Tumor board cases (80 anonymized)
  • Neurology Q&A (external)
  • AMEGA benchmark (auto-eval, 137 Qs across 20 domains)

Benchmarks

  • AMEGA