Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Per-claim fact checking makes medical LLM outputs more trustworthy and traceable, reducing clinical risk and improving compliance with guideline-driven standards while enabling on-prem deployment with smaller models.
Summary TLDR
The authors introduce an "atomic fact-checking" pipeline that breaks long LLM answers into small verifiable facts, retrieves guideline passages for each fact, labels each fact as TRUE or FALSE, rewrites the false ones, and repeats the check up to three times. Evaluated on three custom medical Q&A sets and the AMEGA benchmark, the pipeline improved final answer quality (for example, +20% on the validation set, +12% on the test set, and +40% on the tumor-board cases), improved explainability by linking each fact to a guideline chunk, and reduced false-positive rewrites after looping. Code and datasets are public.
Problem Statement
LLMs can give plausible but incorrect medical statements (hallucinations) and lack fact-level traceability. Standard RAG grounds whole answers in documents but cannot verify and correct each individual claim. This paper seeks a practical, post-hoc method to detect, correct, and trace each atomic fact in long-form medical answers without model retraining.
Main Contribution
An end-to-end atomic fact-checking pipeline for RAG systems that: splits answers into atomic facts, retrieves evidence per fact, labels veracity, rewrites false facts, and loops corrections up to three times.
A medical guideline vector knowledge base (S-PubMedBERT embeddings + ChromaDB) and concrete prompting strategy (four-shot in-context examples) tuned for veracity detection and rewriting.
Extensive evaluation: multi-reader human annotation on three novel medical Q&A sets plus automated AMEGA benchmark; comparison to Self-Refine and multiple open/closed LLMs.
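The split, retrieve, verify, rewrite loop listed above can be sketched in Python. This is a minimal illustration, not the authors' code: `check_answer`, `CheckedFact`, and the four callables (`split`, `retrieve`, `verify`, `rewrite`) are hypothetical names standing in for the LLM prompts and the vector search. Only the cap of three correction loops comes from the summary; re-retrieving evidence after each rewrite is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_LOOPS = 3  # the summary states corrections repeat up to three times


@dataclass
class CheckedFact:
    text: str            # final wording of the fact (possibly rewritten)
    evidence: List[str]  # guideline chunks retrieved for the final wording
    is_true: bool        # last veracity label
    loops_used: int      # number of rewrite iterations applied


def check_answer(
    answer: str,
    split: Callable[[str], List[str]],          # hypothetical: answer -> atomic facts
    retrieve: Callable[[str], List[str]],       # hypothetical: fact -> evidence chunks
    verify: Callable[[str, List[str]], bool],   # hypothetical: TRUE/FALSE label
    rewrite: Callable[[str, List[str]], str],   # hypothetical: corrected fact
    max_loops: int = MAX_LOOPS,
) -> List[CheckedFact]:
    results = []
    for fact in split(answer):
        loops = 0
        evidence = retrieve(fact)
        ok = verify(fact, evidence)
        # Rewrite and re-check until the fact passes or the loop budget is spent.
        while not ok and loops < max_loops:
            fact = rewrite(fact, evidence)
            loops += 1
            evidence = retrieve(fact)  # assumption: evidence is re-retrieved per rewrite
            ok = verify(fact, evidence)
        results.append(CheckedFact(fact, evidence, ok, loops))
    return results
```

Each returned `CheckedFact` keeps its evidence list, so a reader can trace every claim back to the chunks it was checked against. Facts that still fail after the last loop are returned with `is_true=False` rather than silently dropped.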
Key Findings
Atomic fact-checking improved final answer quality in the hardest (tumor-board) set by 40%.
Looping the detect-and-correct step reduced false-positive rewrites to near zero after three iterations.
Fact-checking significantly improved AMEGA auto-evaluation scores across most LLMs.
Smaller open-source models benefit more from atomic decomposition than from large single-pass self-correction.
Per-fact retrieval using Chain-of-Thought prompts ranked the correct guideline chunk first in 75% of cases and within the top three in 91.9% of cases.
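Top-1 and top-3 retrieval figures like the ones above are usually computed as a hit rate at k. The sketch below shows that calculation on toy data; `hit_rate_at_k` and the chunk IDs are hypothetical, and the summary does not say exactly how the authors scored retrieval.

```python
from typing import List, Sequence


def hit_rate_at_k(ranked_ids: Sequence[List[str]], gold_ids: Sequence[str], k: int) -> float:
    """Fraction of facts whose gold chunk appears in the first k retrieved chunks."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("need at least one fact to score")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy example: four facts, each with a ranked list of retrieved chunk IDs.
ranked = [
    ["c1", "c7", "c3"],  # gold chunk ranked first
    ["c4", "c2", "c9"],  # gold chunk ranked second
    ["c8", "c6", "c5"],  # gold chunk ranked third
    ["c0", "c1", "c2"],  # gold chunk not retrieved
]
gold = ["c1", "c2", "c5", "c9"]

top1 = hit_rate_at_k(ranked, gold, k=1)
top3 = hit_rate_at_k(ranked, gold, k=3)
```

Comparing the k=1 and k=3 values shows how much a verifier gains from seeing several candidate chunks instead of only the best match.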
Results
Accuracy
Overall answer quality improved
AMEGA score change (examples)
Who Should Care
What To Try In 7 Days
Build a small guideline vector DB (PDF→text→chunks→embeddings) and serve with ChromaDB.
Run one LLM (even a small open-source model) to split sample answers into atomic facts and retrieve evidence per fact.
Implement a single-loop fact-veracity check and rewrite step; measure improvement on 20–50 representative clinical Q&A cases.
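The chunking and per-fact retrieval steps above can be prototyped before any model or database is installed. The sketch below is a standard-library stand-in: a bag-of-words cosine similarity replaces the S-PubMedBERT embeddings, and an in-memory list replaces ChromaDB. The function names (`chunk_text`, `embed`, `cosine`, `top_k`) and the chunk sizes are all illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into word windows of `size` words that overlap by `overlap` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    chunks, i, step = [], 0, size - overlap
    while i < len(words):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):  # last window reached the end of the text
            break
        i += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(fact: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to one atomic fact."""
    query = embed(fact)
    return sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)[:k]
```

Swapping `embed` for a real sentence encoder and the list for a vector store leaves the calling code unchanged, which makes it easy to compare retrieval quality on the same 20–50 cases.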
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Small curated evaluation: 175 total Q&As limit broad generalization across specialties.
- Pipeline relies on multiple LLM generations; errors can propagate between steps.
- Token usage ~10× higher than single-pass RAG, increasing cost and latency.
- Knowledge base limited to provided oncology guidelines; not real-time or web-updated.
When Not To Use
- When latency or API cost is tightly constrained (token use is ~10× higher).
- For tasks requiring up-to-the-minute web knowledge not present in local guidelines.
- When you cannot curate or trust an authoritative guideline corpus for the domain.
Failure Modes
- Retrieval of less relevant chunks causes rewritten answers to worsen (observed in 4–8% of cases).
- LLM errors in atomic splitting or rewriting can introduce new hallucinations.
- Over-reliance on a limited guideline set misses domain variants or local practice differences.
Core Entities
Models
- GPT-4o (gpt-4o-2024-11-20)
- GPT-4o-mini
- MedGemma 27B
- Gemma 3 27B
- Llama 3 70B
- OpenBioLLM 70B
- Qwen 3 32B
- Qwen 3 Medical 32B
- Llama 3.2 3B
- Mistral 24B
Metrics
- sensitivity
- specificity
- precision (PPV)
- F1-score
- Accuracy
- overall answer quality (improved/equal/worse)
- hallucination detection rate
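The classification metrics listed above all follow from one confusion matrix over fact labels. The sketch below uses the standard definitions; `veracity_metrics` is a hypothetical helper, and treating "fact flagged as FALSE" as the positive class is an assumption, since the summary does not state which class the authors treat as positive.

```python
from typing import Dict


def veracity_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary metrics; positive class assumed to be 'flagged as FALSE'."""

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),            # recall on the positive class
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),              # positive predictive value (PPV)
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy counts, not taken from the paper.
scores = veracity_metrics(tp=3, fp=1, tn=5, fn=1)
```

Reporting sensitivity and specificity alongside accuracy matters here because most atomic facts are typically true, so accuracy alone can look high even when few false facts are caught.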
Datasets
- Validation Q&A (50, prostate)
- Test Q&A (60, 30 prostate + 30 breast)
- Tumor board cases (80 anonymized)
- Neurology Q&A (external)
- AMEGA benchmark (auto-eval, 137 Qs across 20 domains)
Benchmarks
- AMEGA

