Overview
Practical and reproducible: runs on standard LLM APIs and an embedded guideline DB, but costs and latency rise (≈10× token use) and results depend on retrieval quality.
Citations0
Evidence Strength0.80
Confidence0.85
Risk Signals10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Per-claim fact checking makes medical LLM outputs more trustworthy and traceable, reducing clinical risk and improving compliance with guideline-driven standards while enabling on-prem deployment with smaller models.
Who Should Care
Summary TLDR
The authors introduce an "atomic fact-checking" pipeline that breaks long LLM answers into small verifiable facts, retrieves guideline passages per fact, labels each fact as TRUE/FALSE, rewrites false facts, and repeats up to three times. Evaluated on three custom medical Q&A sets and the AMEGA benchmark, the pipeline improved final answer quality (examples: +20% validation, +12% test, +40% tumor-board), raised explainability by linking facts to guideline chunks, and reduced false positive corrected facts after looping. Code and datasets are public.
Problem Statement
LLMs can give plausible but incorrect medical statements (hallucinations) and lack fact-level traceability. Standard RAG grounds whole answers in documents but cannot verify and correct each individual claim. This paper seeks a practical, post-hoc method to detect, correct, and trace each atomic fact in long-form medical answers without model retraining.
Main Contribution
An end-to-end atomic fact-checking pipeline for RAG systems that: splits answers into atomic facts, retrieves evidence per fact, labels veracity, rewrites false facts, and loops corrections up to three times.
A medical guideline vector knowledge base (S-PubMedBERT embeddings + ChromaDB) and concrete prompting strategy (four-shot in-context examples) tuned for veracity detection and rewriting.
Key Findings
Atomic fact-checking improved final answer quality in the hardest (tumor-board) set by 40%.
Detecting and correcting facts through looping reduced false positive rewritten facts to near zero after three iterations.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 87.3% | — | — | Validation (50 Q&A) | Table 1 (human evaluation) | Table 1 |
| Accuracy | 73.7% | — | — | Test (60 Q&A) | Table 1 (human evaluation) | Table 1 |
What To Try In 7 Days
Build a small guideline vector DB (PDF→text→chunks→embeddings) and serve with ChromaDB.
Run one LLM (even a small open-source model) to split sample answers into atomic facts and retrieve evidence per fact.
Implement a single-loop fact-veracity check and rewrite step; measure improvement on 20–50 representative clinical Q&A cases.
Reproducibility
Risks & Boundaries
Limitations
Small curated evaluation: 175 total Q&As limit broad generalization across specialties.
Pipeline relies on multiple LLM generations; errors can propagate between steps.
When Not To Use
When latency or API cost is tightly constrained (token use is ~10× higher).
For tasks requiring up-to-the-minute web knowledge not present in local guidelines.
Failure Modes
Retrieval of less relevant chunks causes rewritten answers to worsen (observed in 4–8% cases).
LLM errors in atomic splitting or rewriting can introduce new hallucinations.

