Factual Consistency Papers — Parsed & Scored for Practitioners

Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

0.70

0.50

0.60

233

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Key finding

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Practical survey: taxonomy, causes, detection, benchmarks, and fixes for hallucination in LLMs

0.70

0.40

0.60

207

Hallucinations make LLM outputs untrustworthy for decisions or customer-facing answers; mapping causes to fixes helps reduce risk in search, chatbots, and recommendations.

Key finding

The paper redefines hallucination for LLMs into two main types: factuality (real-world fact mismatch) and faithfulness (deviation from instructions or context).

Augment ChatGPT with retrieved evidence and automated feedback to cut hallucinations

0.60

0.55

0.45

144

You can keep using a black-box LLM while reducing harmful hallucinations by adding retrieval, evidence consolidation, and automated feedback—improving factuality with modest engineering instead of costly fine-tuning.

Key finding

Retrieving consolidated evidence raises knowledge grounding (KF1) by about +10 points on news dialog.

Numbers: KF1: 26.71 -> 36.41 (ChatGPT -> LLM-AUGMENTER, News Chat, Table 1)

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on their test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
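The debate idea above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the `Agent` signature, the `debate` function, the toy `fixed` and `follower` agents, and the final majority vote are all assumptions made for this example. A real agent would wrap an LLM call.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent signature: (question, peers' latest answers) -> answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simplified multi-agent debate and return the majority answer."""
    # Round 0: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' answers and may revise its own.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Aggregation by majority vote is an assumption of this sketch.
    return Counter(answers).most_common(1)[0][0]


def fixed(answer: str) -> Agent:
    """Toy agent that never changes its answer."""
    return lambda question, peers: answer


def follower(first_guess: str) -> Agent:
    """Toy agent that adopts the most common peer answer once it sees one."""
    def agent(question: str, peers: List[str]) -> str:
        return Counter(peers).most_common(1)[0][0] if peers else first_guess
    return agent


final = debate("What is 7 + 5?", [fixed("12"), fixed("12"), follower("13")])
```

The cost scales with the number of agents times the number of rounds, which is why the card frames debate as a trade of latency for accuracy.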

Practical survey of what makes LLMs factual, how we test it, and how to fix it

0.60

0.40

0.60

52

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.

Key finding

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies

ChatGPT can judge summary factuality zero‑shot but shows lexical bias, false reasoning, and prompt sensitivity

0.60

0.40

0.70

50

ChatGPT offers a ready-to-use, zero-shot factuality evaluator that can reduce annotation and training costs and often aligns better with human judgments, but it needs calibration for paraphrase-heavy or domain-specific text.

Key finding

ChatGPT (zero-shot + CoT) often matches or beats prior factuality metrics on multiple benchmarks.

Numbers: CoGenSum BA 74.3% vs SummaC ZS 70.4%; SummEval 83.3% vs 78.7% (Table 2)

Survey of how LLMs produce and spread factual errors—and what to do about it

0.40

0.35

0.55

33

LLMs can produce plausible-sounding falsehoods and leak sensitive inputs; unchecked use creates legal, reputational, and operational risk for any organization that relies on automated text.

Key finding

During COVID-era chatbot use, health topics were very common: 30% of 6,594 user-chatbot interactions used the keyword 'COVID-19'.

Numbers: 30% of 6,594 interactions

Human judges prefer LLM summaries; reference summaries often contain more hallucinations.

0.70

0.40

0.60

32

Zero-shot LLMs can produce higher-quality, more factual summaries than many human references and fine-tuned models, so businesses can often deploy LLM summarization directly and shift effort to dataset curation and verification.

Key finding

Human judges prefer LLM summaries over human-written and fine-tuned model summaries in pairwise comparisons.

Numbers: Human preference scores for LLMs exceed 50% across tasks (Figure 4).

Off-the-shelf abstractive models and LLMs score well on matching metrics but still hallucinate in legal judgment summaries

0.30

0.40

0.30

24

Abstractive models and LLMs can speed up drafting legal headnotes but still hallucinate people/dates/courts; use them for triage or first drafts with human review rather than final publication.

Key finding

Domain‑fine‑tuned abstractive models match expert summaries better than extractive models on measured metrics.

Numbers: ROUGE‑2 F1: LegLED‑IN 0.255 vs BertSum 0.2311 (Table 3)

Using a targeted RAG pipeline and curated CMU dataset to reduce LLM hallucinations on domain queries

0.30

0.40

0.50

19

Connecting an LLM to a curated domain knowledge base (RAG) gives measurable factual gains and is a practical first step before costly generator finetuning.

Key finding

Adding RAG boosts retrieval and answer quality over the baseline LLM.

Numbers: Recall 0.361 -> 0.409; F1 0.186 -> 0.289

Pretraining memory and corpus-frequency biases drive much of LLM hallucination on inference

0.30

0.50

0.20

18

LLMs can assert conclusions drawn from their training data or corpus statistics rather than the given context. That puts QA, summarization, and policy extraction at risk of silent misinformation; apply attestation checks and bias-controlled tests before deployment.

Key finding

Attestation (memorized sentence) strongly raises false positive entailments.

Numbers: False Entail chance 1.9x (LLaMA), 2.2x (GPT-3.5), 2.0x (PaLM)

At decode time, subtract earlier-layer logits from later-layer logits to reduce hallucinations.

0.70

0.55

0.15

17

DoLa boosts factual output from large pretrained LMs without retraining or external retrieval, giving immediate, low-cost improvements for truth-sensitive products like QA assistants and chatbots.

Key finding

DoLa raises combined truthfulness×informativeness on open-ended TruthfulQA by about 12–17 absolute percentage points for LLaMA models.

Numbers: 12–17 pp improvement on %Truth*Info across LLaMA sizes (Table 1)
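The layer-contrast idea can be illustrated on a toy three-token vocabulary. This is a simplified sketch under stated assumptions: it contrasts one fixed early layer with the final layer, it subtracts log-softmax scores rather than raw logits, and the `alpha` plausibility cutoff and all function names are hypothetical choices for this example rather than the paper's exact procedure.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrast_layers(late_logits: List[float],
                    early_logits: List[float],
                    alpha: float = 0.1) -> List[float]:
    """Score tokens by how much the late layer raised them over the early layer."""
    late = log_softmax(late_logits)
    early = log_softmax(early_logits)
    # Assumed plausibility filter: drop tokens whose late-layer probability
    # is below alpha times the most likely token's probability.
    cutoff = max(late) + math.log(alpha)
    return [
        l - e if l >= cutoff else float("-inf")
        for l, e in zip(late, early)
    ]


# Toy logits: token 0 is already favored early; token 1 only emerges late.
late_logits = [2.0, 1.9, -3.0]
early_logits = [2.0, 0.5, -3.0]
scores = contrast_layers(late_logits, early_logits)
greedy_token = late_logits.index(max(late_logits))
contrast_token = scores.index(max(scores))
```

In this toy case plain greedy decoding picks token 0, while the contrast picks token 1, the token whose probability grew most between the two layers. The filter prevents very unlikely tokens from winning on the difference alone.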

Use an iterative generate-score-refine loop to cut hallucinated answers from medical LLMs

0.30

0.55

0.25

17

Adding an iterative generate-score-refine step reduces irrelevant and factually inconsistent medical answers, lowering risk and improving trust for AI assistants used in healthcare workflows.

Key finding

Iterative self-reflection raises Med-NLI sample entailment scores across models on PubMedQA.

Numbers: Vicuna: 0.4684 -> 0.6380 (+0.1696); ChatGPT: 0.5850 -> 0.6824 (+0.0974)
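A generate-score-refine loop of the kind described above might look like the following sketch. The `generate`, `score`, and `refine` callables, the `threshold`, and the `max_iters` cap are all hypothetical placeholders; in practice they would wrap model calls and a consistency scorer such as an entailment model.

```python
from typing import Callable, Tuple


def generate_score_refine(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    refine: Callable[[str, str], str],
    threshold: float = 0.5,
    max_iters: int = 3,
) -> Tuple[str, float]:
    """Refine an answer until its score reaches the threshold or iterations run out."""
    best = generate(question)
    best_score = score(question, best)
    for _ in range(max_iters):
        if best_score >= threshold:
            break  # Good enough: stop spending model calls.
        candidate = refine(question, best)
        candidate_score = score(question, candidate)
        # Keep a refinement only if the scorer prefers it.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: each refinement appends "+", and the score counts the "+" marks.
answer, answer_score = generate_score_refine(
    "toy question",
    generate=lambda q: "draft",
    score=lambda q, a: a.count("+") / 4,
    refine=lambda q, a: a + "+",
)
```

Capping the iterations bounds latency and cost, and keeping only the best-scoring candidate means a bad refinement cannot make the final answer worse under the chosen scorer.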

Survey: Can knowledge graphs reduce hallucinations in large language models?

0.60

0.50

0.70

16

Adding knowledge graphs to LLMs can cut factual errors quickly, especially for small models and domain tasks, improving trustworthiness without full model retraining.

Key finding

KG-augmented retrieval can dramatically improve QA correctness for small models.

Numbers: reported >80% answer correctness gain on QA (Baek et al.; Sen et al.; Wu et al.)

Break long model outputs into atomic facts and score the share supported by a knowledge source (FACTSCORE); an automatic estimator matches humans

0.70

0.60

0.70

14

FACTSCORE gives a concrete, scalable way to measure how much of a long model output is actually supported by a trusted source; use it to audit model factuality, compare model variants, and prioritize fixes where unsupported claims can cause harm or liability.

Key finding

Commercial LMs have low factual precision on biographies of people.

Numbers: FACTSCORE: InstructGPT 42.5%, ChatGPT 58.3%, PerplexityAI 71.5% (Table 1)
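The scoring step described in this card reduces to a simple ratio once the atomic facts exist. The sketch below assumes the facts have already been split out, which in practice takes a model or a human, and uses a toy set of strings as the knowledge source. `fact_precision` and `is_supported` are illustrative names, not the paper's API.

```python
from typing import Callable, Iterable


def fact_precision(atomic_facts: Iterable[str],
                   is_supported: Callable[[str], bool]) -> float:
    """Return the share of atomic facts that the knowledge source supports."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Toy knowledge source: an exact-match set of trusted statements.
trusted = {
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
    "Ada Lovelace wrote notes on the Analytical Engine.",
}
facts = [
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
    "Ada Lovelace wrote notes on the Analytical Engine.",
    "Ada Lovelace invented the telephone.",
]
precision = fact_precision(facts, lambda fact: fact in trusted)
```

Here three of the four facts are supported, so the score is 0.75. A real support check would use retrieval plus a verifier rather than exact string matching.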

SUMMEDITS: a low-cost, reproducible benchmark showing most LLMs still fail at fine-grained factual consistency

0.60

0.50

0.70

12

Before relying on LLMs to flag factual errors, validate them on tough, domain-specific tests; cheap in-house benchmarks like SUMMEDITS catch real gaps and cut annotation cost dramatically.

Key finding

LLMs match or beat specialized methods on simple benchmarks but degrade on harder settings.

Numbers: FactCC GPT-4 balanced acc. 91.3% (Table 1); SUMMEDITS overall GPT-4 82.4% vs QAFactEval 65.7% (Table 9)
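The numbers above are balanced accuracies, meaning the average of per-class recall. A small sketch shows why that matters when consistent and inconsistent summaries are unevenly represented. The function name and the toy labels are assumptions for illustration only.

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average the recall of each class so that rare classes count equally."""
    recalls: List[float] = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Toy data: 8 consistent (1) and 2 inconsistent (0) summaries.
y_true = [1] * 8 + [0] * 2
# A lazy detector that labels everything "consistent".
y_pred = [1] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

The lazy detector reaches 0.8 plain accuracy but only 0.5 balanced accuracy, because it catches none of the inconsistent summaries.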

Head-to-Tail: an 18K-question benchmark showing LLMs are far from perfect on factual knowledge, especially long-tail facts.

0.30

0.60

0.50

12

LLMs do not reliably store factual knowledge: product features that assume accurate factual recall (search, knowledge APIs, assistants) should keep symbolic knowledge sources or retrieval layers for long-tail and critical facts.

Key finding

Best overall QA accuracy on Head-to-Tail is low.

Numbers: GPT-4 ALM = 30.9% (Table 3)

FELM: a fine‑grained benchmark that tests factuality detectors across five domains

0.30

0.40

0.30

12

Automated factuality checks are needed: one in three long ChatGPT responses in this benchmark contains an error, and current LLM-only detectors miss many mistakes—so businesses should add retrieval and human oversight before trusting model outputs.

Key finding

FELM covers five realistic domains and contains thousands of fine‑grained segments.

Numbers: 847 samples, 4,425 segments; avg response 89.1 tokens

A concise, up-to-date roadmap of text summarization research before and during the LLM era

0.70

0.30

0.80

11

LLMs let teams deploy usable summaries quickly with zero/few-shot prompts, but hallucination and unreliable automatic metrics mean businesses must pair LLMs with retrieval, human checks, or smaller fine-tuned models for safety.

Key finding

LLMs shift summarization to zero- and few-shot settings and often produce human-preferred summaries in human studies.

Numbers: Human studies report annotator preference for GPT-3/GPT-4 summaries (multiple papers cited)

Use an LLM to spot its own factual claims and auto-check them against Wikidata to cut hallucinations

0.60

0.50

10

KGR can reduce factual errors in model outputs, especially for multi-step reasoning tasks, lowering risk in customer-facing answers and automated reporting without retraining large models.

Key finding

KGR raises ChatGPT F1 on Mintaka (complex reasoning) by about 6.2 points over question-relevant KG retrieval (QKR).

Numbers: ChatGPT Mintaka F1: QKR 54.6 -> KGR 60.8 (+6.2)
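The checking step of such a pipeline can be sketched with a toy in-memory graph standing in for Wikidata. This covers only the verification of claims that were already extracted. The extraction itself, which the card attributes to the LLM, is out of scope here, and the triple format, `check_claims`, and the status labels are assumptions for this example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def check_claims(
    claims: List[Triple],
    kg: Dict[Tuple[str, str], str],
) -> List[Tuple[Triple, str, Optional[str]]]:
    """Label each claim as supported, contradicted, or unverifiable against the graph."""
    results = []
    for subject, relation, obj in claims:
        known = kg.get((subject, relation))
        if known is None:
            status = "unverifiable"  # The graph has no entry for this pair.
        elif known == obj:
            status = "supported"
        else:
            status = "contradicted"  # `known` holds the value to retrofit.
        results.append(((subject, relation, obj), status, known))
    return results


# Toy graph with a single fact; real use would query a knowledge graph.
toy_kg = {("Marie Curie", "born_in"): "Warsaw"}
claims = [
    ("Marie Curie", "born_in", "Paris"),
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "died_in", "Passy"),
]
checked = check_claims(claims, toy_kg)
```

Contradicted claims come back with the graph's value, so a later step can rewrite the draft. Unverifiable claims are kept separate because a missing entry is not evidence of an error.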

REFEED: refine LLM outputs by retrieving documents about the model's own answers

0.60

0.50

0.75

10

You can improve factual accuracy of LLM outputs at inference time without costly fine-tuning by adding a retrieval-feedback loop that conditions retrieval on model answers.

Key finding

REFEED improves open-domain QA accuracy over retrieve-then-read baselines in zero-shot experiments.

Numbers: roughly +6% overall zero-shot improvement (as reported)
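The distinguishing step, retrieving with the model's own draft answer rather than the question alone, can be sketched as follows. All names are hypothetical, and concatenating question and draft into a single query is an assumption of this sketch. The toy stubs stand in for an LLM and a retriever.

```python
from typing import Callable, List


def refine_with_answer_retrieval(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, str, List[str]], str],
) -> str:
    """Draft an answer, retrieve documents about that draft, then refine it."""
    draft = generate(question)
    # Condition retrieval on the draft as well as the question (assumed query format).
    documents = retrieve(f"{question} {draft}")
    return refine(question, draft, documents)


CORPUS = [
    "Canberra is the capital of Australia.",
    "Sydney is the largest city in Australia.",
]


def toy_retrieve(query: str) -> List[str]:
    """Return corpus sentences that share at least one word with the query."""
    words = set(query.lower().replace("?", "").split())
    return [d for d in CORPUS if words & set(d.lower().rstrip(".").split())]


def toy_refine(question: str, draft: str, documents: List[str]) -> str:
    """Replace the draft with the subject of any retrieved 'capital' sentence."""
    for doc in documents:
        if "capital" in doc:
            return doc.split()[0]
    return draft


result = refine_with_answer_retrieval(
    "What is the capital of Australia?",
    generate=lambda q: "Sydney",  # Toy model with a wrong first guess.
    retrieve=toy_retrieve,
    refine=toy_refine,
)
```

When retrieval returns nothing useful, the draft passes through unchanged, so the loop degrades to plain generation rather than failing.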

ALCE: a reproducible benchmark and metrics to make LLM answers cite their sources

0.60

10

If you build customer-facing assistants, ALCE gives a reproducible way to measure whether answers are supported by sources and helps reduce user mistrust from hallucinations.

Key finding

Many best-performing models still fail to fully support their answers with cited passages on open-ended questions.

Numbers: ≈50% of generations lack full citation support on ELI5 (ChatGPT/GPT-4)

A practical survey and benchmark that measures factuality, robustness, fairness, transparency, accountability and privacy in RAG systems.

0.40

0.30

9

RAG systems can improve factual answers but also introduce privacy leaks, bias and brittle behavior; measuring those risks with a practical benchmark helps choose models and safeguards before production.

Key finding

Proprietary models outperform most open-source models on trustworthiness metrics.

Numbers: GPT-3.5 factuality=40 vs Llama2-13b-chat=4 (Table 2)

A practical review of how PLMs and LLMs drive biomedical text summarization and where they still fail

0.40

0.50

0.60

9

Automated summarization can cut clinician time and speed literature review, but current models still make factual errors; businesses should combine domain-adapted PLMs or LLM prompting with verification steps before clinical use.

Key finding

Domain-adapted PLMs give the best extractive results on PubMed.

Numbers: PubMed-short ROUGE-1: KeBioSum 43.98 vs TextRank 38.15