Practical survey of what makes LLMs factual, how we test it, and how to fix it

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many validated studies and benchmarks. Its practical guidance for engineers is strong, but many of the reported improvements depend on specific datasets and setups.

Citations: 52

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang

Why It Matters For Business

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advisory, legal, medical, or financial workflows.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

This 62-page survey defines factuality for LLMs, reviews how researchers measure it (metrics and benchmarks), analyzes why LLMs make factual errors (model, retrieval, and decoding causes), and catalogs practical fixes (pretraining, fine-tuning, retrieval-augmentation, agents, prompts, decoding tweaks). It highlights gaps in evaluation, trade-offs between parametric and retrieved knowledge, and how domain models (medicine, law, finance) apply continual training or external knowledge to reduce errors. The authors maintain an open repo of resources.

Problem Statement

Large language models often produce outputs that contradict reliable facts. This survey asks: how do we define and measure factuality, why do errors happen inside LLMs and retrieval pipelines, and what practical methods improve factual accuracy—especially for domain-specific uses?

Main Contribution

Clear, practical definition and taxonomy of factuality vs hallucination and related concepts.

Comprehensive review of evaluation metrics and benchmarks for factuality across general and domain-specific tasks.

Key Findings

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies

Practical Use: Do not trust free-form biographies from LLMs without verification; add retrieval or atomic-fact checks before use.

Evidence Ref: Sec 3.1.3; Table 3 (FActScore summary)
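
An atomic-fact check of the kind recommended above can be sketched as the share of atomic facts in a generation that a knowledge source supports. This is a minimal sketch, not the FActScore implementation: it assumes the atomic facts have already been extracted, and `is_supported` is a hypothetical stand-in verifier (naive substring match) where a real pipeline would use a retrieval-backed or LM-based judge. The example facts and knowledge passage are illustrative.

```python
from typing import Callable, Sequence


def is_supported(fact: str, knowledge: Sequence[str]) -> bool:
    """Hypothetical verifier: a fact counts as supported if it appears
    verbatim (case-insensitive) in any knowledge passage."""
    needle = fact.lower().strip()
    return any(needle in passage.lower() for passage in knowledge)


def atomic_precision(
    facts: Sequence[str],
    knowledge: Sequence[str],
    verifier: Callable[[str, Sequence[str]], bool] = is_supported,
) -> float:
    """Share of atomic facts judged supported; 0.0 for an empty fact list."""
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if verifier(fact, knowledge))
    return supported / len(facts)


# Illustrative data (assumption, not from the survey).
knowledge = ["Marie Curie was born in Warsaw. Marie Curie won two Nobel Prizes."]
facts = [
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
    "Marie Curie was born in Paris",  # unsupported claim
]
score = atomic_precision(facts, knowledge)
print(f"atomic precision: {score:.2f}")
```

Swapping in a stronger `verifier` changes the judgment quality but not the scoring logic, which is why the verifier is passed as a parameter.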

Top LLMs show strong gains on standardized benchmarks but still make many factual errors.

Numbers: MMLU (5-shot) GPT-4: 86.4% vs GPT-3.5: 70%

Practical Use: High benchmark scores signal competence but not perfect factual safety; validate outputs for critical decisions.

Evidence Ref: Table 4 (MMLU results)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4: 86.4%	GPT-3.5: 70%	—	MMLU (selected tasks)	Table 4 reports MMLU 5-shot scores	Table 4
TruthfulQA (MC1 / %true)	GPT-4 ~29 (MC1); LLaMA2-70B 53.37% true (other metric variants)	GPT-3.5 ~28 (MC1)	—	TruthfulQA	Table 4 TruthfulQA entries	Table 4

What To Try In 7 Days

Add a simple RAG layer: index company docs and return top 3 passages alongside model outputs.

Run FActScore-like atomic checks on a sample of generated content to quantify error types.

Use prompting patterns that ask the model to cite sources and then verify citations automatically.
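
The first step above (index documents, return the top 3 passages) can be sketched with a bag-of-words cosine ranker using only the standard library. This is a minimal sketch under stated assumptions: a production RAG layer would more likely use embeddings and a vector store, and the function names, sample documents, and prompt wording here are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_passages(query: str, passages: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, passages: list[str], k: int = 3) -> str:
    """Number the retrieved passages so the model can cite them."""
    hits = top_k_passages(query, passages, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, (_, p) in enumerate(hits))
    return (
        "Answer using only the numbered passages and cite them.\n"
        f"{context}\nQuestion: {query}"
    )


# Illustrative company docs (assumption).
docs = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the original receipt.",
    "Shipping is free for orders above 50 euros.",
]
question = "How long do refunds take after a return request?"
hits = top_k_passages(question, docs)
prompt = build_prompt(question, docs)
```

Numbering the passages in the prompt also supports the third step: cited passage numbers in the model output can be checked automatically against the passages that were actually retrieved.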

Agent Features

Memory

retrieval memory (vector DB)

external entity mention memory (TOME, KALA)

Planning

chain-of-thought / verification chains

dynamic retrieval decisions (FLARE)

Tool Use

web search / search APIs

database & KG queries

entity extraction + validation tools

Frameworks

ReAct

Chain-of-Verification

Self-RAG

Reflexion

Architectures

multi-agent debate (LM-vs-LM)

ReAct agent loop (reasoning + actions)

Collaboration

cross-model examination (LM vs LM)

multi-agent debating for verification

Optimization Features

Training Optimization

deduplication of pretraining data

informative-token masking (PMI-based)
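
Deduplication in its simplest form can be sketched as exact-match removal after light normalization. This is a minimal sketch, assuming whitespace and case differences should not count as distinct documents; real pretraining pipelines typically add near-duplicate detection as well, which is not shown here. The function names are hypothetical.

```python
import hashlib


def normalized_key(text: str) -> str:
    """Hash of the lowercased, whitespace-collapsed text."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized document, in order."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        key = normalized_key(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


corpus = ["The sky is blue.", "the  sky is BLUE.", "Grass is green."]
cleaned = deduplicate(corpus)
```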

Inference Optimization

inference-time interventions (activation steering)

factual-nucleus sampling

decoding-by-contrasting-layers (DoLa)
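
Factual-nucleus sampling can be sketched as ordinary nucleus (top-p) sampling whose p shrinks as a sentence progresses and resets at each new sentence, commonly written as p_t = max(ω, p · λ^(t−1)). The sketch below is a toy over a plain token-probability dict, not a model integration; the default values for `p`, `decay`, and `floor` are illustrative assumptions rather than recommended settings, and the function names are hypothetical.

```python
import random


def decayed_top_p(step_in_sentence: int, p: float = 0.9,
                  decay: float = 0.9, floor: float = 0.3) -> float:
    """Nucleus mass for the t-th token of a sentence (t starts at 1).
    The caller resets t to 1 whenever a new sentence begins."""
    return max(floor, p * decay ** (step_in_sentence - 1))


def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose total
    mass reaches top_p, then renormalize."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= top_p:
            break
    return {token: prob / mass for token, prob in kept.items()}


def sample_token(probs: dict[str, float], step_in_sentence: int,
                 rng: random.Random) -> str:
    """Sample one token from the nucleus sized for this sentence position."""
    kept = nucleus(probs, decayed_top_p(step_in_sentence))
    tokens = list(kept)
    return rng.choices(tokens, weights=[kept[t] for t in tokens], k=1)[0]


# Toy next-token distribution (assumption).
next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(0)
early = sample_token(next_token_probs, step_in_sentence=1, rng=rng)
late = sample_token(next_token_probs, step_in_sentence=50, rng=rng)
```

Early in a sentence the nucleus is wide, so any of the three tokens may be drawn; late in a sentence the nucleus narrows to the floor and only the most probable token survives.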

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/wangcunxiang/LLM-Factuality-Survey

Risks & Boundaries

Limitations

Heterogeneous evaluations: metrics, datasets, and settings differ, so cross-paper numbers are not directly comparable.

Many improvement results rely on controlled benchmarks; real-world robustness (noisy retrieval, adversaries) is less evaluated.

When Not To Use

When you need provable, auditable facts without any external verification step.

For safety-critical decisions without a validated domain-specific fine-tune and retrieval pipeline.

Failure Modes

Model overconfidence and failure to recognize unknowns (false certainty).

Retrieval noise or contradictory documents leading to incorrect grounding.

Core Entities

Models

GPT-4, ChatGPT (GPT-3.5), GPT-3, LLaMA, LLaMA-2, Alpaca, Vicuna, BloombergGPT, HuatuoGPT, Gopher, PaLM

Metrics

Exact Match, Accuracy, F1, QUIP-Score, FActScore (atomic-fact %), Calibration / Brier, %Truth * Info (truthfulness × informativeness), Human evaluation (AIS)
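
Three of the metrics listed above (Exact Match, token-level F1, and the Brier score for calibration) are simple enough to sketch directly. This is a minimal sketch with simplified normalization (lowercasing and whitespace splitting only); standard QA evaluation scripts usually also strip punctuation and articles. The example predictions and confidences are illustrative.

```python
from collections import Counter
from typing import Sequence


def normalize(text: str) -> list[str]:
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and correctness (0 or 1).
    Lower is better; 0.0 means perfectly calibrated and correct."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


em = exact_match("Paris", "paris")
f1 = token_f1("the city of Paris", "Paris")
brier = brier_score([0.9, 0.6], [1, 0])
```

Exact Match rewards only verbatim answers, token F1 gives partial credit for verbose answers, and the Brier score penalizes confident wrong answers, which is the overconfidence failure mode noted elsewhere in this summary.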

Datasets

MMLU, TruthfulQA, NaturalQuestions (NQ), TriviaQA, FActScore (biographies), QUIP / QUIP-Score, HaluEval, C-Eval, FreshQA, FLARE, Huatuo-26M

Benchmarks

MMLU, TruthfulQA, BigBench, HaluEval, FreshQA, C-Eval, Pinocchio, RealTimeQA, SelfAware, FActScore

Context Entities

Models

Retro, T5, InstructGPT, Codex, PaLM-540B

Metrics

ROUGE, BLEU, BERTScore, BLEURT, BARTScore

Datasets

HotpotQA, ELI5, KILT, NaturalQuestions, GeneTuring

Benchmarks

BigBench Hard, Pinocchio, OceanBench
