Practical survey of what makes LLMs factual, how we test it, and how to fix it

October 11, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The survey synthesizes many validated studies and benchmarks. Its practical guidance is strong for engineers, but many of the reported improvements depend on specific datasets and setups.

Citations: 52

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Cunxiang Wang, Xiaoze Liu, Yuanhao Yue, Xiangru Tang, Tianhang Zhang, Cheng Jiayang, Yunzhi Yao, Wenyang Gao, Xuming Hu, Zehan Qi, Yidong Wang, Linyi Yang, Jindong Wang, Xing Xie, Zheng Zhang, Yue Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.

Who Should Care

Summary TLDR

This 62-page survey defines factuality for LLMs, reviews how researchers measure it (metrics and benchmarks), analyzes why LLMs make factual errors (model, retrieval, and decoding causes), and catalogs practical fixes (pretraining, fine-tuning, retrieval-augmentation, agents, prompts, decoding tweaks). It highlights gaps in evaluation, trade-offs between parametric and retrieved knowledge, and how domain models (medicine, law, finance) apply continual training or external knowledge to reduce errors. The authors maintain an open repo of resources.

Problem Statement

Large language models often produce outputs that contradict reliable facts. This survey asks: how do we define and measure factuality, why do errors happen inside LLMs and retrieval pipelines, and what practical methods improve factual accuracy—especially for domain-specific uses?

Main Contribution

Clear, practical definition and taxonomy of factuality vs hallucination and related concepts.

Comprehensive review of evaluation metrics and benchmarks for factuality across general and domain-specific tasks.

Key Findings

Off-the-shelf LLMs often have low factual precision on long-form biographical text.

Numbers: FActScore range 42%–71% for commercial LLMs on biographies

Practical Use: Do not trust free-form biographies from LLMs without verification; add retrieval or atomic-fact checks before use.

Evidence Ref: Sec 3.1.3; Table 3 (FActScore summary)
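
The atomic-fact check recommended here can be sketched as a simple precision over extracted facts. This is a minimal illustration of a FActScore-like score (share of atomic facts supported by a knowledge source), not the official FActScore implementation. The `is_supported` callback, the `REFERENCE` set, and the sample facts are assumptions made for the example; in practice both fact splitting and support checking are typically done with an LLM or a retriever.

```python
from typing import Callable, Iterable


def atomic_fact_precision(
    facts: Iterable[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the fraction of atomic facts judged as supported."""
    facts = list(facts)
    if not facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical reference knowledge: a real system would query a retriever
# or a trusted corpus instead of a hard-coded set.
REFERENCE = {
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
}

# Hypothetical atomic facts extracted from a generated biography.
generated_facts = [
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
    "Ada Lovelace was born in Paris.",       # not in the reference
    "Ada Lovelace invented the telephone.",  # not in the reference
]

score = atomic_fact_precision(generated_facts, REFERENCE.__contains__)
print(f"FActScore-like precision: {score:.0%}")
```

Tracking which facts fail, not only the ratio, helps separate error types (wrong dates, wrong places, invented achievements).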

Top LLMs show strong gains on standardized benchmarks but still make many factual errors.

Numbers: MMLU (5-shot) GPT-4: 86.4% vs GPT-3.5: 70%

Practical Use: High benchmark scores signal competence but not perfect factual safety; validate outputs for critical decisions.

Evidence Ref: Table 4 (MMLU results)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | GPT-4: 86.4% | GPT-3.5: 70% | - | MMLU (selected tasks) | Table 4 reports MMLU 5-shot scores | Table 4
TruthfulQA (MC1 / %true) | GPT-4 ~29 (MC1); LLaMA2-70B 53.37% true (other metric variants) | GPT-3.5 ~28 (MC1) | - | TruthfulQA | Table 4 TruthfulQA entries | Table 4
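
No delta is listed for the MMLU row, but it follows from the two reported values. A minimal sketch of the arithmetic, assuming both scores come from the same 5-shot setting as the Table 4 description suggests; the variable names are illustrative.

```python
gpt4_mmlu = 86.4   # GPT-4, MMLU 5-shot (%), as reported
gpt35_mmlu = 70.0  # GPT-3.5, MMLU 5-shot (%), as reported

# Absolute gap in percentage points, and the gap relative to the baseline.
delta_pp = round(gpt4_mmlu - gpt35_mmlu, 1)
relative_gain = round(delta_pp / gpt35_mmlu * 100, 1)

print(f"Delta: {delta_pp} percentage points ({relative_gain}% relative)")
```

The TruthfulQA row mixes metric variants (MC1 and %true), so the same subtraction would not be meaningful there.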

What To Try In 7 Days

Add a simple RAG layer: index company docs and return top 3 passages alongside model outputs.

Run FActScore-like atomic checks on a sample of generated content to quantify error types.

Use prompting patterns that ask the model to cite sources and then verify citations automatically.
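
The first item above, returning the top 3 passages alongside a model output, can be prototyped without any infrastructure. The sketch below uses bag-of-words cosine similarity as a stand-in for a real embedding model and vector DB; `DOCS`, the query, and all function names are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_passages(query: str, passages: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k passages most similar to the query, best first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(p)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical company documents.
DOCS = [
    "Refunds are issued within 14 days of a returned order.",
    "Support is available on weekdays from 9:00 to 17:00.",
    "Orders above 50 EUR ship free of charge.",
    "Returned items must be unused and in original packaging.",
]

query = "How many days does a refund take for a returned order?"
hits = top_k_passages(query, DOCS)
for score, passage in hits:
    print(f"{score:.2f}  {passage}")
```

Showing the retrieved passages next to the answer lets a reviewer check grounding by eye before any automated verification is added.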

Agent Features

Memory
retrieval memory (vector DB); external entity mention memory (TOME, KALA)
Planning
chain-of-thought / verification chains; dynamic retrieval decisions (FLARE)
Tool Use
web search / search APIs; database & KG queries; entity extraction + validation tools
Frameworks
ReAct; Chain-of-Verification; Self-RAG; Reflexion
Architectures
multi-agent debate (LM-vs-LM); ReAct agent loop (reasoning + actions)
Collaboration
cross-model examination (LM vs LM); multi-agent debating for verification
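
Of the frameworks listed, Chain-of-Verification follows a draft, question, check, revise loop. The sketch below shows only that control flow, assuming a hypothetical `llm` callable (prompt in, text out). The prompt wording and the `fake_llm` stub are illustrative and do not reproduce the prompts of the original method.

```python
from typing import Callable


def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    # 1. Draft a baseline answer.
    draft = llm(f"Answer the question.\nQ: {question}")
    # 2. Plan verification questions about the draft (one per line).
    plan = llm(f"List verification questions, one per line, for this answer:\n{draft}")
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # 3. Answer each verification question without showing the draft,
    #    so errors in the draft are not simply repeated.
    evidence = [(check, llm(f"Answer concisely.\nQ: {check}")) for check in checks]
    # 4. Revise the draft in light of the verification answers.
    notes = "\n".join(f"- {check} -> {answer}" for check, answer in evidence)
    return llm(f"Revise the draft using the checks.\nDraft: {draft}\nChecks:\n{notes}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text per step."""
    if prompt.startswith("Answer the question."):
        return "The Eiffel Tower is in Berlin."
    if prompt.startswith("List verification questions"):
        return "Where is the Eiffel Tower located?"
    if prompt.startswith("Answer concisely."):
        return "Paris"
    return "The Eiffel Tower is in Paris."


final = chain_of_verification("Where is the Eiffel Tower?", fake_llm)
print(final)
```

Swapping `fake_llm` for a real client keeps the loop unchanged; the verification step can also call a search API instead of the model.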

Optimization Features

Training Optimization
deduplication of pretraining data; informative-token masking (PMI-based)
Inference Optimization
inference-time interventions (activation steering); factual-nucleus sampling; decoding-by-contrasting-layers (DoLa)
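
Factual-nucleus sampling, listed above, lowers the top-p threshold as a sentence progresses and resets it at the start of each new sentence. A minimal sketch of the decay schedule p_t = max(omega, p * lambda^(t-1)) as the method is commonly described, combined with a plain top-p filter. The parameter values, the function names, and the toy distribution are assumptions for illustration.

```python
def factual_nucleus_p(t: int, p: float = 0.9, decay: float = 0.9, floor: float = 0.3) -> float:
    """Top-p threshold for the t-th token (1-based) of the current sentence.

    t is reset to 1 whenever a new sentence starts.
    """
    if t < 1:
        raise ValueError("t is 1-based")
    return max(floor, p * decay ** (t - 1))


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches p, renormalized."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept.items()}


# Toy next-token distribution (made up for the example).
probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Rome": 0.05}

early = top_p_filter(probs, factual_nucleus_p(1))  # start of sentence: wide nucleus
late = top_p_filter(probs, factual_nucleus_p(8))   # later token: narrow nucleus
print(list(early), list(late))
```

The floor keeps decoding from collapsing into fully greedy output on long sentences, which trades some diversity for factuality rather than removing it.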

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Heterogeneous evaluations: metrics, datasets, and settings differ, so cross-paper numbers are not directly comparable.

Many improvement results rely on controlled benchmarks; real-world robustness (noisy retrieval, adversaries) is less evaluated.

When Not To Use

When you need provable, auditable facts without any external verification step.

For safety-critical decisions without a validated domain-specific fine-tune and retrieval pipeline.

Failure Modes

Model overconfidence and failure to recognize unknowns (false certainty).

Retrieval noise or contradictory documents leading to incorrect grounding.
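
The overconfidence failure mode can be quantified with a calibration metric such as the Brier score, the mean squared gap between stated confidence and actual correctness (lower is better). A minimal sketch, assuming the system exposes a confidence in [0, 1] for each answer; the sample values are made up for illustration.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between confidence and correctness (1 = correct, 0 = wrong)."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


outcomes = [1, 0, 1, 0]  # half of the answers were actually correct

# A model that claims 90% confidence everywhere vs. one that admits uncertainty.
overconfident = brier_score([0.9, 0.9, 0.9, 0.9], outcomes)
hedged = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)

print(f"overconfident: {overconfident:.2f}, hedged: {hedged:.2f}")
```

With identical accuracy, the model that states false certainty scores worse, which makes the metric a useful gate before outputs reach critical workflows.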

Core Entities

Models

GPT-4, ChatGPT (GPT-3.5), GPT-3, LLaMA, LLaMA-2, Alpaca, Vicuna, BloombergGPT, HuatuoGPT, Gopher, PaLM

Metrics

Exact Match, Accuracy, F1, QUIP-Score, FActScore (atomic-fact %), Calibration / Brier, %Truth * Info (truthfulness × informativeness), Human evaluation (AIS)

Datasets

MMLU, TruthfulQA, NaturalQuestions (NQ), TriviaQA, FActScore (biographies), QUIP / QUIP-Score, HaluEval, C-Eval, FreshQA, FLARE, Huatuo-26M

Benchmarks

MMLU, TruthfulQA, BigBench, HaluEval, FreshQA, C-Eval, Pinocchio, RealTimeQA, SelfAware, FActScore

Context Entities

Models

Retro, T5, InstructGPT, Codex, PaLM-540B

Metrics

ROUGE, BLEU, BERTScore, BLEURT, BARTScore

Datasets

HotpotQA, ELI5, KILT, NaturalQuestions, GeneTuring

Benchmarks

BigBench Hard, Pinocchio, OceanBench