Overview
The survey synthesizes many validated studies and benchmarks. Its practical guidance for engineers is strong, but many of the reported improvements depend on specific datasets and experimental setups.
Citations: 52
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/5
Reproducibility
Status: Partial assets available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
LLMs are useful but make verifiable mistakes; businesses must add retrieval, verification, or domain tuning before using LLM outputs in advice, legal, medical, or financial workflows.
Who Should Care
Summary TLDR
This 62-page survey defines factuality for LLMs, reviews how researchers measure it (metrics and benchmarks), analyzes why LLMs make factual errors (model, retrieval, and decoding causes), and catalogs practical fixes (pretraining, fine-tuning, retrieval-augmentation, agents, prompts, decoding tweaks). It highlights gaps in evaluation, trade-offs between parametric and retrieved knowledge, and how domain models (medicine, law, finance) apply continual training or external knowledge to reduce errors. The authors maintain an open repo of resources.
Problem Statement
Large language models often produce outputs that contradict reliable facts. This survey asks: how do we define and measure factuality, why do errors happen inside LLMs and retrieval pipelines, and what practical methods improve factual accuracy—especially for domain-specific uses?
Main Contribution
Clear, practical definition and taxonomy of factuality vs hallucination and related concepts.
Comprehensive review of evaluation metrics and benchmarks for factuality across general and domain-specific tasks.
Key Findings
Off-the-shelf LLMs often have low factual precision on long-form biographical text.
Top LLMs show strong gains on standardized benchmarks but still make many factual errors.
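One common way to quantify "factual precision" on long-form text, in the spirit of FActScore-style evaluation, is the fraction of atomic claims that a trusted reference supports. The sketch below assumes the claims have already been split out and judged. The `AtomicClaim` type, the `factual_precision` helper, and the sample claims are hypothetical illustrations, not data or code from the survey.

```python
from dataclasses import dataclass


@dataclass
class AtomicClaim:
    text: str
    supported: bool  # judgment against a trusted reference (human or verifier)


def factual_precision(claims):
    """Return the fraction of claims judged supported, or None if there are none."""
    if not claims:
        return None
    return sum(c.supported for c in claims) / len(claims)


# Illustrative labels only; in practice these come from a verification step.
claims = [
    AtomicClaim("Ada Lovelace was born in 1815.", True),
    AtomicClaim("Ada Lovelace was born in Paris.", False),
    AtomicClaim("Ada Lovelace worked with Charles Babbage.", True),
    AtomicClaim("Ada Lovelace wrote notes on the Analytical Engine.", True),
]

score = factual_precision(claims)
print(f"factual precision: {score:.2f}")
```

Returning `None` for an empty claim list keeps "nothing to check" distinct from "everything was wrong", which matters when averaging over many generations.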
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | GPT-4: 86.4% | GPT-3.5: 70% | — | MMLU (selected tasks) | Table 4 reports MMLU 5-shot scores | Table 4 |
| TruthfulQA (MC1 / % true) | GPT-4: ~29 (MC1); LLaMA2-70B: 53.37% true (a different metric variant) | GPT-3.5: ~28 (MC1) | — | TruthfulQA | Table 4 TruthfulQA entries | Table 4 |
What To Try In 7 Days
Add a simple RAG layer: index company docs and return top 3 passages alongside model outputs.
Run FActScore-like atomic checks on a sample of generated content to quantify error types.
Use prompting patterns that ask the model to cite sources and then verify citations automatically.
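The first item above can be prototyped without any external services. The following is a minimal sketch, assuming a small in-memory set of passages and plain TF-IDF cosine similarity in place of a production vector index. The function names and sample passages are hypothetical, and a real deployment would likely use a dedicated retriever.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Return one TF-IDF vector (dict) per passage, plus the IDF table."""
    tokenized = [tokenize(p) for p in passages]
    n = len(passages)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, passages, vectors, idf, k=3):
    """Return up to k (passage, score) pairs with non-zero similarity, best first."""
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scored = sorted(
        ((cosine(q, v), i) for i, v in enumerate(vectors)),
        key=lambda s: (-s[0], s[1]),
    )
    return [(passages[i], score) for score, i in scored[:k] if score > 0]


# Hypothetical company documents.
passages = [
    "Refunds are issued within 14 days of a return request.",
    "Support is available on weekdays from 9am to 5pm.",
    "Return requests must include the original order number.",
    "The office cafeteria serves lunch at noon.",
]
vectors, idf = build_index(passages)
results = top_k("How do I request a refund for a return?", passages, vectors, idf)
for passage, score in results:
    print(f"{score:.3f}  {passage}")
```

Showing the returned passages next to the model output gives reviewers something concrete to check claims against. Passages with zero similarity are dropped, so fewer than three may be shown.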
Risks & Boundaries
Limitations
Heterogeneous evaluations: metrics, datasets, and settings differ, so cross-paper numbers are not directly comparable.
Many reported improvements rely on controlled benchmarks; real-world robustness (noisy retrieval, adversarial inputs) is evaluated less often.
When Not To Use
When you need provable, auditable facts without any external verification step.
For safety-critical decisions without a validated domain-specific fine-tune and retrieval pipeline.
Failure Modes
Model overconfidence and failure to recognize unknowns (false certainty).
Retrieval noise or contradictory documents leading to incorrect grounding.

