Survey of how LLMs produce and spread factual errors—and what to do about it

October 8, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

This is a literature survey synthesizing many studies and incidents; recommendations are practical but rely on evolving and sometimes weak empirical evaluation.

Citations: 33

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 40%

Novelty: 35%

Authors

Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, Eduard Hovy, Heng Ji, Filippo Menczer, Ruben Miguez, Preslav Nakov, Dietram Scheufele, Shivam Sharma, Giovanni Zagni

Why It Matters For Business

LLMs can produce plausible-sounding falsehoods and leak sensitive inputs; unchecked use creates legal, reputational, and operational risk for any organization that relies on automated text.

Who Should Care

Summary TLDR

This paper reviews the factuality problems of large language models (LLMs): why they produce false or misleading text, how malicious use and easy access to models amplify that risk, and which technological, regulatory, and educational steps can reduce harm. It summarizes evidence on hallucinations, citation gaps, data leakage, and evaluation weaknesses, and it offers practical mitigation paths: retrieval grounding, model editing, better evaluation, privacy controls, provenance, and public AI literacy.

Problem Statement

LLMs generate fluent but sometimes false content ('hallucinations') and can be used maliciously. This undermines trust, strains fact-checking, risks privacy leaks, and challenges current evaluation methods. The paper asks: how big are these factuality risks and what combined technical, regulatory, and social measures can reduce harm?

Main Contribution

A consolidated review of factuality failures in LLMs and examples of real-world harms.

A taxonomy of factuality challenges (undersourcing, confident tone, exposure bias, evaluation gaps).

Key Findings

During COVID-era chatbot use, health topics were very common: 30% of 6,594 user-chatbot interactions used the keyword 'COVID-19'.

Numbers: 30% of 6,594 interactions

Practical Use: Do not treat chatbots as reliable medical advisors; add human review and up-to-date grounding when supporting health queries.

Evidence Ref: Chin et al. 24

ChatGPT showed higher accuracy on clinical questions (80%) than on evidence-based questions (36%) in one evaluation.

Numbers: 80% vs 36% accuracy (evaluated study)

Practical Use: Use LLM outputs only as rough assistance in clinical contexts and require expert verification for evidence-based decisions.

Evidence Ref: Kusunose et al. 47

What To Try In 7 Days

Audit where staff paste external or internal data into public LLMs and block or monitor that behavior.

Prototype a retrieval-augmented workflow for one high-stakes task (customer support, HR, or legal) and log provenance.

For one week, require human sign-off on any LLM output used in decisions or public messaging, and track how many issues reviewers catch compared with issues found in unreviewed outputs.
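The retrieval-augmented prototype with provenance logging suggested above could start from something like the following. This is a minimal sketch, not a method from the paper: the `Passage` type, the word-overlap `retrieve` function, the log fields, and the sample HR corpus are all hypothetical, and a real deployment would swap in a proper retriever and send the prompt to whichever model is in use.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Passage:
    source_id: str  # hypothetical identifier for an internal document
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word and number tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by word overlap with the query (toy stand-in for a real retriever)."""
    q = tokenize(query)
    scored = [(len(q & tokenize(p.text)), p) for p in corpus]
    scored = [(s, p) for s, p in scored if s > 0]  # drop passages with no overlap
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Put the retrieved sources in front of the question so answers can cite them."""
    context = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {query}"
    )


def provenance_record(query: str, passages: list[Passage]) -> dict:
    """One log entry: which sources backed this query, and whether a human must look."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "sources": [p.source_id for p in passages],
        "needs_human_review": len(passages) == 0,  # no grounding found
    }


# Hypothetical sample data for illustration only.
corpus = [
    Passage("hr-policy-3", "Employees accrue 1.5 vacation days per month."),
    Passage("hr-policy-7", "Remote work requires manager approval."),
]
query = "How many vacation days do employees accrue?"
hits = retrieve(query, corpus)
prompt = build_prompt(query, hits)
record = provenance_record(query, hits)
print(json.dumps(record, indent=2))
```

Logging the source ids alongside each query is what makes later audits possible: a reviewer can check whether an answer was grounded in any document at all, and queries with no retrieved sources are flagged for human review rather than answered blindly.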

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Rapidly changing field: findings may be outdated as new models and defenses appear.

Survey relies on published studies and reported incidents rather than new large-scale experiments.

When Not To Use

In high-stakes clinical, legal, or financial decisions without expert oversight.

When source-level provenance is required and cannot be produced.

Failure Modes

Confident-sounding but false statements ('authoritative liars').

Undersourcing: missing or incorrect citations.
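A first-pass screen for the undersourcing failure mode might look like the sketch below. It assumes, purely for illustration, that citations appear as bracketed numbers such as `[3]` or as URLs; the `CITATION` pattern and the `unsourced_sentences` helper are hypothetical. It only detects sentences with no citation marker and cannot tell whether a citation that is present is correct, so it complements human review rather than replacing it.

```python
import re

# Assumed citation formats: bracketed reference numbers or bare URLs.
CITATION = re.compile(r"\[\d+\]|https?://\S+")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def unsourced_sentences(text: str) -> list[str]:
    """Return the sentences that carry no citation marker."""
    return [s for s in split_sentences(text) if not CITATION.search(s)]


# Hypothetical model output for illustration only.
output = (
    "The policy changed in 2021 [3]. Everyone agrees it works. "
    "See https://example.com/report for details."
)
flagged = unsourced_sentences(output)
print(flagged)
```

Sentences returned by `unsourced_sentences` are candidates for a reviewer to source or remove before the text is published.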

Core Entities

Models

ChatGPT, GPT-4, LLaMA 2, Alpaca, Vicuna, Claude, Falcon, Jais, Jurassic-2

Metrics

GPTScore, G-Eval, BERTScore, MoverScore

Datasets

TruthfulQA, FActScore, BIG-bench, GLUE, SuperGLUE, NewsClaims

Benchmarks

TruthfulQA, FActScore, SelfCheckGPT

Context Entities

Models

DALL·E, Midjourney, Stable Diffusion

Metrics

perplexity-based checks, human expert evaluation

Datasets

CLEF2022-CheckThat!, M4 repository

Benchmarks

Do-not-answer dataset (safeguard eval)