Overview
This is a literature survey that synthesizes many studies and incidents; its recommendations are practical but rest on empirical evaluation that is still evolving and sometimes weak.
Citations: 33
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/0
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 55%
Production readiness: 40%
Novelty: 35%
Why It Matters For Business
LLMs can produce plausible-sounding falsehoods and leak sensitive inputs; unchecked use creates legal, reputational, and operational risk for any organization that relies on automated text.
Who Should Care
Summary TLDR
This paper reviews the factuality problems of large language models (LLMs): why they produce false or misleading text, how that risk is amplified by malicious uses and easy access to models, and which technological, regulatory, and educational steps can reduce harm. It summarizes evidence on hallucinations, citation gaps, data leakage, and evaluation weaknesses, and offers practical mitigation paths: retrieval grounding, model editing, better evaluation, privacy controls, provenance, and public AI literacy.
Problem Statement
LLMs generate fluent but sometimes false content ('hallucinations') and can be used maliciously. This undermines trust, strains fact-checking, risks privacy leaks, and challenges current evaluation methods. The paper asks: how large are these factuality risks, and which combined technical, regulatory, and social measures can reduce harm?
Main Contribution
A consolidated review of factuality failures in LLMs and examples of real-world harms.
A taxonomy of factuality challenges (undersourcing, confident tone, exposure bias, evaluation gaps).
Key Findings
During COVID-era chatbot use, health topics were very common: 30% of 6,594 user-chatbot interactions used the keyword 'COVID-19'.
ChatGPT showed higher accuracy on clinical questions (80%) than on evidence-based questions (36%) in one evaluation.
What To Try In 7 Days
Audit where staff paste external or internal data into public LLMs and block or monitor that behavior.
Prototype a retrieval-augmented workflow for one high-stakes task (customer support, HR, or legal) and log provenance.
For one week, require human sign-off on any LLM output used in decisions or public messaging, and track the issues reviewers find against the issues that surface in unreviewed outputs.
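The retrieval-augmented prototype with provenance logging can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes a small in-memory corpus, uses simple word overlap as a stand-in for a real search index or embedding model, and stops at building the grounded prompt rather than calling any particular LLM. The names `Passage`, `retrieve`, `build_prompt`, and `provenance_record`, and the source ids in the usage note, are hypothetical.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Passage:
    source_id: str  # hypothetical identifier, e.g. an internal document name
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a stand-in for a real search index or embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by word overlap with the question; keep the top k that overlap at all."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p.text)), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Compose a prompt that restricts the model to the retrieved sources."""
    sources = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the sources below and cite their ids. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def provenance_record(question: str, passages: list[Passage]) -> dict:
    """Record which sources backed a prompt, for later audit."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "source_ids": [p.source_id for p in passages],
    }


if __name__ == "__main__":
    corpus = [
        Passage("hr-leave", "Employees get 25 vacation days per year."),
        Passage("hr-remote", "Remote work requires manager approval."),
    ]
    question = "How many vacation days do employees get?"
    hits = retrieve(question, corpus)
    print(build_prompt(question, hits))
    print(json.dumps(provenance_record(question, hits)))
```

Writing each provenance record to an append-only log gives reviewers a way to check which sources backed an answer, and an empty `source_ids` list is a useful signal that the output was not grounded at all.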
Risks & Boundaries
Limitations
Rapidly changing field: findings may be outdated as new models and defenses appear.
Survey relies on published studies and reported incidents rather than new large-scale experiments.
When Not To Use
In high-stakes clinical, legal, or financial decisions without expert oversight.
When source-level provenance is required and cannot be produced.
Failure Modes
Confident-sounding but false statements ('authoritative liars').
Undersourcing: missing or incorrect citations.

