Truthfulness Evaluation Papers — Parsed & Scored for Practitioners

Practical survey of why LLMs hallucinate, how we measure it, and what fixes work today

0.70

0.50

0.60

233

Hallucinations create real risks (misinformation, legal/medical errors, loss of trust). Businesses should treat factuality as a first-class metric when deploying LLMs in production.

Key finding

Hallucination is multi-dimensional: input-, context-, and fact-conflicting types require different tests and fixes.

Shift a few attention-head activations at inference to make LLMs answer more truthfully

0.60

0.70

39

ITI is a low-cost way to reduce factual errors without heavy finetuning; it can be added to deployed models that expose activations to improve trustworthiness quickly.

Key finding

ITI substantially increases truthfulness on TruthfulQA for instruction-tuned models; for Alpaca, the true*informative score roughly doubles.

Numbers: Alpaca true*informative 32.5% -> 65.1%
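
To make the idea of shifting a few attention-head activations concrete, here is a minimal sketch. It assumes a per-head "truthful" direction has already been estimated somehow; the function name, array shapes, choice of heads, and the strength `alpha` are all illustrative assumptions, not the paper's code.

```python
import numpy as np

def shift_head_activations(acts, directions, selected_heads, alpha=5.0):
    """Shift only the selected heads along their (hypothetical) truthful direction.

    acts:        (num_heads, head_dim) activations for one token position
    directions:  (num_heads, head_dim) assumed per-head directions
    """
    shifted = acts.copy()
    for h in selected_heads:
        unit = directions[h] / np.linalg.norm(directions[h])  # normalize to length 1
        shifted[h] += alpha * unit                            # move alpha units along it
    return shifted

# Toy data standing in for real model activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 16))
directions = rng.normal(size=(8, 16))
out = shift_head_activations(acts, directions, selected_heads=[1, 5], alpha=5.0)
```

Only heads 1 and 5 change in this toy run; every other head is passed through untouched, which is why the approach needs access to activations but no finetuning.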

Survey of how LLMs produce and spread factual errors—and what to do about it

0.40

0.35

0.55

33

LLMs can produce plausible-sounding falsehoods and leak sensitive inputs; unchecked use creates legal, reputational, and operational risk for any organization that relies on automated text.

Key finding

During COVID-era chatbot use, health topics were very common: 30% of 6,594 user-chatbot interactions used the keyword 'COVID-19'.

Numbers: 30% of 6,594 interactions

At decode time, subtract earlier-layer logits from later-layer logits to reduce hallucinations.

0.70

0.55

0.15

17

DoLa boosts factual output from large pretrained LMs without retraining or external retrieval, giving immediate, low-cost improvements for truth-sensitive products like QA assistants and chatbots.

Key finding

DoLa raises combined truthfulness×informativeness on open-ended TruthfulQA by about 12–17 absolute percentage points for LLaMA models.

Numbers: 12–17 pp improvement on %Truth*Info across LLaMA sizes (Table 1)
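
The layer-contrast step can be sketched in a few lines. This is a simplified illustration that assumes you already have next-token logits from one earlier layer and from the final layer; it leaves out any layer-selection or filtering logic the paper may use, and the toy logits are made up.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D logit vector."""
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def contrast_layers(early_logits, late_logits):
    """Score each token by how much its log-probability grew between layers."""
    return log_softmax(late_logits) - log_softmax(early_logits)

# Toy 3-token vocabulary: token 0 is already favored early on,
# while token 1 gains most of its probability in the later layer.
early = np.array([3.0, 1.0, 0.0])
late = np.array([3.1, 2.9, 0.0])

scores = contrast_layers(early, late)
next_token = int(np.argmax(scores))
```

In this toy case the final layer alone would pick token 0, while the contrast picks token 1, the token whose probability rose the most across layers.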

Head-to-head fact-check: GPT-4 tops GPT-3.5, Bard, and Bing, but average accuracy is only ~65%

0.40

0.30

0.25

13

Off-the-shelf LLMs can flag likely false claims but only catch roughly two-thirds of cases on similar datasets, so firms should pair models with human review to avoid costly mistakes.

Key finding

Average accuracy across models is moderate.

Numbers: 65.25 / 100 average accuracy

A short warning reduces how believable LLM 'hallucinations' feel, but it does not stop people from liking or sharing them.

0.30

0.50

0.20

11

A short warning label reduces how believable AI-generated false claims feel and increases negative feedback. Use warnings to improve user flagging and training signals without hurting trust in accurate outputs.

Key finding

A short warning lowered perceived accuracy for hallucinated answers but not for genuine answers.

Numbers: Perceived accuracy: minor CON 3.27 → WARN 3.13; major CON 2.56 → WARN 2.30; genuine CON 3.97 → WARN 4.00

BSDETECTOR: add a confidence score to any black-box LLM to flag bad answers and pick safer outputs

0.70

0.50

0.55

7

BSDETECTOR gives a practical confidence score for any API-only LLM so teams can detect risky outputs, route uncertain cases to humans or alternative models, and improve downstream metrics without retraining models.

Key finding

BSDETECTOR yields higher AUROC for flagging wrong answers than baselines across multiple QA datasets.

Numbers: Text-Davinci-003 AUROC: GSM8K 0.867, CSQA 0.743, SVAMP 0.936, TriviaQA 0.828
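
For readers less familiar with AUROC as used here: it is the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen wrong one. The sketch below computes that directly and shows a simple routing rule. It does not reproduce how BSDETECTOR derives its confidence score; the scores, labels, function names, and the 0.5 threshold are made-up assumptions.

```python
def auroc(confidences, is_correct):
    """Fraction of (correct, wrong) pairs where the correct answer scores higher."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def route(confidence, threshold=0.5):
    """Send low-confidence answers to a human instead of the user."""
    return "answer" if confidence >= threshold else "human_review"

# Hypothetical confidence scores and correctness labels for five answers.
conf = [0.9, 0.8, 0.75, 0.7, 0.2]
correct = [True, True, False, True, False]
score = auroc(conf, correct)
decisions = [route(c) for c in conf]
```

An AUROC of 0.5 means the score is no better than chance at separating right from wrong answers; 1.0 means perfect separation.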

LLM explanations speed up fact-checking but cause dangerous over-reliance when they are wrong

0.60

0.30

0.40

6

LLM explanations let teams verify claims much faster but can mislead people when wrong; for important decisions, prioritize retrieval-grounded workflows or add checks to avoid over-reliance.

Key finding

ChatGPT explanations and retrieved Wikipedia passages both improve human accuracy over no help.

Numbers: Explanation 74% ±0.09 vs Retrieval 73% ±0.12 vs Baseline 59% ±0.12

Use search results in prompts to fix LLMs' outdated facts

0.70

0.50

0.60

6

You can cut hallucinations and quickly update factual behavior by piping search snippets into prompts instead of retraining models—a low-cost way to keep LLM-powered features current.

Key finding

Pretrained LLMs without web evidence perform poorly on up-to-date QA.

Numbers: STRICT: 0.8–32.0%; RELAXED: 0.8–46.4% across models
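
Piping search snippets into a prompt can be as simple as the sketch below. The prompt wording, the snippet fields (`date`, `source`, `text`), and the `max_snippets` cap are illustrative assumptions rather than the paper's exact format, and the snippets are placeholders.

```python
def build_prompt(question, snippets, max_snippets=3):
    """Prepend numbered search snippets to the question."""
    lines = ["Answer the question using the search results below.", ""]
    for i, s in enumerate(snippets[:max_snippets], start=1):
        lines.append(f"[{i}] {s['date']} | {s['source']}: {s['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

# Placeholder snippets; a real pipeline would fill these from a search API.
snippets = [
    {"date": "2024-01-01", "source": "example.com", "text": "Placeholder snippet A."},
    {"date": "2024-01-02", "source": "example.org", "text": "Placeholder snippet B."},
    {"date": "2024-01-03", "source": "example.net", "text": "Placeholder snippet C."},
]
prompt = build_prompt("Placeholder question?", snippets, max_snippets=2)
```

Capping the number of snippets keeps the prompt short; including dates gives the model a chance to prefer the most recent evidence.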

FFT: a 2,116-instance benchmark that measures LLM factuality, fairness, and toxicity

0.40

0.55

0.30

4

FFT shows models can spread wrong facts, make biased decisions, or appear safe out of context; companies must test models for factual errors and context-aware toxicity before using them in products.

Key finding

Factuality is weak, especially on counterfactual prompts.

Numbers: Table 4: GPT-4 overall factuality 0.54; counterfacts accuracy 0.254

Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications

0.60

0.40

4

ICD is a low-risk intervention to reduce factual errors at runtime without retraining the whole model; it can improve user trust in QA and content generation pipelines while requiring modest extra compute.

Key finding

ICD (finetuning-based induction) raises Llama2-7B-Chat TruthfulQA MC1 by +8.70, MC2 by +14.48, and MC3 by +13.13.

Numbers: MC1 +8.70; MC2 +14.48; MC3 +13.13 (Table 1)
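
The "induce, then penalize" step can be illustrated with a generic contrastive-decoding formula. This sketch assumes the form (1 + alpha) * log p_base - alpha * log p_induced; the exact weighting in the paper may differ, and the logits and `alpha` value are made up.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D logit vector."""
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def induce_then_contrast(base_logits, induced_logits, alpha=1.0):
    """Penalize tokens that the hallucination-induced model prefers."""
    base = log_softmax(base_logits)
    induced = log_softmax(induced_logits)
    return (1 + alpha) * base - alpha * induced

# Toy 3-token vocabulary: the induced (hallucination-prone) model
# strongly prefers token 0, which the base model only slightly prefers.
base_logits = np.array([2.0, 1.8, 0.0])
induced_logits = np.array([3.0, 0.5, 0.0])

scores = induce_then_contrast(base_logits, induced_logits, alpha=1.0)
chosen = int(np.argmax(scores))
```

Here the base model alone would emit token 0, but the penalty flips the choice to token 1. The extra compute comes from running the second, induced model at every decoding step.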

A short review plus a simple scoring formula to judge LLM output quality

0.30

0.35

0.50

4

Low information quality in LLM outputs can cause bad decisions, legal risk, and user distrust; measuring and filtering quality reduces downstream risk and saves money on remediation.

Key finding

LLM information quality can be expressed as a weighted sum of three dimensions: accuracy, consistency, relevance.
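
A weighted sum of this kind is straightforward to compute. In the sketch below the weights, the 0-to-1 scale, and the requirement that weights sum to 1 are assumptions made for illustration; the source states only that quality is a weighted sum of the three dimensions.

```python
def information_quality(accuracy, consistency, relevance, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three dimension scores, each assumed to lie in [0, 1]."""
    w_a, w_c, w_r = weights
    if abs(w_a + w_c + w_r - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return w_a * accuracy + w_c * consistency + w_r * relevance

# Hypothetical scores for one model output.
score = information_quality(accuracy=0.8, consistency=0.6, relevance=0.9)
```

With these made-up inputs the score is 0.5 * 0.8 + 0.3 * 0.6 + 0.2 * 0.9 = 0.76. Teams would tune the weights to reflect how costly each kind of failure is in their product.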

MUSE: an LLM augmented with vision and credibility-aware web retrieval that corrects social media misinformation

0.70

0.60

0.70

4

MUSE can automate credible, auditable misinformation corrections at scale, reducing dependence on slow, expensive human fact-checking and improving user belief accuracy for platforms and publishers.

Key finding

MUSE achieves higher overall expert-rated quality than baselines.

Numbers: Mean overall quality: MUSE 8.1, GPT-4 5.9, laypeople (high) 6.3

CARE-MI: a 1,612-sample Chinese benchmark to measure long-form misinformation in maternity and infant care

0.40

0.45

0.40

4

If you deploy Chinese LLMs in maternal or infant health contexts, expect factual errors; CARE-MI helps measure and reduce that risk with an expert-validated benchmark and an automated judge that uses retrieved evidence.

Key finding

CARE-MI contains 1,612 expert-validated long-form (LF) samples from an initial pool of 5,779 synthetic samples.

Numbers: 1,612 final samples (5,779 initial; 1,624 passing thresholds before 12 linguistic exclusions)

A claim-level, 8-step benchmark and toolset for measuring and fixing LLM factual errors

0.35

0.70

0.60

3

Factcheck-Bench reveals where fact-checking pipelines fail and offers a reusable toolset to test retrieval, stance, verification, and edit steps, helping teams prioritize fixes that reduce real-world factual errors.

Key finding

Most retrieved evidence is irrelevant.

Numbers: 2057/3305 evidence pieces irrelevant (~62%)

ChatGPT-4 flags misleading headlines well on clear cases; mixed results elsewhere

0.40

0.30

0.25

3

A well-tuned LLM (ChatGPT-4) can triage misleading headlines cheaply and fast, but ambiguous cases still need human review to avoid false flags.

Key finding

Small labeled set: 60 articles with final labels.

Numbers: 60 articles; 37 misleading, 23 non-misleading

Visual instruction tuning improves LLM truthfulness and ethics

0.60

0.70

3

Small, curated multi-modal instruction sets can improve model truthfulness and ethics faster and more cheaply than large-scale human RLHF, so teams can prototype alignment improvements with limited data.

Key finding

Visual instruction tuning raised LLaMA2-7B truthfulness on TruthfulQA-mc to 46.0%.

Numbers: TruthfulQA-mc = 46.0%

Ask the model to judge its own answers so you can abstain when it's likely wrong

0.60

0.35

0.40

2

You can reduce harmful or low-quality outputs by having models score their own answers and abstain when confidence is low; this improves trust without expensive human labels.

Key finding

Token-level self-evaluation greatly improves calibration versus sequence likelihood on TRUTHFULQA (PaLM-2).

Numbers: Calibration-AUC: sequence likelihood 39.80% -> Hybrid with "none of the above" option (nota) 75.34%
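
Abstaining on low self-evaluation scores reduces to a threshold check. This sketch assumes each candidate answer already comes with a self-evaluation score in [0, 1]; how the model produces that score is not shown, and the function name, threshold, and example values are hypothetical.

```python
def answer_or_abstain(candidates, threshold=0.6):
    """Return the best-scored answer, or None to abstain.

    candidates: list of (answer, self_eval_score) pairs
    """
    best_answer, best_score = max(candidates, key=lambda pair: pair[1])
    if best_score < threshold:
        return None  # abstain: even the best candidate looks unreliable
    return best_answer

# Hypothetical self-evaluation scores for two questions.
confident = answer_or_abstain([("answer A", 0.82), ("answer B", 0.40)])
unsure = answer_or_abstain([("answer C", 0.35), ("answer D", 0.30)])
```

The threshold trades coverage for accuracy: raising it makes the system abstain more often but answer more reliably when it does respond.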

GRATH: make a 7B LLM substantially more truthful using self-generated paired answers and DPO

0.60

0.70

2

You can substantially reduce factual errors for deployed 7B models without costly human annotation by generating paired data and fine-tuning with DPO; this is fast and parameter-efficient with LoRA.

Key finding

GRATH lifts Llama2-Chat-7B MC1 from 30.23% to 54.71% and MC2 from 45.32% to 69.10% on TruthfulQA.

Numbers: MC1 +24.48pp, MC2 +23.78pp (Table 1)
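
For context, the standard DPO objective on a single (preferred, rejected) answer pair looks like the sketch below. It shows only the generic loss, not GRATH's data generation or training loop; the log-probabilities and `beta` value are made up.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen answer
    over the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the truthful (chosen) answer: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In a GRATH-style setup, the "chosen" answer would be the more truthful one of each self-generated pair, so minimizing this loss pushes the model toward truthful answers without human-written labels.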

OpenFactCheck: a plug-and-play toolkit and benchmark suite to build and compare automatic fact-checkers and to measure LLM factuality

0.60

0.50

0.60

2

OpenFactCheck helps teams measure and reduce factual errors in LLM outputs with reusable pipelines and cost-aware comparisons, letting you pick tradeoffs between accuracy, speed, and price.

Key finding

Open-domain LLM responses are mostly factually correct on claim-level checks.

Numbers: 89%–94% true claims on FacTool-QA, FELM-WK, Factcheck-Bench

TruthEval: 885 curated statements to test LLM truthfulness and answer consistency

0.60

0.40

0.30

2

TruthEval helps you pick or vet LLMs by exposing truthfulness gaps and prompt-sensitive inconsistencies, reducing the risk of deploying models that contradict themselves in production.

Key finding

TruthEval contains 885 statements across six categories (Facts, Conspiracies, Controversies, Misconceptions, Stereotypes, Fiction).

Numbers: 885 total; category counts in Table 1

Two-phase Verification: a probability-free check to detect hallucinations in medical QA

0.40

0.60

0.50

2

Medical LLM outputs can be confidently wrong; adding a verification chain reduces risk by flagging uncertain answers before they reach users.

Key finding

Two-phase Verification achieved the highest overall average AUROC among methods tested.

Numbers: Overall average AUROC = 0.5858; 13b average = 0.6053

LLM agents can help fact-checking but need context, translations, and human oversight

0.60

0.40

0.60

2

LLM agents can speed fact-check workflows and triage clear misinformation, but they remain fallible on nuance and multilingual claims so human oversight is required.

Key finding

Providing web context improves fact-check accuracy.

Numbers: Context raises accuracy to >80% for clear cases; no-context 63–75% avg

LLMs make more mistakes on factual-sounding claims than on opinions across 61K multilingual fact-checks

0.40

0.55

0.60

1

LLMs can help scale fact-checking but often miss factual claims and skip many judgments; companies should not rely on LLM-only pipelines for high-stakes verification.

Key finding

Large multilingual dataset released: FactSpan contains 61,514 claims in 30 languages.

Numbers: Total claims = 61,514; languages = 30