Detect hallucinated facts from any black‑box LLM by sampling its own alternative outputs

Overview

Decision Snapshot: Ready For Pilot

The method is straightforward and works on closed APIs; prompt and NLI variants show strong empirical gains on the provided dataset, but prompt checks can be costly and results are validated on GPT-3 WikiBio passages only.

Citations: 33

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 70%

Authors

Potsawee Manakul, Adian Liusie, Mark J. F. Gales

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can flag likely false claims from closed-source LLMs without buying or building knowledge bases; this reduces misinformation risk in customer-facing text generation.

Who Should Care

Product Manager, Founder, ML Engineer, Data Scientist

Summary TLDR

SelfCheckGPT flags hallucinated sentences from black-box LLMs without external knowledge. It samples multiple stochastic completions for the same prompt, then scores how consistent a target sentence is with the sampled set. Several scoring variants (prompting, NLI, BERTScore, n-gram, QA) are tested. The best zero-resource method (prompting) strongly outperforms simple probability- or proxy-LM-based baselines on a GPT-3 generated WikiBio dataset, and the authors release annotations and code.

Problem Statement

LLMs often invent facts (hallucinate). Existing detectors need either token probabilities (not available for closed APIs) or external knowledge sources. A zero-resource method is needed that works with black-box LLMs and flags non-factual content without either.

Main Contribution

SelfCheckGPT: a sampling-based, zero-resource pipeline that flags hallucinated sentences by measuring consistency across multiple sampled outputs from the same black-box LLM

Five practical scoring variants: Prompt-based, NLI, BERTScore, QA-based, and n-gram (unigram) methods, with implementation details and costs
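The pipeline described above can be sketched as a small function with a pluggable scorer. This is a minimal illustration, not the authors' implementation: `split_sentences`, `selfcheck`, and `toy_inconsistency` are hypothetical names, the regex sentence splitter is a crude stand-in for a real tokenizer, and the toy word-overlap scorer only stands in for the prompt, NLI, BERTScore, QA, or n-gram variants.

```python
import re
from statistics import mean
from typing import Callable, List


def split_sentences(passage: str) -> List[str]:
    """Naive split on '.', '!' or '?' followed by whitespace (assumption:
    a real pipeline would use a proper sentence tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def selfcheck(
    passage: str,
    samples: List[str],
    inconsistency: Callable[[str, str], float],
) -> List[float]:
    """Return one score per sentence of `passage`: the mean inconsistency
    of that sentence against every sampled output. Higher = more suspect."""
    if not samples:
        raise ValueError("need at least one sampled output")
    return [
        mean(inconsistency(sentence, sample) for sample in samples)
        for sentence in split_sentences(passage)
    ]


def toy_inconsistency(sentence: str, sample: str) -> float:
    """Hypothetical stand-in scorer: fraction of the sentence's words
    that do not occur anywhere in the sample."""
    words = re.findall(r"\w+", sentence.lower())
    sample_words = set(re.findall(r"\w+", sample.lower()))
    if not words:
        return 0.0
    return sum(w not in sample_words for w in words) / len(words)
```

Calling `selfcheck(passage, samples, toy_inconsistency)` returns scores in passage order, so a sentence whose content never recurs in the samples should score higher than one the samples keep repeating. Keeping the scorer as a parameter makes it easy to compare variants on the same samples.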

Key Findings

Prompt-based SelfCheckGPT achieved the strongest results at both sentence and passage levels.

Numbers: Sentence AUC-PR (NonFact)=93.42; Passage Pearson=78.32 (Table 2)

Practical Use: Use a prompt-based consistency check (ask an LLM whether a sentence is supported by sampled contexts) when you can afford API calls; it gives the best zero-resource detection.

Evidence Ref: Table 2
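A prompt-based check of this kind might look like the sketch below. Several things here are assumptions rather than details taken from this summary: the prompt wording, the choice to query once per sampled context, the mapping of replies to scores (Yes = 0, No = 1, anything else = 0.5), and the `ask_llm` callable, which is a hypothetical wrapper around whatever chat or completion API is in use.

```python
import re
from typing import Callable, List

# Hypothetical prompt wording; treat it as a placeholder to tune.
PROMPT_TEMPLATE = (
    "Context: {context}\n"
    "Sentence: {sentence}\n"
    "Is the sentence supported by the context above? Answer Yes or No:"
)


def answer_to_score(answer: str) -> float:
    """Map a free-text reply to an inconsistency score (assumed mapping)."""
    match = re.match(r"\W*(yes|no)\b", answer.strip().lower())
    if match is None:
        return 0.5  # unparseable reply: treat as uninformative
    return 0.0 if match.group(1) == "yes" else 1.0


def prompt_selfcheck(
    sentence: str,
    samples: List[str],
    ask_llm: Callable[[str], str],
) -> float:
    """Ask the LLM once per sampled context and average the mapped replies.
    Higher = less support across samples."""
    if not samples:
        raise ValueError("need at least one sampled output")
    scores = [
        answer_to_score(
            ask_llm(PROMPT_TEMPLATE.format(context=sample, sentence=sentence))
        )
        for sample in samples
    ]
    return sum(scores) / len(scores)
```

Under this design the cost is one LLM call per sentence per sample, which is why the summary flags prompt checks as potentially expensive.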

NLI-based SelfCheckGPT gives near-top performance with lower compute than prompting.

Numbers: Sentence AUC-PR (NonFact)=92.50; Passage Pearson=74.14 (Table 2)

Practical Use: If prompt-based checks are too costly, run an NLI classifier over sampled outputs for a practical accuracy/compute trade-off.

Evidence Ref: Table 2
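One way to turn NLI outputs into a sentence score is sketched below. It assumes a hypothetical `nli_logits(premise, hypothesis)` callable that wraps an MNLI-fine-tuned classifier and returns the entailment and contradiction logits. Normalizing over those two classes only (dropping neutral) is an assumption that should be checked against the released code.

```python
import math
from typing import Callable, List, Tuple


def contradiction_prob(entail_logit: float, contra_logit: float) -> float:
    """Two-way softmax over entailment and contradiction, ignoring neutral.
    The max is subtracted first for numerical stability."""
    shift = max(entail_logit, contra_logit)
    e = math.exp(entail_logit - shift)
    c = math.exp(contra_logit - shift)
    return c / (e + c)


def nli_selfcheck(
    sentence: str,
    samples: List[str],
    nli_logits: Callable[[str, str], Tuple[float, float]],
) -> float:
    """Average contradiction probability of `sentence` (the hypothesis)
    against each sampled output (the premise). Higher = more suspect."""
    if not samples:
        raise ValueError("need at least one sampled output")
    probs = [
        contradiction_prob(*nli_logits(sample, sentence)) for sample in samples
    ]
    return sum(probs) / len(probs)
```

Because the classifier runs locally, this variant avoids the per-sentence API calls of the prompt-based check, at the cost of relying on a smaller model's judgment.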

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Sentence-level AUC-PR (NonFact)	SelfCk-Prompt=93.42; SelfCk-NLI=92.50; GPT-3 p avg(-log p)=83.21; SelfCk-unigram(max)=85.63	Random=72.96	Prompt +10.21 vs GPT-3 prob	GPT-3 generated WikiBio (238 passages, 1908 sentences)	Table 2 (sentence-level AUC-PR)	Table 2
Sentence-level AUC-PR (Factual)	SelfCk-Prompt=67.09; SelfCk-NLI=66.08; GPT-3 p avg(-log p)=53.97	Random=27.04	Prompt +13.12 vs GPT-3 prob	GPT-3 generated WikiBio	Table 2 (sentence-level AUC-PR)	Table 2
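To read the AUC-PR figures above, it helps to recall that a random ranker's expected precision equals the prevalence of the positive class, which is consistent with the two Random baselines (72.96 and 27.04) summing to 100. The sketch below computes average precision, a common step-wise estimate of AUC-PR. It is an illustration under assumptions: the exact estimator behind the reported numbers is not stated in this summary, and tied scores are ordered arbitrarily here.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise area under the precision-recall curve.
    `labels` uses 1 for the positive class (e.g. non-factual) and 0 otherwise;
    higher `scores` should indicate the positive class."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / total_pos


def prevalence(labels: Sequence[int]) -> float:
    """Share of positives: the expected precision of a random ranker."""
    return sum(labels) / len(labels)
```

Comparing `average_precision(scores, labels)` against `prevalence(labels)` on your own labelled sentences gives the same kind of "method vs Random" reading as the table.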

What To Try In 7 Days

Run unigram(max) SelfCheck: sample N=20 completions and treat tokens that rarely appear across them as a cheap hallucination signal

If budget allows, implement prompt-based SelfCheck: ask an LLM to answer Yes or No on whether each sentence is supported by the sampled contexts

Use an NLI classifier on sampled outputs as a middle ground between cost and accuracy
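As a starting point for the first item, a unigram(max) score can be sketched with the standard library alone. The tokenizer, the add-one smoothing, and the function names below are assumptions made for illustration; the released code may build and smooth its n-gram model differently.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercased word tokens (assumption: a crude regex tokenizer)."""
    return re.findall(r"\w+", text.lower())


def unigram_max_score(sentence: str, samples: List[str]) -> float:
    """Fit a unigram model on the sampled outputs and return the largest
    negative log-probability among the sentence's tokens.
    Higher = the sentence contains a token the samples rarely produce."""
    counts = Counter(tok for sample in samples for tok in tokenize(sample))
    sent_tokens = tokenize(sentence)
    if not sent_tokens:
        return 0.0
    # Add-one smoothing over the joint vocabulary so unseen tokens get
    # a small but non-zero probability.
    vocab = set(counts) | set(sent_tokens)
    total = sum(counts.values()) + len(vocab)
    return max(-math.log((counts[tok] + 1) / total) for tok in sent_tokens)
```

Taking the maximum rather than the mean means a single rare token (a date, a name) is enough to raise the sentence's score, which matches the "flag rare tokens" intent of the item above.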

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/potsawee/selfcheckgpt

Data URLs

https://github.com/potsawee/selfcheckgpt

Risks & Boundaries

Limitations

Evaluation is limited to GPT-3 generated WikiBio passages (people-focused bios), so results may not generalize to other domains

Sentence-level labels can hide mixed factual/non-factual content inside one sentence

When Not To Use

When you have an affordable, accurate external knowledge source and retrieval pipeline (use retrieval-based verification instead)

When model access is deterministic (temperature=0) so sampling yields no diversity
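The second condition can be checked cheaply before running any scorer. The sketch below counts how many sampled outputs are distinct after whitespace and case normalization. Exact-match distinctness is a crude proxy for diversity, and both the helper names and the 0.5 threshold are arbitrary assumptions.

```python
from typing import List


def distinct_ratio(samples: List[str]) -> float:
    """Fraction of sampled outputs that are unique after normalizing
    whitespace and case. 1.0 = all different; near 0 = near-identical."""
    if not samples:
        raise ValueError("need at least one sampled output")
    normalized = {" ".join(sample.split()).lower() for sample in samples}
    return len(normalized) / len(samples)


def sampling_is_usable(samples: List[str], min_ratio: float = 0.5) -> bool:
    """Heuristic gate: require more than one sample and enough distinct ones.
    The default threshold is an arbitrary placeholder."""
    return len(samples) > 1 and distinct_ratio(samples) >= min_ratio
```

If the gate fails, consistency scores will look uniformly "supported" regardless of factuality, so raising the sampling temperature (where the API allows it) or skipping the check is safer than trusting the output.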

Failure Modes

Proxy LLM mismatch: using a different model as a proxy gives unreliable uncertainty estimates (Tables 2, 8)

Low sample counts reduce effectiveness; some variants need many samples to plateau (n-gram needs most)
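A rough way to see whether a sample budget is large enough is to watch how a sentence's score settles as samples are added. The sketch below is only a proxy: it tracks the stability of one sentence's averaged score, not the detection metric itself, and the function names and the 0.05 tolerance are assumptions.

```python
from typing import List


def running_scores(per_sample_scores: List[float]) -> List[float]:
    """Element k-1 is the sentence score computed from the first k samples
    (the running mean of the per-sample inconsistency scores)."""
    running: List[float] = []
    total = 0.0
    for k, score in enumerate(per_sample_scores, start=1):
        total += score
        running.append(total / k)
    return running


def samples_to_plateau(running: List[float], tol: float = 0.05) -> int:
    """Smallest k such that every running score from k onward stays
    within `tol` of the final value."""
    if not running:
        raise ValueError("need at least one running score")
    final = running[-1]
    k = len(running)
    for i in range(len(running) - 1, -1, -1):
        if abs(running[i] - final) > tol:
            break
        k = i + 1
    return k
```

If `samples_to_plateau` keeps returning values close to the full budget, the scores have probably not stabilized and more samples (or a less sample-hungry variant) may be needed.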

Core Entities

Models

GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), LLaMA-{7B,13B,30B}, OPT-{125m,1.3B,13B,30B}, GPT-NeoX-20B, GPT-J-6B, RoBERTa-Large, DeBERTa-v3-large, T5-Large, Longformer

Metrics

AUC-PR, Pearson correlation, Spearman correlation, Cohen's kappa (annotation agreement)

Datasets

WikiBio (generated GPT-3 passages), SQuAD (used for QA components), RACE (used for QA components), MNLI (NLI model fine-tuning)

Benchmarks

GPT-3 WikiBio hallucination dataset (this work)
