Detect hallucinated facts from any black‑box LLM by sampling its own alternative outputs

March 15, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is straightforward and works on closed APIs; prompt and NLI variants show strong empirical gains on the provided dataset, but prompt checks can be costly and results are validated on GPT-3 WikiBio passages only.

Citations: 33

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 70%

Authors

Potsawee Manakul, Adian Liusie, Mark J. F. Gales

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can flag likely false claims from closed-source LLMs without buying or building knowledge bases; this reduces misinformation risk in customer-facing text generation.

Who Should Care

Summary TLDR

SelfCheckGPT flags hallucinated sentences from black-box LLMs without external knowledge. It samples multiple stochastic completions for the same prompt, then scores how consistent a target sentence is with the sampled set. Several scoring variants (prompting, NLI, BERTScore, n-gram, QA) are tested. The best zero-resource method (prompting) strongly outperforms simple probability- or proxy-LM-based baselines on a GPT-3 generated WikiBio dataset, and the authors release annotations and code.
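The sample-then-score idea can be sketched in a few lines. This is a minimal illustration, not the released implementation: `selfcheck_scores`, the `Scorer` signature, and `toy_scorer` are hypothetical names, and the toy scorer (verbatim substring match) only stands in for the real variants (prompting, NLI, BERTScore, n-gram, QA).

```python
from typing import Callable, List

# Hypothetical scorer signature: (sentence, sample) -> inconsistency in [0, 1],
# where higher means the sample does NOT support the sentence.
Scorer = Callable[[str, str], float]


def selfcheck_scores(sentences: List[str], samples: List[str], scorer: Scorer) -> List[float]:
    """Average a per-sample inconsistency score for each target sentence."""
    if not samples:
        raise ValueError("need at least one sampled completion")
    return [
        sum(scorer(sentence, sample) for sample in samples) / len(samples)
        for sentence in sentences
    ]


def toy_scorer(sentence: str, sample: str) -> float:
    """Stand-in scorer: 0.0 if the sentence appears verbatim in the sample, else 1.0."""
    return 0.0 if sentence.lower() in sample.lower() else 1.0


# In practice the samples would be stochastic completions of the same prompt.
samples = [
    "Ada Lovelace was born in London. She wrote about the Analytical Engine.",
    "Ada Lovelace was born in London. She studied mathematics.",
]
sentences = ["Ada Lovelace was born in London.", "She was born in Paris."]
scores = selfcheck_scores(sentences, samples, toy_scorer)
```

Here the first sentence is repeated in every sample and scores 0.0, while the second appears in none and scores 1.0. A real scorer would replace `toy_scorer` without changing the surrounding loop.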

Problem Statement

LLMs often invent facts (hallucinate). Existing detectors need either token probabilities (not available from closed APIs) or external knowledge sources. The goal is a zero-resource method that works with black-box LLMs and flags non-factual content.

Main Contribution

SelfCheckGPT: a sampling-based, zero-resource pipeline that flags hallucinated sentences by measuring consistency across multiple sampled outputs from the same black-box LLM

Five practical scoring variants: Prompt-based, NLI, BERTScore, QA-based, and n-gram (unigram) methods, with implementation details and costs

Key Findings

Prompt-based SelfCheckGPT achieved the strongest results at both sentence and passage levels.

Numbers: Sentence AUC-PR (NonFact)=93.42; Passage Pearson=78.32 (Table 2)

Practical Use: Use a prompt-based consistency check (ask an LLM whether a sentence is supported by sampled contexts) when you can afford API calls; it gives the best zero-resource detection.

Evidence Ref: Table 2
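A prompt-based check along these lines might look like the sketch below. The prompt wording, the Yes/No/other score mapping (0.0 / 1.0 / 0.5), and the names `prompt_selfcheck`, `answer_to_score`, and `fake_llm` are assumptions for illustration; `ask_llm` is a placeholder for whatever API call you use, and `fake_llm` replaces it so the example runs offline.

```python
from typing import Callable, List

# Illustrative prompt wording (an assumption, not necessarily the authors' exact text).
PROMPT_TEMPLATE = (
    "Context: {context}\n"
    "Sentence: {sentence}\n"
    "Is the sentence supported by the context above? Answer Yes or No:"
)


def answer_to_score(answer: str) -> float:
    """Map a reply to a score: Yes -> 0.0, No -> 1.0, anything else -> 0.5 (assumed mapping)."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!:;") if words else ""
    if first == "yes":
        return 0.0
    if first == "no":
        return 1.0
    return 0.5


def prompt_selfcheck(sentence: str, samples: List[str], ask_llm: Callable[[str], str]) -> float:
    """Average the Yes/No scores over all sampled contexts (higher = more suspect)."""
    if not samples:
        raise ValueError("need at least one sampled completion")
    scores = [
        answer_to_score(ask_llm(PROMPT_TEMPLATE.format(context=s, sentence=sentence)))
        for s in samples
    ]
    return sum(scores) / len(scores)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real API call: says Yes only if the context mentions London."""
    context = prompt.split("\nSentence:")[0]
    return "Yes" if "London" in context else "No."


score = prompt_selfcheck(
    "She was born in London.",
    ["Born in London.", "Born in Paris.", "Born in London in 1815."],
    fake_llm,
)
```

Each sentence costs one LLM call per sample, which is where the cost concern for this variant comes from.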

NLI-based SelfCheckGPT gives near-top performance with lower compute than prompting.

Numbers: Sentence AUC-PR (NonFact)=92.50; Passage Pearson=74.14 (Table 2)

Practical Use: If prompt-based checks are too costly, run an NLI classifier over sampled outputs for a practical accuracy/compute trade-off.

Evidence Ref: Table 2
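One way to turn NLI outputs into a sentence score is to average the contradiction probability over the samples. The sketch below assumes a hypothetical `nli` callable that returns entailment and contradiction logits for a (premise, hypothesis) pair, and it normalizes over those two classes only; both choices are assumptions for illustration, and `fake_nli` stands in for a real classifier.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical NLI interface: (premise, hypothesis) -> (entailment_logit, contradiction_logit).
NliFn = Callable[[str, str], Tuple[float, float]]


def contradiction_prob(entail_logit: float, contra_logit: float) -> float:
    """Two-class softmax over entailment and contradiction; returns P(contradiction)."""
    m = max(entail_logit, contra_logit)  # subtract the max for numerical stability
    e = math.exp(entail_logit - m)
    c = math.exp(contra_logit - m)
    return c / (e + c)


def nli_selfcheck(sentence: str, samples: List[str], nli: NliFn) -> float:
    """Mean contradiction probability of the sentence against each sample (the premise)."""
    if not samples:
        raise ValueError("need at least one sampled completion")
    probs = [contradiction_prob(*nli(sample, sentence)) for sample in samples]
    return sum(probs) / len(probs)


def fake_nli(premise: str, hypothesis: str) -> Tuple[float, float]:
    """Stand-in classifier: entails if the premise mentions London, else contradicts."""
    return (2.0, -2.0) if "London" in premise else (-2.0, 2.0)


nli_score = nli_selfcheck(
    "She was born in London.",
    ["Born in London.", "Born in Paris."],
    fake_nli,
)
```

With one supporting and one contradicting sample, the two probabilities are symmetric and the averaged score lands at 0.5.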

Results

Metric: Sentence-level AUC-PR (NonFact)
Value: SelfCk-Prompt=93.42; SelfCk-NLI=92.50; GPT-3 p avg(-log p)=83.21; SelfCk-unigram(max)=85.63
Baseline: Random=72.96
Delta: Prompt +10.21 vs GPT-3 prob
Split / Dataset: GPT-3 generated WikiBio (238 passages, 1908 sentences)
Evidence: Table 2 (sentence-level AUC-PR)
Evidence Ref: Table 2

Metric: Sentence-level AUC-PR (Factual)
Value: SelfCk-Prompt=67.09; SelfCk-NLI=66.08; GPT-3 p avg(-log p)=53.97
Baseline: Random=27.04
Delta: Prompt +13.12 vs GPT-3 prob
Split / Dataset: GPT-3 generated WikiBio
Evidence: Table 2 (sentence-level AUC-PR)
Evidence Ref: Table 2
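The reported deltas are simple differences between the prompt variant and the GPT-3 probability baseline. The snippet below recomputes them from the values listed above; the dictionary keys and the `delta` helper are illustrative names only.

```python
# AUC-PR values as listed in the results above (Table 2 of the paper).
nonfact = {
    "SelfCk-Prompt": 93.42,
    "SelfCk-NLI": 92.50,
    "GPT-3 avg(-log p)": 83.21,
    "SelfCk-unigram(max)": 85.63,
    "Random": 72.96,
}
factual = {
    "SelfCk-Prompt": 67.09,
    "SelfCk-NLI": 66.08,
    "GPT-3 avg(-log p)": 53.97,
    "Random": 27.04,
}


def delta(scores: dict, method: str, reference: str) -> float:
    """Difference in AUC-PR points between a method and a reference, rounded to 2 places."""
    return round(scores[method] - scores[reference], 2)


nonfact_delta = delta(nonfact, "SelfCk-Prompt", "GPT-3 avg(-log p)")  # 10.21
factual_delta = delta(factual, "SelfCk-Prompt", "GPT-3 avg(-log p)")  # 13.12
```

The same helper can compare any pair, for example a variant against the random baseline.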

What To Try In 7 Days

Run unigram(max) SelfCheck: sample N=20, flag tokens that appear rarely across samples as cheap hallucination signals

If budget allows, implement prompt-based SelfCheck: ask an LLM Yes/No if a sentence is supported by sampled contexts

Use an NLI classifier on sampled outputs as a middle ground between cost and accuracy
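For the unigram(max) step, a rough starting point is sketched below. It assumes a regex tokenizer, add-one smoothing, and a vocabulary built from the samples plus the sentences being checked; these details and the function names are illustrative choices, not the released code. Each sentence is scored by its rarest token, so a single unsupported name or date can raise the score.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_max_scores(sentences: List[str], samples: List[str]) -> List[float]:
    """Score each sentence by the max negative log-probability of its tokens
    under an add-one-smoothed unigram model fitted on the samples."""
    counts = Counter(tok for sample in samples for tok in tokenize(sample))
    sentence_tokens = [tokenize(s) for s in sentences]
    vocab = set(counts) | {tok for toks in sentence_tokens for tok in toks}
    total = sum(counts.values()) + len(vocab)  # add-one smoothing denominator
    scores = []
    for toks in sentence_tokens:
        neg_logs = [-math.log((counts[tok] + 1) / total) for tok in toks]
        scores.append(max(neg_logs) if neg_logs else 0.0)
    return scores


# Toy data; the suggestion above is to draw N=20 real samples.
toy_samples = ["born in london", "born in london", "born in london"]
toy_sentences = ["born in london", "born in paris"]
unigram_scores = unigram_max_scores(toy_sentences, toy_samples)
```

In this toy case "paris" never appears in the samples, so the second sentence gets the higher score. A threshold for flagging would need to be tuned on labeled data.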

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation is limited to GPT-3 generated WikiBio passages (people-focused bios), so results may not generalize to other domains

Sentence-level labels can hide mixed factual/non-factual content inside one sentence

When Not To Use

When you have an affordable, accurate external knowledge source and retrieval pipeline (use retrieval-based verification instead)

When model access is deterministic (temperature=0) so sampling yields no diversity
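Before running any consistency scorer, it can help to confirm that the samples actually differ. The check below is a hypothetical guard (the name `distinct_ratio` and the whitespace/case normalization are assumptions): a ratio near 1/N means the N samples are effectively identical, so agreement among them says little about factuality.

```python
from typing import List


def distinct_ratio(samples: List[str]) -> float:
    """Fraction of samples that are unique after collapsing whitespace and case."""
    if not samples:
        raise ValueError("need at least one sampled completion")
    normalized = [" ".join(s.split()).lower() for s in samples]
    return len(set(normalized)) / len(normalized)


identical = distinct_ratio(["Born in London.", "born in  London.", "Born in London."])
diverse = distinct_ratio(["Born in London.", "Born in Paris."])
```

Here `identical` is 1/3 (one unique text among three) and `diverse` is 1.0.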

Failure Modes

Proxy LLM mismatch: estimating uncertainty with a proxy model different from the one that generated the text gives unreliable results (Tables 2 and 8)

Low sample counts reduce effectiveness; some variants need many samples to plateau (n-gram needs most)

Core Entities

Models

GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), LLaMA-{7B,13B,30B}, OPT-{125m,1.3B,13B,30B}, GPT-NeoX-20B, GPT-J-6B, RoBERTa-Large, DeBERTa-v3-large, T5-Large, Longformer

Metrics

AUC-PR, Pearson correlation, Spearman correlation, Cohen's kappa (annotation agreement)

Datasets

WikiBio (generated GPT-3 passages), SQuAD (used for QA components), RACE (used for QA components), MNLI (NLI model fine-tune)

Benchmarks

GPT-3 WikiBio hallucination dataset (this work)