Overview
Production Readiness: 0.35
Novelty Score: 0.7
Cost Impact Score: 0.6
Citation Count: 3
Why It Matters For Business
Factcheck-Bench reveals where fact-checking pipelines fail and offers a reusable toolset to test retrieval, stance, verification, and edit steps, helping teams prioritize fixes that reduce real-world factual errors.
Summary TLDR
This paper introduces Factcheck-Bench: an 8-step annotation framework, a semi-automatic annotation tool, and a small open benchmark for evaluating end-to-end automatic fact-checkers on LLM outputs. The dataset contains 94 (question, response) pairs with 678 atomic claims (661 checkworthy) and 3,305 claim–evidence pairs. Experiments show retrieval often returns irrelevant evidence (≈62%), one-third of claims need extra manual search, and best verifiers struggle to spot false claims (best F1=0.63 on false claims by Factcheck-GPT). The repo with tool, data, and code is public.
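As a quick sanity check on the dataset counts quoted above, the following sketch derives a few ratios. It uses only the numbers in this summary. The result of five evidence snippets per checkworthy claim is an inference from those counts, not something the summary states directly.

```python
# Counts quoted in the summary above.
pairs = 94
claims = 678
checkworthy = 661
claim_evidence_pairs = 3305

# Derived ratios (simple arithmetic on the quoted counts).
claims_per_response = claims / pairs
checkworthy_share = checkworthy / claims
evidence_per_checkworthy_claim = claim_evidence_pairs / checkworthy

print(f"claims per response: {claims_per_response:.1f}")
print(f"checkworthy share: {checkworthy_share:.1%}")
print(f"evidence per checkworthy claim: {evidence_per_checkworthy_claim:.1f}")
```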
Problem Statement
Automatic fact-checkers lack a unified, fine-grained way to evaluate the full detect-and-correct pipeline for long, open-domain LLM outputs. Existing evaluations are coarse (true/false) or task-specific and fail to expose which pipeline steps fail in practice.
Main Contribution
A fine-grained, eight-step annotation framework that decomposes LLM outputs into decontextualised claims and tracks evidence, stance, correction, and revision.
A semi-automatic annotation tool and a claim-level benchmark of 94 annotated (question, response) pairs, 678 atomic claims, and 3,305 claim–evidence pairs.
Step-by-step evaluations showing that current automatic retrievers and verifiers leave substantial headroom: retrieval returns many irrelevant snippets, and verifiers perform worse on false claims than on true ones.
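To make the claim-level structure concrete, here is a minimal sketch of how one annotated record could be represented. The class and field names (`Evidence`, `Claim`, `AnnotatedResponse`, `claims_needing_correction`) are illustrative assumptions and are not the schema of the released data; only the four stance labels come from this summary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stance labels as listed in this summary.
STANCES = ("support", "partial", "refute", "irrelevant")


@dataclass
class Evidence:
    snippet: str
    stance: str  # must be one of STANCES

    def __post_init__(self) -> None:
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")


@dataclass
class Claim:
    text: str                          # decontextualised atomic claim
    checkworthy: bool = True
    evidence: List[Evidence] = field(default_factory=list)
    correction: Optional[str] = None   # corrected claim, if one was written


@dataclass
class AnnotatedResponse:
    question: str
    response: str
    claims: List[Claim] = field(default_factory=list)
    revised_response: Optional[str] = None


def claims_needing_correction(record: AnnotatedResponse) -> List[Claim]:
    """Return the claims for which an annotator supplied a correction."""
    return [c for c in record.claims if c.correction is not None]
```

Keeping evidence and stance attached to each claim, rather than to the whole response, is what lets retrieval, stance detection, verification, and editing be scored separately.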
Key Findings
Most retrieved evidence is irrelevant.
One third of claims need manual search to decide veracity.
False claims are harder to detect than true claims.
Claim decomposition by ChatGPT largely aligns with humans.
Automatic revision metrics misalign with human preferences.
Results
Factcheck-GPT F1 (false claims): 0.63
Automatic evidence irrelevance: ≈62%
Claims needing manual retrieval: about one third
Claims needing correction
Decomposition agreement
Human preference for revised responses
Who Should Care
What To Try In 7 Days
Run your verifier on Factcheck-Bench and compare its false-claim F1 against the reported best of 0.63.
Measure evidence relevance: compute the percentage of retrieved snippets that are irrelevant (≈62% reported) and aim to halve it.
Switch to claim-level decomposition and test whether targeted edits reduce hallucinations.
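The first two checks above can be sketched in a few lines. This is a minimal illustration, assuming you already have gold and predicted claim labels as parallel lists and a flat list of stance labels for retrieved snippets; the function names and toy data are hypothetical and do not come from the Factcheck-Bench repository.

```python
def f1_for_label(gold, pred, label):
    """F1 for a single class, e.g. label='false' for false-claim F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def irrelevant_share(stances):
    """Fraction of retrieved snippets labelled 'irrelevant'."""
    return stances.count("irrelevant") / len(stances) if stances else 0.0


# Toy data, for illustration only.
gold = ["true", "false", "false", "true", "false"]
pred = ["true", "false", "true", "true", "true"]
false_f1 = f1_for_label(gold, pred, "false")

stances = ["support", "irrelevant", "irrelevant", "refute"]
share = irrelevant_share(stances)
```

Reporting F1 on the false class separately matters here because false claims are the minority, so overall accuracy can look good while false-claim detection stays weak.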
Agent Features
Tool Use
- Google Search for evidence
- Sentence-BERT re-ranker
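The retrieve-then-re-rank pattern named above can be sketched as follows. This is not the paper's implementation: `toy_embed` is a bag-of-words stand-in for a Sentence-BERT encoder so that the example runs with the standard library alone, and `rerank` is a hypothetical helper. In practice you would pass a real sentence-embedding function as `embed` and a matching similarity function as `similarity`.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a sentence encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank(claim, snippets, embed=toy_embed, similarity=cosine, top_k=3):
    """Score each snippet against the claim and keep the top_k best."""
    claim_vec = embed(claim)
    scored = [(similarity(claim_vec, embed(s)), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up snippets standing in for search results.
snippets = [
    "Bananas are rich in potassium.",
    "The Eiffel Tower stands in Paris, France.",
    "Paris is the capital of France.",
]
for score, snippet in rerank("The Eiffel Tower is in Paris", snippets, top_k=2):
    print(f"{score:.2f}  {snippet}")
```

A similarity threshold on the score, in addition to `top_k`, is one way to drop irrelevant snippets before they reach the verifier.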
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Small dataset: only 94 annotated (question, response) pairs, which limits broad generalization.
- Inter-claim dependencies and global logical errors are not well captured by claim-level edits.
- Automatic retrieval supplied many irrelevant snippets, biasing verification steps.
When Not To Use
- As a large-scale training corpus: the dataset is too small for model training.
- To evaluate global procedural correctness or long-range logical order.
- As a single-source proof of verifier performance across all domains.
Failure Modes
- Poor retrieval leads to wrong verdicts despite a capable verifier.
- Automatic decomposition can over-split or under-split edge cases, changing verifiability.
- Intrinsic edit/semantic metrics can prefer lexically smaller edits that humans dislike.
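The last failure mode can be illustrated with a word-level edit distance. The sketch below is an assumption-laden toy and not the paper's evaluation setup: the sentences are made up, and `word_edit_distance` is a plain Levenshtein distance over whitespace tokens. It only shows why a distance-based metric rewards the smallest possible edit regardless of how helpful the revision is to a reader.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over whitespace-separated tokens."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, xw in enumerate(x, start=1):
        curr = [i]
        for j, yw in enumerate(y, start=1):
            cost = 0 if xw == yw else 1
            curr.append(min(prev[j] + 1,          # delete a word from a
                            curr[j - 1] + 1,      # insert a word from b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


# Made-up example sentences.
original = "The tower was finished in 1899"
minimal = "The tower was finished in 1889"
fuller = "The tower was finished in 1889 after two years of construction"

print(word_edit_distance(original, minimal))  # 1
print(word_edit_distance(original, fuller))   # 6
```

Both revisions fix the same error, yet the metric scores the fuller one six times worse, which is one way intrinsic scores can diverge from human preference.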
Core Entities
Models
- GPT-4
- GPT-3.5-turbo
- ChatGPT
- LLaMA2-7B
- Instruction-LLaMA
- RoBERTa-large-mnli
- Perplexity.ai
Metrics
- FactScore
- F1
- BERTScore-F1
- edit distance
- SimCSE cosine
Datasets
- Factcheck-Bench (this paper, 94 pairs)
- dolly-15k
Benchmarks
- FELM
- FEVER
- HaluEval
Context Entities
Models
- RARR
- CoVe
- SelfCheckGPT
Metrics
- claim-level coverage
- stance labels (support/partial/refute/irrelevant)
Datasets
- FactPrompts
- HaluEval
- FRESHQA
Benchmarks
- FEVER
- CoVe corpus

