A claim-level, 8-step benchmark and toolset for measuring and fixing LLM factual errors

November 15, 2023 · 7 min

Overview

Production Readiness

0.35

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

3

Authors

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

Links

Abstract / PDF

Why It Matters For Business

Factcheck-Bench reveals where fact-checking pipelines fail and offers a reusable toolset to test retrieval, stance, verification, and edit steps, helping teams prioritize fixes that reduce real-world factual errors.

Summary TLDR

This paper introduces Factcheck-Bench: an 8-step annotation framework, a semi-automatic annotation tool, and a small open benchmark for evaluating end-to-end automatic fact-checkers on LLM outputs. The dataset contains 94 (question, response) pairs with 678 atomic claims (661 checkworthy) and 3,305 claim–evidence pairs. Experiments show that automatic retrieval often returns irrelevant evidence (≈62% of snippets), that about one-third of claims need extra manual search, and that even the best verifier struggles to spot false claims (best F1 = 0.63 on false claims, by Factcheck-GPT). The repo with the tool, data, and code is public.

Problem Statement

Automatic fact-checkers lack a unified, fine-grained way to evaluate the full detect-and-correct pipeline for long, open-domain LLM outputs. Existing evaluations are coarse (true/false) or task-specific and fail to expose which pipeline steps fail in practice.

Main Contribution

A fine-grained, eight-step annotation framework that decomposes LLM outputs into decontextualised claims and tracks evidence, stance, correction, and revision.

A claim-level benchmark and semi-automatic annotation tool with 94 annotated (question, response) pairs, 678 atomic claims, and 3,305 claim–evidence pairs.

Unit tests showing current automatic verifiers and retrievers leave large headroom: retrieval returns many irrelevant snippets and verifiers perform worse on false claims.
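To make the claim-level view concrete, here is a minimal sketch of how one annotated claim could be represented. The class and field names (`Claim`, `Evidence`, `needs_manual_search`) are hypothetical and are not taken from the paper's released tool; only the four stance labels come from this summary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stance(Enum):
    # Stance labels as listed in this summary; the string values are illustrative.
    SUPPORT = "support"
    PARTIAL = "partial"
    REFUTE = "refute"
    IRRELEVANT = "irrelevant"


@dataclass
class Evidence:
    snippet: str    # retrieved text
    stance: Stance  # stance of the snippet towards the claim


@dataclass
class Claim:
    text: str                         # decontextualised atomic claim
    checkworthy: bool = True
    evidence: List[Evidence] = field(default_factory=list)
    correction: Optional[str] = None  # corrected claim text, if one was needed

    def needs_manual_search(self) -> bool:
        # Hypothetical rule: no relevant snippet was retrieved automatically.
        return all(e.stance is Stance.IRRELEVANT for e in self.evidence)


claim = Claim(
    text="The Eiffel Tower is in Berlin.",
    evidence=[Evidence("The Eiffel Tower stands in Paris.", Stance.REFUTE)],
    correction="The Eiffel Tower is in Paris.",
)
print(claim.needs_manual_search())
```

Keeping evidence and stance attached to each claim is what lets a pipeline report retrieval, stance, and correction failures separately rather than as a single true/false verdict.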

Key Findings

Most retrieved evidence is irrelevant.

Numbers: 2057/3305 evidence pieces irrelevant (~62%)

One third of claims need manual search to decide veracity.

Numbers: 222/661 claims (~33.6%) require manual retrieval

False claims are harder to detect than true claims.

Numbers: Best F1 on false claims = 0.63 (Factcheck-GPT, Table 5)

Claim decomposition by ChatGPT largely aligns with humans.

Numbers: edit distance=0.11, word overlap=0.88 on 521 shared claims

Automatic revision metrics misalign with human preferences.

Numbers: Humans preferred 43 GPT-4 revisions vs 23 ChatGPT, despite intrinsic metrics favouring ChatGPT (Table 6)
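The decomposition-agreement figures above (edit distance and word overlap) can be approximated with a sketch like the following. This summary does not say how the paper normalises either metric, so both definitions here are assumptions: character-level Levenshtein distance divided by the longer string's length, and Jaccard overlap of lowercased word sets.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    # Assumption: normalise by the longer string so the result lies in [0, 1].
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def word_overlap(a: str, b: str) -> float:
    # Assumption: Jaccard overlap of lowercased, whitespace-split word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


human = "The Eiffel Tower is located in Paris."
model = "The Eiffel Tower is in Paris."
print(normalized_edit_distance(human, model), word_overlap(human, model))
```

A low edit distance together with a high word overlap indicates that the two decompositions produce nearly the same claim text.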

Results

Factcheck-GPT F1 (false claims)

Value: 0.63

Baseline: Perplexity.ai 0.53

Automatic evidence irrelevance

Value: 2057/3305

Claims needing manual retrieval

Value: 222/661

Claims needing correction

Value: 159/661

Decomposition agreement

Value: edit distance=0.11, word overlap=0.88

Human preference for revised responses

Value: GPT-4 preferred 43 / 66

Baseline: ChatGPT preferred 23 / 66
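As a quick sanity check, the raw counts reported above can be turned into percentages. The dictionary keys and the helper name `as_percent` are illustrative; the counts are the ones listed in this section.

```python
# Counts as reported in this summary: (numerator, denominator).
reported_counts = {
    "irrelevant evidence": (2057, 3305),
    "claims needing manual retrieval": (222, 661),
    "claims needing correction": (159, 661),
    "GPT-4 revisions preferred": (43, 66),
    "ChatGPT revisions preferred": (23, 66),
}


def as_percent(numerator: int, denominator: int) -> float:
    """Return the ratio as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


for name, (num, den) in reported_counts.items():
    print(f"{name}: {num}/{den} = {as_percent(num, den)}%")
```

The first two rows agree with the ~62% and ~33.6% figures quoted in the key findings.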

What To Try In 7 Days

Run your verifier on Factcheck-Bench and compare its F1 on false claims against the best reported score of 0.63.

Measure evidence relevance: compute the percentage of irrelevant snippets your retriever returns (the paper reports ~62%) and aim to halve it.

Switch to claim-level decomposition and test whether targeted edits reduce hallucinations.
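For the first item, a minimal sketch of a per-class F1 computation follows. It assumes gold and predicted verdicts are stored as booleans (`True` for a factually correct claim, `False` for a false claim); that encoding and the function name `f1_for_label` are assumptions, not the benchmark's actual data format.

```python
from typing import Sequence


def f1_for_label(gold: Sequence[bool], pred: Sequence[bool], label: bool = False) -> float:
    """F1 for one class; label=False scores detection of false claims."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not benchmark data.
gold = [True, False, False, True, False]
pred = [True, False, True, False, False]
print(f1_for_label(gold, pred, label=False))
```

Scoring the false class on its own matters because a verifier that labels nearly everything as true can still post a high overall accuracy on a dataset where most claims are true.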

Agent Features

Tool Use

  • Google Search for evidence
  • Sentence-BERT re-ranker
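The retrieve-then-re-rank pattern listed above can be sketched without external dependencies. Note the substitution: this example scores snippets with bag-of-words cosine similarity as a stand-in for Sentence-BERT embeddings, so it illustrates the control flow only, not the quality of the re-ranker used in the paper. The function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bow_vector(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(claim: str, snippets: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k snippets most similar to the claim."""
    query = bow_vector(claim)
    ranked = sorted(snippets, key=lambda s: cosine(query, bow_vector(s)), reverse=True)
    return ranked[:top_k]


snippets = [
    "bananas are yellow",
    "the capital of france is paris",
    "france borders spain",
]
print(rerank("paris is the capital of france", snippets, top_k=2))
```

In a real pipeline, the `bow_vector` and `cosine` pair would be replaced by embeddings from a sentence encoder, while the ranking logic stays the same.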

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Small dataset: only 94 annotated (question, response) pairs, which limits broad generalization.
  • Inter-claim dependencies and global logical errors are not well captured by claim-level edits.
  • Automatic retrieval supplied many irrelevant snippets, biasing verification steps.

When Not To Use

  • As a large-scale training corpus: the dataset is too small for model training.
  • To evaluate global procedural correctness or long-range logical order.
  • As a single-source proof of verifier performance across all domains.

Failure Modes

  • Poor retrieval leads to wrong verdicts despite a capable verifier.
  • Automatic decomposition can over-split or under-split edge cases, changing verifiability.
  • Intrinsic edit/semantic metrics can prefer lexically smaller edits that humans dislike.

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • ChatGPT
  • LLaMA2-7B
  • Instruction-LLaMA
  • RoBERTa-large-mnli
  • Perplexity.ai

Metrics

  • FactScore
  • F1
  • BERTScore-F1
  • edit distance
  • SimCSE cosine

Datasets

  • Factcheck-Bench (this paper, 94 pairs)
  • dolly-15k

Benchmarks

  • FELM
  • FEVER
  • HaluEval

Context Entities

Models

  • RARR
  • CoVe
  • SelfCheckGPT

Metrics

  • claim-level coverage
  • stance labels (support/partial/refute/irrelevant)

Datasets

  • FactPrompts
  • HaluEval
  • FRESHQA

Benchmarks

  • FEVER
  • CoVe corpus