OpenFactCheck: a plug-and-play toolkit and benchmark suite to build and compare automatic fact-checkers and to measure LLM factuality

May 9, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

2

Authors

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov


Why It Matters For Business

OpenFactCheck helps teams measure and reduce factual errors in LLM outputs with reusable pipelines and cost-aware comparisons, letting you pick tradeoffs between accuracy, speed, and price.

Summary TLDR

OpenFactCheck is an open-source framework that (1) lets users assemble custom fact-checking pipelines (claim processor + retriever + verifier), (2) bundles a factuality-focused question set (FactQA, 6,480 items) with an evaluation module for LLMs (LLMEVAL), and (3) provides CHECKEREVAL to benchmark automatic checkers against human-labeled datasets. Key takeaways: claims in open-domain LLM responses are mostly correct (89%–94% on the datasets tested), but LLMs still fail in specific modes (snowballing hallucinations, false premises, and fresh facts), and current automatic checkers struggle to detect false claims because retrieval is a bottleneck.
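The three-stage structure described above (claim processor, retriever, verifier) can be illustrated with plain Python callables. This is a minimal sketch, not OpenFactCheck's actual API: the type aliases, `run_pipeline`, the toy corpus, and the three stand-in components are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; these are NOT OpenFactCheck's real interfaces.
ClaimProcessor = Callable[[str], List[str]]   # response text -> atomic claims
Retriever = Callable[[str], List[str]]        # claim -> evidence passages
Verifier = Callable[[str, List[str]], bool]   # (claim, evidence) -> judged true?


@dataclass
class ClaimResult:
    claim: str
    evidence: List[str]
    is_true: bool


def run_pipeline(response: str, process: ClaimProcessor,
                 retrieve: Retriever, verify: Verifier) -> List[ClaimResult]:
    """Split a response into claims, fetch evidence for each, then verify it."""
    results = []
    for claim in process(response):
        evidence = retrieve(claim)
        results.append(ClaimResult(claim, evidence, verify(claim, evidence)))
    return results


# Toy stand-ins so the sketch runs offline (all hypothetical).
TOY_CORPUS = [
    "Paris is the capital of France",
    "Water boils at 100 C at sea level",
]


def split_sentences(text: str) -> List[str]:
    # Naive claim processor: one claim per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]


def keyword_retriever(claim: str) -> List[str]:
    # Return corpus entries sharing at least three words with the claim.
    words = set(claim.lower().split())
    return [d for d in TOY_CORPUS if len(words & set(d.lower().split())) >= 3]


def exact_match_verifier(claim: str, evidence: List[str]) -> bool:
    # A claim counts as true only if some evidence passage states it verbatim.
    return any(claim.lower() == d.lower() for d in evidence)


results = run_pipeline(
    "Paris is the capital of France. Paris is the capital of Spain.",
    split_sentences, keyword_retriever, exact_match_verifier,
)
for r in results:
    print(r.claim, "->", r.is_true)
```

Because each stage is just a callable, a cheaper retriever or a different verifier can be swapped in without touching the other two stages, which is the kind of tradeoff comparison the toolkit is meant to support.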

Problem Statement

LLM outputs are widely used but often contain factual errors. Evaluations are scattered across many datasets and metrics, making comparisons hard. We need a unified, extensible toolkit to (a) build customizable fact-checkers, (b) evaluate LLM factuality under the same criteria, and (c) measure how reliable automated checkers are against human labels.

Main Contribution

OpenFactCheck: a three-part open-source system comprising CUSTCHECKER (custom pipelines), LLMEVAL (unified LLM factuality evaluation), and CHECKEREVAL (fact-checker evaluator and leaderboard).

FactQA: a unified factuality-focused question collection of 6,480 examples drawn from seven specialized datasets to probe various factual failure modes.

Empirical analysis: evaluations of LLaMA-2 (7B, 13B) and GPT-4 plus multiple fact-checkers (FacTool, FactScore, Factcheck-GPT, Perplexity.ai) covering accuracy, latency, and cost.

Public demo, APIs, and a reproducible benchmark flow for researchers and practitioners.

Key Findings

Open-domain LLM responses are mostly factually correct on claim-level checks.

Numbers: 89%–94% true claims on FacTool-QA, FELM-WK, Factcheck-Bench

Some failure modes remain severe: snowballing hallucination yields very high error rates.

Numbers: Snowball errors >80% for LLaMA-2; 65.5% for GPT-4 on the Snowball subset

Automatic fact-checkers identify true claims more reliably than false ones; retrieval quality limits the detection of false claims.

Numbers: Factcheck-GPT F1 ≈ 0.79 on Factcheck-Bench; many systems show low recall on false claims

Automated evaluation cost and latency depend heavily on implementation choices.

Numbers: ≈ $0.02 per atomic claim (FacTool); ≈ $30 per 100 responses using GPT-3.5-Turbo, the cheapest option
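To see how a per-claim price scales to a batch of responses, a back-of-the-envelope estimate may help. The sketch below uses the $0.02 per atomic claim figure from above; the value of 15 claims per response is an illustrative assumption, not a number reported in the source, and the two cost figures quoted above may come from different configurations.

```python
def estimate_cost_usd(num_responses: int, claims_per_response: float,
                      cost_per_claim_usd: float) -> float:
    """Total checking cost if every atomic claim is verified once."""
    return num_responses * claims_per_response * cost_per_claim_usd


# 0.02 USD per claim is the FacTool figure quoted above.
# 15 claims per response is a hypothetical value chosen for illustration.
batch_cost = estimate_cost_usd(num_responses=100,
                               claims_per_response=15,
                               cost_per_claim_usd=0.02)
print(f"Estimated cost for 100 responses: ${batch_cost:.2f}")
```

Under these assumptions the cost grows linearly in both the number of responses and the number of claims extracted per response, so long-form outputs are disproportionately expensive to check.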

Results

True-claim rate (open-domain datasets)

Value: 89%–94%

Accuracy (Snowball subset)

Value: LLaMA-2 7B: 14.5%; 13B: 19.5%; GPT-4: 34.5%

Self-awareness (unanswerable detection) precision / recall (example)

Value: Precision ~70%; Recall ~21%–30% across models

Factchecker cost example

Value: $0.02 per atomic claim (FacTool); ~$30 per 100 responses (GPT-3.5-Turbo baseline)

Top automatic fact-checker F1 (selected)

Value: Factcheck-GPT F1 ≈ 0.79 on Factcheck-Bench
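The gap between detecting true and false claims only shows up when precision, recall, and F1 are computed per class rather than overall. The following is a minimal sketch of that per-class computation; the function name `class_metrics` and the ten toy labels are invented for illustration and are not taken from any benchmark.

```python
from typing import List, Tuple


def class_metrics(gold: List[bool], pred: List[bool],
                  target: bool) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `target` is treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels (hypothetical): 8 true claims, 2 false claims.
gold = [True] * 8 + [False, False]
# The checker misses one of the two false claims and calls it true.
pred = [True] * 8 + [True, False]

for label in (True, False):
    p, r, f = class_metrics(gold, pred, label)
    print(f"class={label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case overall accuracy is 90%, yet recall on the false class is only 0.5, which is the pattern the findings describe: a checker can look strong in aggregate while missing many false claims.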

Who Should Care

What To Try In 7 Days

Run LLMEVAL on your model responses to find failure modes (download FactQA, upload responses).

Assemble a CUSTCHECKER pipeline (cheap retriever + verifier) to test a subset of high-risk outputs.

Use CHECKEREVAL to compare your in-house checker against a web-based baseline, and prioritize retriever improvements if recall on false claims is low.

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation depends on the coverage and bias of integrated datasets; some specialized domains may be missing.
  • Experiments use a small set of LLMs (GPT-4, LLaMA-2 variants), so results may differ for other models.
  • Fact-checker accuracy heavily depends on retriever quality; invalid or missing evidence causes false negatives.
  • High-accuracy setups incur substantial latency and monetary costs.

When Not To Use

  • Real-time, ultra-low-latency production flows without budgets for external LLM/search calls.
  • As sole authority for high-stakes decisions without human review.

Failure Modes

  • Fails to flag false claims when the retriever returns no evidence or only irrelevant evidence.
  • Over-committing / snowballing: initial wrong outputs cause cascading errors in explanations.
  • Judge bias: LLM-based verifiers reflect their internal knowledge and prompt design.
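One way to make the first failure mode visible, rather than letting it pass silently, is to have the verifier abstain when no usable evidence comes back. The sketch below is an assumption about how such a guard could look, not a feature documented for OpenFactCheck; `Verdict`, `verify_with_abstention`, and the `judge` callable are hypothetical names.

```python
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_EVIDENCE = "not_enough_evidence"


def verify_with_abstention(claim: str, evidence: List[str],
                           judge: Callable[[str, List[str]], bool]) -> Verdict:
    """Abstain instead of guessing when the retriever returned nothing usable."""
    usable = [e for e in evidence if e.strip()]  # drop empty or whitespace-only passages
    if not usable:
        return Verdict.NOT_ENOUGH_EVIDENCE
    return Verdict.SUPPORTED if judge(claim, usable) else Verdict.REFUTED


def always_agree(claim: str, evidence: List[str]) -> bool:
    # Hypothetical judge that accepts every claim; stands in for an LLM verifier.
    return True


print(verify_with_abstention("Some claim", [], always_agree))
print(verify_with_abstention("Some claim", ["A passage"], always_agree))
```

This guard only covers the empty-evidence case. Irrelevant evidence would still reach the judge, so a relevance filter in front of the verifier would be needed to address that part of the failure mode.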

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • LLaMA-2 7B
  • LLaMA-2 13B
  • LLaMA3-8B

Metrics

  • Accuracy
  • Precision
  • Recall
  • F1
  • cost_usd
  • latency_hours

Datasets

  • FactQA (6,480)
  • Snowball
  • SelfAware
  • FreshQA
  • FacTool-QA
  • FELM-WK
  • Factcheck-Bench
  • FactScore-Bio
  • FactBench (FacTool-QA, FELM-WK, Factcheck-Bench, HaluEval)

Benchmarks

  • FactQA
  • FactBench

Context Entities

Models

  • gpt-4-turbo-2024-04-09
  • gpt-3.5-turbo-0125

Datasets

  • HaluEval
  • TruthfulQA
  • FACTOR
  • HELM