OpenFactCheck: a plug-and-play toolkit and benchmark suite to build and compare automatic fact-checkers and to measure LLM factuality

May 9, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Practical, engineering-focused system that integrates existing checkers and datasets; strong on reproducibility and tooling, moderate on novelty.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OpenFactCheck helps teams measure and reduce factual errors in LLM outputs with reusable pipelines and cost-aware comparisons, letting you pick tradeoffs between accuracy, speed, and price.

Who Should Care

Summary TLDR

OpenFactCheck is an open-source framework that (1) lets users assemble custom fact-checking pipelines (claim processor + retriever + verifier), (2) bundles a factuality-focused question set (FactQA, 6,480 items) and an evaluation module for LLMs (LLMEVAL), and (3) hosts CHECKEREVAL to benchmark automatic checkers against human-labeled datasets. Key takeaways: most claims in open-domain LLM responses are correct (89%–94% across the datasets tested), but LLMs still fail in specific modes (snowballing hallucinations, false premises, and fresh facts), and current automatic checkers struggle to detect false claims because retrieval is a bottleneck.
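The three-stage composition described above (claim processor, retriever, verifier) can be pictured with a minimal sketch. Every name here (`Pipeline`, `split_sentences`, `keyword_retriever`, `exact_match_verifier`, `KNOWLEDGE`) is hypothetical and chosen for illustration; this is not OpenFactCheck's actual API, and the toy components stand in for real LLM or search calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stage signatures, not OpenFactCheck's real interfaces.
ClaimProcessor = Callable[[str], List[str]]   # response -> atomic claims
Retriever = Callable[[str], List[str]]        # claim -> evidence passages
Verifier = Callable[[str, List[str]], bool]   # (claim, evidence) -> judged true?


@dataclass
class Pipeline:
    claim_processor: ClaimProcessor
    retriever: Retriever
    verifier: Verifier

    def check(self, response: str) -> List[Tuple[str, bool]]:
        """Return a (claim, verdict) pair for every claim in the response."""
        results = []
        for claim in self.claim_processor(response):
            evidence = self.retriever(claim)
            results.append((claim, self.verifier(claim, evidence)))
        return results


# Toy components so the sketch runs without any external service.
KNOWLEDGE = ["Paris is the capital of France"]


def split_sentences(response: str) -> List[str]:
    # Treat each sentence as one claim.
    return [s.strip() for s in response.split(".") if s.strip()]


def keyword_retriever(claim: str) -> List[str]:
    # Return every stored passage that shares at least one word with the claim.
    words = set(claim.lower().split())
    return [doc for doc in KNOWLEDGE if words & set(doc.lower().split())]


def exact_match_verifier(claim: str, evidence: List[str]) -> bool:
    # Accept a claim only if it appears verbatim in the evidence.
    return claim in evidence


pipeline = Pipeline(split_sentences, keyword_retriever, exact_match_verifier)
verdicts = pipeline.check(
    "Paris is the capital of France. Paris is the capital of Spain."
)
```

Because each stage is just a callable, swapping in a cheaper retriever or a stronger verifier changes one argument and leaves the rest of the pipeline untouched, which is the kind of mix-and-match the toolkit is described as supporting.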

Problem Statement

LLM outputs are widely used but often contain factual errors. Evaluations are scattered across many datasets and metrics, making comparisons hard. We need a unified, extensible toolkit to (a) build customizable fact-checkers, (b) evaluate LLM factuality under the same criteria, and (c) measure how reliable automated checkers are against human labels.

Main Contribution

OpenFactCheck: a three-part open-source system — CUSTCHECKER (custom pipelines), LLMEVAL (unified LLM factuality evaluation), CHECKEREVAL (fact-checker evaluator and leaderboard).

FactQA: a unified factuality-focused question collection of 6,480 examples drawn from seven specialized datasets to probe various factual failure modes.

Key Findings

Open-domain LLM responses are mostly factually correct on claim-level checks.

Numbers: 89%–94% true claims on FacTool-QA, FELM-WK, Factcheck-Bench

Practical Use: Expect that under broad free-form QA tasks, most generated atomic claims will be correct, but you still must test for the remaining error tail.

Evidence Ref: Figure 2

Some failure modes remain severe: snowballing hallucination yields very high error rates.

Numbers: Snowball errors >80% for LLaMA-2; 65.5% for GPT-4 on Snowball subset

Practical Use: Don't trust early one-shot answers in chains of generation; add checks that re-evaluate or verify initial assertions.

Evidence Ref: §4.1 and Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
True-claim rate (open-domain datasets) | 89%–94% | - | - | FacTool-QA, FELM-WK, Factcheck-Bench (Figure 2) | Majority of claims verified as true by FacTool evaluations | Figure 2
Accuracy | LLaMA-2 7B: 14.5%; 13B: 19.5%; GPT-4: 34.5% | - | - | Snowball (Table 4) | Low accuracy on snowballing hallucination tasks | Table 4
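The Snowball error rates quoted under Key Findings follow from these accuracies. A small worked check, assuming every answer that is not counted as accurate counts as an error (so error = 100% minus accuracy):

```python
# Accuracy on the Snowball subset, as listed in the results (Table 4).
snowball_accuracy = {"LLaMA-2 7B": 14.5, "LLaMA-2 13B": 19.5, "GPT-4": 34.5}

# Assumption: error rate is the complement of accuracy, in percent.
snowball_error = {
    model: round(100.0 - acc, 1) for model, acc in snowball_accuracy.items()
}

for model, err in snowball_error.items():
    print(f"{model}: {err}% errors")
```

This gives 85.5% and 80.5% for the two LLaMA-2 models (both above 80%) and 65.5% for GPT-4, matching the figures in the finding.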

What To Try In 7 Days

Run LLMEVAL on your model responses to find failure modes (download FactQA, upload responses).

Assemble a CUSTCHECKER pipeline (cheap retriever + verifier) to test a subset of high-risk outputs.

Use CHECKEREVAL to compare your in-house checker against a web-based baseline, and prioritize retriever improvements if recall on false claims is low.
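To make the last step concrete, the comparison against human labels comes down to precision, recall, and F1 with "false claim" as the positive class. The sketch below is an illustration only: `false_claim_metrics` is a hypothetical helper, not part of CHECKEREVAL, and the labels are made up.

```python
from typing import List, Tuple


def false_claim_metrics(
    human: List[bool], checker: List[bool]
) -> Tuple[float, float, float]:
    """Precision, recall and F1 with "false claim" as the positive class.

    A label is True when the claim is judged true, False when judged false.
    """
    if len(human) != len(checker):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human, checker))
    tp = sum(1 for h, c in pairs if not h and not c)  # false claim, flagged
    fp = sum(1 for h, c in pairs if h and not c)      # true claim, wrongly flagged
    fn = sum(1 for h, c in pairs if not h and c)      # false claim, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration (not taken from any dataset).
human_labels = [True, True, False, False, False, True]
checker_labels = [True, False, False, True, True, True]
precision, recall, f1 = false_claim_metrics(human_labels, checker_labels)
```

In this toy case the checker catches only one of three false claims, so recall on false claims is 1/3. That is the pattern the summary associates with a retrieval bottleneck, and the signal to look at the retriever first.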

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Evaluation depends on the coverage and bias of integrated datasets; some specialized domains may be missing.

Experiments use a small set of LLMs (GPT-4, LLaMA-2 variants) — results may change with other models.

When Not To Use

Real-time, ultra-low-latency production flows without budgets for external LLM/search calls.

As sole authority for high-stakes decisions without human review.

Failure Modes

Fails to flag false claims when the retriever returns no evidence or only irrelevant evidence.

Over-committing / snowballing: initial wrong outputs cause cascading errors in explanations.
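One way to soften the first failure mode is to have the verifier abstain when retrieval comes back empty, rather than letting the claim pass by default. The sketch below is an assumption about how such a guard could look, not a description of how OpenFactCheck behaves; `Verdict`, `guarded_verify`, and `toy_verify` are hypothetical names. It covers only the empty-evidence case; detecting irrelevant evidence would need a separate relevance check.

```python
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_EVIDENCE = "not_enough_evidence"


def guarded_verify(
    claim: str,
    evidence: List[str],
    verify: Callable[[str, List[str]], bool],
) -> Verdict:
    """Abstain instead of passing a claim when retrieval returns nothing usable."""
    usable = [e for e in evidence if e.strip()]  # drop empty passages
    if not usable:
        return Verdict.NOT_ENOUGH_EVIDENCE
    return Verdict.SUPPORTED if verify(claim, usable) else Verdict.REFUTED


def toy_verify(claim: str, evidence: List[str]) -> bool:
    # Stand-in verifier: supported if the claim text appears in any passage.
    return any(claim.lower() in e.lower() for e in evidence)


no_evidence = guarded_verify("The moon is made of cheese", [], toy_verify)
with_evidence = guarded_verify(
    "water boils at 100 C",
    ["At sea level, water boils at 100 C."],
    toy_verify,
)
```

Claims that come back as `NOT_ENOUGH_EVIDENCE` can then be routed to a second retriever or to human review instead of being silently accepted.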

Core Entities

Models

GPT-4, GPT-3.5-Turbo, LLaMA-2 7B, LLaMA-2 13B, LLaMA3-8B

Metrics

Accuracy, precision, recall, F1, cost_usd, latency_hours

Datasets

FactQA (6,480), Snowball, SelfAware, FreshQA, FacTool-QA, FELM-WK, Factcheck-Bench, FactScore-Bio, FactBench (FacTool-QA, FELM-WK, Factcheck-Bench, HaluEval)

Benchmarks

FactQA, FactBench

Context Entities

Models

gpt-4-turbo-2024-04-09, gpt-3.5-turbo-0125

Datasets

HaluEval, TruthfulQA, FACTOR, HELM