Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
OpenFactCheck helps teams measure and reduce factual errors in LLM outputs with reusable pipelines and cost-aware comparisons, letting you pick tradeoffs between accuracy, speed, and price.
Summary TLDR
OpenFactCheck is an open-source framework that (1) lets users assemble custom fact-checking pipelines (claim processor + retriever + verifier), (2) bundles a factuality-focused question set (FactQA, 6,480 items) and an evaluation module for LLMs (LLMEVAL), and (3) provides CHECKEREVAL to benchmark automatic checkers against human-labeled datasets. Key takeaways: more than 90% of the claims LLMs make in open-domain responses are correct, but LLMs still fail in specific modes (snowballing hallucinations, false premises, and fresh facts), and current automatic checkers struggle to detect false claims because retrieval is the bottleneck.
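The three-stage pipeline shape (claim processor + retriever + verifier) can be illustrated with a minimal sketch. Every name here (`Pipeline`, `split_sentences`, `keyword_retriever`, `exact_match_verifier`, `KNOWLEDGE`) is hypothetical and is not the OpenFactCheck API; the toy components only show how the stages plug together.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stage signatures; NOT the OpenFactCheck API.
ClaimProcessor = Callable[[str], List[str]]   # response -> claims
Retriever = Callable[[str], List[str]]        # claim -> evidence passages
Verifier = Callable[[str, List[str]], bool]   # (claim, evidence) -> judged true?


@dataclass
class Pipeline:
    claim_processor: ClaimProcessor
    retriever: Retriever
    verifier: Verifier

    def check(self, response: str) -> List[Tuple[str, bool]]:
        """Run every claim in the response through retrieval and verification."""
        results = []
        for claim in self.claim_processor(response):
            evidence = self.retriever(claim)
            results.append((claim, self.verifier(claim, evidence)))
        return results


# Toy in-memory evidence store (illustrative data only).
KNOWLEDGE = [
    "Paris is the capital of France",
    "Water boils at 100 C at sea level",
]


def split_sentences(text: str) -> List[str]:
    # Naive claim processor: one claim per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]


def keyword_retriever(claim: str) -> List[str]:
    # Return passages sharing at least three words with the claim.
    words = set(claim.lower().split())
    return [d for d in KNOWLEDGE if len(words & set(d.lower().split())) >= 3]


def exact_match_verifier(claim: str, evidence: List[str]) -> bool:
    # Toy verifier: a claim counts as true only if a passage states it verbatim.
    return claim in evidence


pipeline = Pipeline(split_sentences, keyword_retriever, exact_match_verifier)
results = pipeline.check(
    "Paris is the capital of France. Paris is the capital of Spain."
)
for claim, verdict in results:
    print(f"{verdict!s:5}  {claim}")
```

Because each stage is just a callable, a cheaper retriever or a stronger verifier can be swapped in without touching the other two, which is the kind of accuracy, speed, and price tradeoff the framework is meant to expose.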
Problem Statement
LLM outputs are widely used but often contain factual errors. Evaluations are scattered across many datasets and metrics, making comparisons hard. We need a unified, extensible toolkit to (a) build customizable fact-checkers, (b) evaluate LLM factuality under the same criteria, and (c) measure how reliable automated checkers are against human labels.
Main Contribution
OpenFactCheck: a three-part open-source system — CUSTCHECKER (custom pipelines), LLMEVAL (unified LLM factuality evaluation), CHECKEREVAL (fact-checker evaluator and leaderboard).
FactQA: a unified factuality-focused question collection of 6,480 examples drawn from seven specialized datasets to probe various factual failure modes.
Empirical analysis: evaluations of LLaMA-2 (7B, 13B) and GPT-4 plus multiple fact-checkers (FacTool, FactScore, Factcheck-GPT, Perplexity.ai) covering accuracy, latency, and cost.
Public demo, APIs, and a reproducible benchmark flow for researchers and practitioners.
Key Findings
Open-domain LLM responses are mostly factually correct on claim-level checks.
Some failure modes remain severe: snowballing hallucination yields very high error rates.
Automatic fact-checkers identify true claims more reliably than false ones; weak evidence retrieval limits the detection of false claims.
Automated evaluation cost and latency depend heavily on implementation choices.
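The gap between true-claim and false-claim detection is easiest to see when precision, recall, and F1 are computed separately for each class. The sketch below assumes boolean gold and predicted labels; the helper name `prf` and the ten labels are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple


def prf(gold: List[bool], pred: List[bool], positive: bool) -> Tuple[float, float, float]:
    """Precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels only: 8 true claims followed by 2 false claims.
gold = [True] * 8 + [False] * 2
# Hypothetical checker output: all true claims accepted, one false claim missed.
pred = [True] * 8 + [True, False]

for label in (True, False):
    p, r, f = prf(gold, pred, positive=label)
    print(f"class={label!s:5}  precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")
```

On this toy data the checker reaches recall 1.00 on true claims but only 0.50 on false claims, even though overall accuracy is 90%. Reporting per-class scores keeps that weakness visible when true claims dominate the data.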
Results
True-claim rate (open-domain datasets)
Accuracy
Self-awareness (unanswerable detection) precision / recall (example)
Factchecker cost example
Top automatic fact-checker F1 (selected)
Who Should Care
What To Try In 7 Days
Run LLMEVAL on your model responses to find failure modes (download FactQA, upload responses).
Assemble a CUSTCHECKER pipeline (cheap retriever + verifier) to test a subset of high-risk outputs.
Use CHECKEREVAL to compare your in-house checker against a web-based baseline, and prioritize retriever improvements if recall on false claims is low.
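One way to decide whether the retriever deserves attention first is to check how many missed false claims had no evidence at all. The sketch below is an assumption about how such a diagnosis could look: the record fields (`gold`, `pred`, `evidence`) and the function `diagnose` are hypothetical, not CHECKEREVAL output, and the records are invented.

```python
from typing import Dict, List


def diagnose(records: List[Dict]) -> Dict[str, float]:
    """Recall on false claims, plus the share of misses that had no evidence."""
    false_claims = [r for r in records if not r["gold"]]
    # A miss is a false claim that the checker judged to be true.
    missed = [r for r in false_claims if r["pred"]]
    missed_no_evidence = [r for r in missed if not r["evidence"]]
    false_recall = 1 - len(missed) / len(false_claims) if false_claims else 0.0
    share = len(missed_no_evidence) / len(missed) if missed else 0.0
    return {"false_recall": false_recall, "missed_without_evidence": share}


# Invented records: gold label, checker prediction, retrieved evidence.
records = [
    {"gold": True, "pred": True, "evidence": ["doc a"]},
    {"gold": True, "pred": True, "evidence": ["doc b"]},
    {"gold": False, "pred": False, "evidence": ["doc c"]},  # caught
    {"gold": False, "pred": True, "evidence": []},          # missed, no evidence
    {"gold": False, "pred": True, "evidence": []},          # missed, no evidence
    {"gold": False, "pred": True, "evidence": ["doc d"]},   # missed despite evidence
]

report = diagnose(records)
print(report)
```

If most misses come with empty evidence, as in this toy data, the retriever is the more likely culprit; if misses mostly have evidence attached, the verifier or its prompt is the better place to look.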
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Evaluation depends on the coverage and bias of integrated datasets; some specialized domains may be missing.
- Experiments cover a small set of LLMs (GPT-4, LLaMA-2 variants), so results may differ for other models.
- Fact-checker accuracy heavily depends on retriever quality; invalid or missing evidence causes false negatives.
- High-accuracy setups incur substantial latency and monetary costs.
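A rough budget estimate helps before committing to a high-accuracy setup. The sketch below assumes cost scales with the number of LLM and search calls per claim and that latency scales with per-claim time divided across parallel workers. All prices, call counts, and timings are placeholders rather than measurements from the paper; `CheckerProfile` and `estimate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CheckerProfile:
    name: str
    llm_calls_per_claim: int
    usd_per_llm_call: float
    search_calls_per_claim: int
    usd_per_search_call: float
    seconds_per_claim: float


def estimate(profile: CheckerProfile, n_claims: int, workers: int = 1) -> Dict[str, float]:
    """Estimate total cost (USD) and wall-clock latency (hours) for n_claims."""
    per_claim_usd = (
        profile.llm_calls_per_claim * profile.usd_per_llm_call
        + profile.search_calls_per_claim * profile.usd_per_search_call
    )
    cost_usd = n_claims * per_claim_usd
    latency_hours = n_claims * profile.seconds_per_claim / workers / 3600
    return {"cost_usd": round(cost_usd, 2), "latency_hours": round(latency_hours, 2)}


# Placeholder profiles; every number here is an assumption.
cheap = CheckerProfile("cheap", 1, 0.001, 1, 0.001, 2.0)
thorough = CheckerProfile("thorough", 3, 0.01, 5, 0.001, 18.0)

for profile in (cheap, thorough):
    print(profile.name, estimate(profile, n_claims=1000))
```

Plugging in measured per-call prices and timings for each candidate checker gives comparable `cost_usd` and `latency_hours` figures before any large evaluation run.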
When Not To Use
- Real-time, ultra-low-latency production flows without budgets for external LLM/search calls.
- As sole authority for high-stakes decisions without human review.
Failure Modes
- Fails to flag false claims when retriever returns no or irrelevant evidence.
- Over-committing / snowballing: an initial wrong answer leads the model to produce further errors in its follow-up explanations.
- Judge bias: LLM-based verifiers reflect their internal knowledge and prompt design.
Core Entities
Models
- GPT-4
- GPT-3.5-Turbo
- LLaMA-2 7B
- LLaMA-2 13B
- LLaMA3-8B
Metrics
- Accuracy
- Precision
- Recall
- F1
- cost_usd
- latency_hours
Datasets
- FactQA (6,480)
- Snowball
- SelfAware
- FreshQA
- FacTool-QA
- FELM-WK
- Factcheck-Bench
- FactScore-Bio
- FactBench (FacTool-QA, FELM-WK, Factcheck-Bench, HaluEval)
Benchmarks
- FactQA
- FactBench
Context Entities
Models
- gpt-4-turbo-2024-04-09
- gpt-3.5-turbo-0125
Datasets
- HaluEval
- TruthfulQA
- FACTOR
- HELM

