Overview
FELM is a useful, trustworthy benchmark for testing factuality detectors across domains; expect measurable retrieval gains but not turnkey production detectors.
Citations: 12
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 30%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Automated factuality checks are needed: about one in three long ChatGPT responses in this benchmark (33.3%) contains an error, and current LLM-only detectors miss many of those errors. Businesses should add retrieval and human oversight before trusting model outputs.
Who Should Care
Summary TLDR
FELM is a new, human‑annotated benchmark for evaluating factuality detectors on long LLM outputs. It contains segment‑level labels, error types, reasons, and reference links across five domains (world knowledge, science/tech, math, reasoning, writing/recommendation). Experiments show off‑the‑shelf LLMs (ChatGPT, GPT-4, Vicuna) struggle to reliably detect errors; retrieval helps (≈5–6 F1 points), but overall detection remains far from production‑ready.
Problem Statement
Current factuality benchmarks focus on specific tasks (e.g., summarization) or domains (Wikipedia). We lack a meta‑evaluation dataset that (1) contains authentic LLM errors, (2) covers diverse domains beyond world knowledge, and (3) provides fine‑grained segment labels plus references so we can measure and improve factuality evaluators.
Main Contribution
FELM dataset: human‑annotated, segment‑level factuality labels with error types, reasons, and reference links.
Domain breadth: covers five domains—world knowledge, science/tech, math, reasoning, and writing/recommendation.
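To make the annotation structure concrete, here is a minimal sketch of how one annotated response might be represented in code. The class and field names (`Segment`, `AnnotatedResponse`, `is_factual`, and so on) are hypothetical and do not reflect the released file schema; the rule that a single non-factual segment makes the whole response erroneous is also an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    # All field names here are illustrative, not the dataset's actual keys.
    text: str
    is_factual: bool                    # human label for this segment
    error_type: Optional[str] = None    # set only when is_factual is False
    reason: Optional[str] = None        # annotator's explanation
    references: List[str] = field(default_factory=list)  # supporting links


@dataclass
class AnnotatedResponse:
    domain: str
    prompt: str
    segments: List[Segment]

    def has_error(self) -> bool:
        # Assumed aggregation rule: one bad segment makes the response wrong.
        return any(not s.is_factual for s in self.segments)


def response_error_rate(responses: List[AnnotatedResponse]) -> float:
    """Fraction of responses containing at least one non-factual segment."""
    if not responses:
        return 0.0
    return sum(r.has_error() for r in responses) / len(responses)
```

Keeping labels at the segment level and deriving the response-level flag from them means both granularities can be reported from the same annotations.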
Key Findings
FELM covers five realistic domains and contains thousands of fine‑grained segments.
About one third (33.3%) of ChatGPT responses in FELM contain at least one factual error.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Response-level error rate | 33.3% | — | — | FELM (all domains) | Table 2: Error rate | §3.3 Table 2 |
| Accuracy | 67.1% | Vanilla GPT-4 (segment-level) 60.7% | +6.4 pts | FELM (overall) | Table 10 | Table 10 |
What To Try In 7 Days
Run FELM's public dataset on your detector to get a domain‑diverse baseline.
Add simple retrieval (BM25 + source links) to your fact‑checker and measure F1 lift.
Switch long responses to segment/claim checking to localize and present errors to users.
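The retrieval and measurement steps above could be prototyped along the following lines. This is a sketch under stated assumptions, not the benchmark's own evaluation code: `bm25_scores` is a plain-Python Okapi BM25 scorer with commonly used default parameters (`k1=1.5`, `b=0.75`), `error_f1` scores the "contains an error" class over per-segment predictions, and both function names are hypothetical. Tokenization is left to the caller.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: List[Sequence[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each tokenized doc against a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue  # term absent from this doc contributes nothing
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def error_f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """F1 on the error class; True marks a segment flagged as erroneous."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

One way to use this: for each segment, rank candidate reference passages with `bm25_scores`, pass the top passages to the detector as evidence, then compare `error_f1` with and without retrieval on the same gold labels to estimate the lift.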
Risks & Boundaries
Limitations
FELM responses were generated only by ChatGPT, so detectors may perform differently on outputs from other LLMs.
Dataset size is modest (hundreds of samples per domain) due to costly expert annotation.
When Not To Use
As the sole validation for detectors intended to check code generation outputs.
To claim production readiness of an LLM‑only factuality detector without retrieval or human review.
Failure Modes
Judge bias: models find it hard to detect errors they themselves produced (self‑detection gap).
Long responses: sparse errors in long outputs are harder to find, especially in the writing/recommendation domain.

