Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.3
Citation Count
12
Why It Matters For Business
Automated factuality checks are needed: about one in three long ChatGPT responses in this benchmark contains a factual error, and current LLM-only detectors miss many of them. Businesses should add retrieval and human oversight before trusting model outputs.
Summary TLDR
FELM is a human‑annotated benchmark for evaluating factuality detectors on long LLM outputs. It contains segment‑level labels, error types, reasons, and reference links across five domains (world knowledge, science/tech, math, reasoning, writing/recommendation). Experiments show that off‑the‑shelf LLMs (ChatGPT, GPT-4, Vicuna-33B) struggle to detect errors reliably; retrieval helps (≈5–6 F1 points), but overall detection remains far from production‑ready.
Problem Statement
Current factuality benchmarks focus on specific tasks (e.g., summarization) or specific domains (e.g., Wikipedia). What is missing is a meta‑evaluation dataset that (1) contains authentic LLM errors, (2) covers diverse domains beyond world knowledge, and (3) provides fine‑grained segment labels plus references, so that factuality evaluators can be measured and improved.
Main Contribution
FELM dataset: human‑annotated, segment‑level factuality labels with error types, reasons, and reference links.
Domain breadth: covers five domains—world knowledge, science/tech, math, reasoning, and writing/recommendation.
Evaluation suite: benchmarks LLM evaluators (Vicuna-33B, ChatGPT, GPT-4) under vanilla, chain‑of‑thought, link‑augmented, and doc‑augmented settings.
Findings: retrieval/document augmentation consistently improves detection; still, even GPT‑4 is far from reliable across all domains.
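The four evaluator settings named above differ mainly in what context the judge model sees. A minimal sketch of how such prompts might be assembled follows. The function name `build_prompt`, the setting keys, and all prompt wording are assumptions made for illustration and are not the prompts used to build or evaluate FELM.

```python
from typing import Optional, Sequence

# Hypothetical setting names mirroring the four evaluation settings.
SETTINGS = ("vanilla", "cot", "link", "doc")


def build_prompt(
    question: str,
    segments: Sequence[str],
    setting: str = "vanilla",
    links: Optional[Sequence[str]] = None,
    docs: Optional[Sequence[str]] = None,
) -> str:
    """Assemble an illustrative segment-level factuality prompt."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")

    lines = [f"Question: {question}", "Response segments:"]
    # Number the segments so the judge can refer to them individually.
    lines += [f"{i}. {seg}" for i, seg in enumerate(segments, start=1)]

    # Link-augmented: show only the reference URLs.
    if setting == "link" and links:
        lines.append("Reference links:")
        lines += [f"- {url}" for url in links]

    # Doc-augmented: show retrieved reference text.
    if setting == "doc" and docs:
        lines.append("Reference documents:")
        lines += [f"- {doc}" for doc in docs]

    # Chain-of-thought: ask for reasoning before the verdict.
    if setting == "cot":
        lines.append(
            "Think step by step, then list the numbers of the segments "
            "that contain factual errors, or 'none'."
        )
    else:
        lines.append(
            "List the numbers of the segments that contain factual "
            "errors, or 'none'."
        )
    return "\n".join(lines)
```

Numbering the segments keeps the judge's answer easy to map back to segment-level labels, which is what the benchmark scores.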
Key Findings
FELM covers five realistic domains and contains thousands of fine‑grained segments.
A large share of ChatGPT outputs contain factual errors.
Human labels are consistent and references are reliable.
Retrieval/document augmentation improves factuality detection.
Off‑the‑shelf LLM evaluators still perform poorly overall.
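Segment-level labels can be rolled up into a response-level error rate. The sketch below assumes a response counts as erroneous if at least one of its segments is flagged; that aggregation rule and the toy flags are illustrative assumptions, not values from FELM.

```python
from typing import Sequence


def response_has_error(segment_flags: Sequence[bool]) -> bool:
    """True if any segment in the response is flagged as a factual error."""
    return any(segment_flags)


def response_error_rate(responses: Sequence[Sequence[bool]]) -> float:
    """Fraction of responses that contain at least one flagged segment."""
    if not responses:
        return 0.0
    flagged = sum(response_has_error(flags) for flags in responses)
    return flagged / len(responses)


# Toy data: each inner list holds per-segment error flags for one response.
toy_responses = [
    [False, False],        # clean response
    [False, True, False],  # one erroneous segment
    [False],               # clean response
]
rate = response_error_rate(toy_responses)
```

Because one bad segment marks the whole response, long responses are more likely to count as erroneous even when most of their segments are correct.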
Results
Response-level error rate
Accuracy
Segment-level F1 increase from retrieval (ChatGPT)
Annotator agreement
Who Should Care
What To Try In 7 Days
Run your detector on FELM's public dataset to get a domain‑diverse baseline.
Add simple retrieval (BM25 + source links) to your fact‑checker and measure F1 lift.
Switch long responses to segment/claim checking to localize and present errors to users.
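For the retrieval step, a small BM25 ranker is enough to prototype evidence lookup per segment. The following is a minimal pure-Python sketch of Okapi BM25 with common default parameters (`k1=1.5`, `b=0.75`); the class name, tokenizer, and toy corpus are assumptions for illustration, and a production system would more likely use an existing search library.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: Sequence[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        n = len(docs)
        self.avg_len = sum(self.lengths) / n if n else 0.0
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        # Smoothed IDF that stays positive for every term.
        self.idf = {
            t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()
        }

    def score(self, query: str, i: int) -> float:
        tf, length = self.tfs[i], self.lengths[i]
        rel_len = length / self.avg_len if self.avg_len else 0.0
        norm = self.k1 * (1 - self.b + self.b * rel_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def top_k(self, query: str, k: int = 1) -> List[int]:
        """Indices of the k highest-scoring documents for the query."""
        order = sorted(
            range(len(self.tfs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:k]


# Toy evidence corpus and one response segment to check.
corpus = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python is a programming language created by Guido van Rossum.",
]
segment = "The Eiffel Tower was finished in 1925."
best = BM25(corpus).top_k(segment, k=1)
```

Querying with one segment at a time, rather than the whole response, keeps the retrieved evidence focused on the claim being checked.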
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- FELM responses were generated only by ChatGPT, so detectors may perform differently on outputs from other LLMs.
- Dataset size is modest (hundreds of samples per domain) due to costly expert annotation.
- Code generation and other application scenarios were not included and should be added later.
When Not To Use
- As the sole validation for detectors intended to check code generation outputs.
- To claim production readiness of an LLM‑only factuality detector without retrieval or human review.
- For domains not represented in FELM (e.g., multimodal or long‑running dialogue states).
Failure Modes
- Judge bias: models find it hard to detect errors they themselves produced (self‑detection gap).
- Long responses: sparse errors in long outputs are harder to find, especially in the writing/recommendation domain.
- Claim extraction limits: math and multi‑step reasoning samples resist atomic claim extraction, breaking claim‑based checks.
Core Entities
Models
- ChatGPT
- GPT-4
- Vicuna-33B
Metrics
- F1
- Precision
- Recall
- Accuracy
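To make the segment-level metrics concrete, the sketch below computes precision, recall, and F1 with "segment contains a factual error" treated as the positive class. That choice of positive class and the toy labels are assumptions for illustration; check the benchmark's own evaluation script before comparing numbers.

```python
from typing import Sequence, Tuple


def error_prf1(
    gold: Sequence[bool], pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, F1 where True means 'segment has an error'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for five segments: the detector finds one of two real errors
# and raises one false alarm.
gold_flags = [True, False, True, False, False]
pred_flags = [True, True, False, False, False]
precision, recall, f1 = error_prf1(gold_flags, pred_flags)
```

Scoring the error class matters because errors are the minority: a detector that labels every segment as correct would get high plain accuracy but zero recall here.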
Datasets
- FELM
- GSM8K
- MATH
- MMLU
- TruthfulQA
Benchmarks
- FELM
- HaluEval
- FEVER
- SummEval
Context Entities
Models
- text-davinci-003
Datasets
- FEVER
- FactCC
- QAGS

