FELM: a fine‑grained benchmark that tests factuality detectors across five domains

Overview

Decision Snapshot: Needs Validation

FELM is a useful, trustworthy benchmark for testing factuality detectors across domains; expect measurable retrieval gains but not turnkey production detectors.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 40%

Authors

Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, Junxian He

Links

Abstract / PDF / Data

Why It Matters For Business

Automated factuality checks are needed: one in three long ChatGPT responses in this benchmark contains an error, and current LLM-only detectors miss many mistakes—so businesses should add retrieval and human oversight before trusting model outputs.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

FELM is a new, human‑annotated benchmark for evaluating factuality detectors on long LLM outputs. It contains segment‑level labels, error types, reasons, and reference links across five domains (world knowledge, science/tech, math, reasoning, writing/recommendation). Experiments show off‑the‑shelf LLMs (ChatGPT, GPT-4, Vicuna) struggle to reliably detect errors; retrieval helps (≈5–6 F1 points), but overall detection remains far from production‑ready.

Problem Statement

Current factuality benchmarks focus on specific tasks (e.g., summarization) or domains (Wikipedia). We lack a meta‑evaluation dataset that (1) contains authentic LLM errors, (2) covers diverse domains beyond world knowledge, and (3) provides fine‑grained segment labels plus references so we can measure and improve factuality evaluators.

Main Contribution

FELM dataset: human‑annotated, segment‑level factuality labels with error types, reasons, and reference links.

Domain breadth: covers five domains—world knowledge, science/tech, math, reasoning, and writing/recommendation.

Key Findings

FELM covers five realistic domains and contains thousands of fine‑grained segments.

Numbers: 847 samples, 4,425 segments; avg response 89.1 tokens

Practical Use: Use FELM to test detectors on varied, real LLM errors rather than only Wikipedia or summaries.

Evidence Ref: Table 2 (§3.3)

A large share of ChatGPT outputs contain factual errors.

Numbers: Response‑level error rate 33.3%

Practical Use: Expect about one in three long LLM responses to include at least one factual error in similar zero‑shot settings.

Evidence Ref: Table 2 (§3.3)
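
The response-level figure follows from the segment labels: a response counts as erroneous when at least one of its segments is labeled incorrect. A minimal sketch of that aggregation, assuming each response is represented as a list of booleans (True = segment judged factually correct). This layout and the function name are illustrative assumptions, not FELM's actual schema, and the toy data is not taken from the benchmark.

```python
from typing import List


def response_error_rate(samples: List[List[bool]]) -> float:
    """Fraction of responses that contain at least one incorrect segment.

    `samples` holds one list per response; each entry is True when the
    segment was judged factually correct (hypothetical layout).
    """
    if not samples:
        raise ValueError("need at least one response")
    # A response is erroneous as soon as any single segment is incorrect.
    erroneous = sum(1 for segments in samples if not all(segments))
    return erroneous / len(samples)


# Toy data (not from FELM): three responses, one of which contains an error.
toy = [[True, True, True], [True, False, True, True], [True, True]]
print(f"{response_error_rate(toy):.1%}")
```

Keeping the segment labels rather than only the response-level flag also makes it possible to report where in a response the error sits.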

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Response-level error rate	33.3%	—	—	FELM (all domains)	Table 2: Error rate	§3.3 Table 2
Accuracy	67.1%	Vanilla GPT-4 seg 60.7%	+6.4 pts	FELM (overall)	Table 10	Table 10
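
The delta in the Accuracy row is the plain difference between the two percentages (67.1 minus 60.7 gives 6.4 points). To score your own detector against segment labels in the same spirit, the sketch below computes precision, recall, F1 and plain accuracy while treating "segment contains an error" as the positive class. The choice of positive class, the use of plain (unbalanced) accuracy, and all names here are assumptions for illustration; this is not the paper's evaluation script.

```python
from typing import Dict, Sequence


def detection_metrics(gold_error: Sequence[bool],
                      pred_error: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall and F1 for the error class, plus plain accuracy."""
    if len(gold_error) != len(pred_error) or len(gold_error) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(gold_error, pred_error))
    tp = sum(1 for g, p in pairs if g and p)          # errors correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)      # correct segments flagged
    fn = sum(1 for g, p in pairs if g and not p)      # errors missed
    tn = sum(1 for g, p in pairs if not g and not p)  # correct segments passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def delta_points(value_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value_pct - baseline_pct, 1)


# Toy labels (not from FELM): True means "this segment contains an error".
gold = [True, False, True, False]
pred = [True, True, False, False]
print(detection_metrics(gold, pred))
print(delta_points(67.1, 60.7))
```

Reporting F1 on the error class alongside accuracy matters here because errors are the minority class, so a detector that flags nothing can still look accurate.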

What To Try In 7 Days

Run FELM's public dataset on your detector to get a domain‑diverse baseline.

Add simple retrieval (BM25 + source links) to your fact‑checker and measure F1 lift.

Switch long responses to segment/claim checking to localize and present errors to users.
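
The retrieval and segment-checking steps above can be prototyped together. The sketch below splits a response into sentence-like segments and ranks a small evidence corpus for each segment with a plain Okapi BM25 scorer written from scratch. The splitter, tokenizer, parameter values (k1 = 1.5, b = 0.75) and the toy corpus are all assumptions for illustration; FELM's own segmentation and retrieval setup may differ, and the retrieved passage still has to be handed to a detector or a human reviewer for the actual verdict.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_segments(response: str) -> List[str]:
    """Naive sentence splitter (assumption; not FELM's segmentation)."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        if not docs:
            raise ValueError("corpus must not be empty")
        self.docs, self.k1, self.b = docs, k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = (sum(self.lengths) / len(docs)) or 1.0
        # Document frequency: iterating a Counter yields each term once.
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query: str, idx: int) -> float:
        tf, length = self.tfs[idx], self.lengths[idx]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 1) -> List[Tuple[float, str]]:
        scored = [(self.score(query, i), d) for i, d in enumerate(self.docs)]
        return sorted(scored, key=lambda x: x[0], reverse=True)[:k]


# Toy evidence corpus and response, for illustration only.
corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain above sea level.",
    "Water boils at 100 degrees Celsius at sea level.",
]
response = ("The Eiffel Tower is in Berlin. "
            "Water boils at 100 degrees Celsius at sea level.")

index = BM25(corpus)
for segment in split_segments(response):
    _, evidence = index.top_k(segment, k=1)[0]
    print(f"SEGMENT:  {segment}\nEVIDENCE: {evidence}\n")
```

Pairing each segment with its own evidence keeps the check local: a reviewer sees one claim and one passage at a time instead of a whole response.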

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/hkust-nlp/felm

https://arxiv.org/abs/2310.00741

Risks & Boundaries

Limitations

FELM responses were generated only by ChatGPT, so detectors may perform differently on outputs from other LLMs.

Dataset size is modest (hundreds of samples per domain) due to costly expert annotation.

When Not To Use

As the sole validation for detectors intended to check code generation outputs.

To claim production readiness of an LLM‑only factuality detector without retrieval or human review.

Failure Modes

Judge bias: models find it hard to detect errors they themselves produced (self‑detection gap).

Long responses: sparse errors in long outputs are harder to find, especially in the writing/recommendation domain.

Core Entities

Models

ChatGPT, GPT-4, Vicuna-33B

Metrics

F1, Precision, Recall, Accuracy

Datasets

FELM, GSM8K, MATH, MMLU, TruthfulQA

Benchmarks

FELM, HaluEval, FEVER, SummEval

Context Entities

Models

text-davinci-003

Datasets

FEVER, FactCC, QAGS

