FELM: a fine‑grained benchmark that tests factuality detectors across five domains

October 1, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

FELM is a useful, trustworthy benchmark for testing factuality detectors across domains; expect measurable retrieval gains but not turnkey production detectors.

Citations: 12

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 30%

Novelty: 40%

Authors

Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, Junxian He

Why It Matters For Business

Automated factuality checks are needed: one in three long ChatGPT responses in this benchmark contains an error, and current LLM-only detectors miss many of those mistakes. Businesses should add retrieval and human oversight before trusting model outputs.

Summary TLDR

FELM is a new, human‑annotated benchmark for evaluating factuality detectors on long LLM outputs. It contains segment‑level labels, error types, reasons, and reference links across five domains (world knowledge, science/tech, math, reasoning, writing/recommendation). Experiments show off‑the‑shelf LLMs (ChatGPT, GPT-4, Vicuna) struggle to reliably detect errors; retrieval helps (≈5–6 F1 points), but overall detection remains far from production‑ready.

Problem Statement

Current factuality benchmarks focus on specific tasks (e.g., summarization) or domains (Wikipedia). We lack a meta‑evaluation dataset that (1) contains authentic LLM errors, (2) covers diverse domains beyond world knowledge, and (3) provides fine‑grained segment labels plus references so we can measure and improve factuality evaluators.

Main Contribution

FELM dataset: human‑annotated, segment‑level factuality labels with error types, reasons, and reference links.

Domain breadth: covers five domains—world knowledge, science/tech, math, reasoning, and writing/recommendation.

Key Findings

FELM covers five realistic domains and contains thousands of fine‑grained segments.

Numbers: 847 samples, 4,425 segments; avg response 89.1 tokens

Practical Use: Use FELM to test detectors on varied, real LLM errors rather than only Wikipedia or summaries.

Evidence Ref: Table 2 (§3.3)

A large share of ChatGPT outputs contain factual errors.

Numbers: Response‑level error rate 33.3%

Practical Use: Expect about one in three long LLM responses to include at least one factual error in similar zero‑shot settings.

Evidence Ref: Table 2 (§3.3)
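
To make the response‑level figure concrete, here is a minimal sketch of how such a rate can be derived from segment‑level labels. It assumes a response counts as erroneous when at least one of its segments is annotated as incorrect, which matches the "at least one factual error" reading above. The function name, the input layout, and the toy data are illustrative assumptions, not FELM's actual file format.

```python
from typing import Dict, List


def response_error_rate(segment_labels: Dict[str, List[bool]]) -> float:
    """Fraction of responses with at least one segment labeled as an error.

    segment_labels maps a response id to one boolean per segment,
    where True means the segment was annotated as factually incorrect.
    (Hypothetical layout, chosen for illustration only.)
    """
    if not segment_labels:
        raise ValueError("no responses given")
    flagged = sum(1 for labels in segment_labels.values() if any(labels))
    return flagged / len(segment_labels)


# Toy data, made up for illustration; not drawn from FELM.
toy = {
    "r1": [False, False, True],   # one bad segment -> response counts as erroneous
    "r2": [False, False],
    "r3": [False, False, False, False],
    "r4": [False],
}
print(f"{response_error_rate(toy):.1%}")  # prints 25.0%
```

Note that a response with one bad segment and a response with five bad segments count the same here, which is why the segment‑level labels are reported separately.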

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Response-level error rate | 33.3% | - | - | FELM (all domains) | Table 2: Error rate | §3.3 Table 2
Accuracy | 67.1% | Vanilla GPT-4 seg 60.7% | +6.4 pts | FELM (overall) | Table 10 | Table 10

What To Try In 7 Days

Run FELM's public dataset on your detector to get a domain‑diverse baseline.

Add simple retrieval (BM25 + source links) to your fact‑checker and measure F1 lift.

Check long responses segment by segment (or claim by claim) so errors can be localized and shown to users.
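
The steps above can be sketched as a small scoring harness: score each detector's segment‑level predictions, then compare the LLM‑only run against the retrieval‑augmented run to get the F1 lift. This is a sketch under stated assumptions: it treats "segment contains an error" as the positive class, and the prediction lists are toy values standing in for the output of your own detector. None of the names or numbers below come from FELM's tooling or results.

```python
from typing import List, NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def error_class_scores(gold: List[bool], pred: List[bool]) -> Scores:
    """Precision, recall, and F1 with 'segment contains an error' as positive.

    gold and pred hold one boolean per segment, True meaning 'error'.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Toy labels and predictions, made up for illustration only.
gold = [True, False, True, False, True, False]
llm_only = [True, False, False, False, False, True]        # hypothetical detector output
with_retrieval = [True, False, True, False, False, True]   # hypothetical detector output

base = error_class_scores(gold, llm_only)
augmented = error_class_scores(gold, with_retrieval)
lift_points = (augmented.f1 - base.f1) * 100
print(f"F1 {base.f1:.3f} -> {augmented.f1:.3f} ({lift_points:+.1f} points)")
```

Reporting per‑domain scores alongside the overall number is worth the extra loop, since the domains differ widely in response length and error density.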

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

FELM responses were generated only by ChatGPT, so detectors may perform differently on outputs from other LLMs.

Dataset size is modest (hundreds of samples per domain) due to costly expert annotation.

When Not To Use

As the sole validation for detectors intended to check code generation outputs.

To claim production readiness of an LLM‑only factuality detector without retrieval or human review.

Failure Modes

Judge bias: models find it hard to detect errors they themselves produced (self‑detection gap).

Long responses: sparse errors in long outputs are harder to find, especially in the writing/recommendation domain.

Core Entities

Models

ChatGPT, GPT-4, Vicuna-33B

Metrics

F1, Precision, Recall, Accuracy

Datasets

FELM, GSM8K, MATH, MMLU, TruthfulQA

Benchmarks

FELM, HaluEval, FEVER, SummEval

Context Entities

Models

text-davinci-003

Datasets

FEVER, FactCC, QAGS