FELM: a fine‑grained benchmark that tests factuality detectors across five domains

October 1, 2023 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.3

Citation Count

12

Authors

Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, Junxian He

Why It Matters For Business

Automated factuality checks are needed: about one in three ChatGPT responses in this benchmark (33.3%) contains a factual error, and current LLM-only detectors miss many of these mistakes. Businesses should add retrieval and human oversight before trusting model outputs.

Summary TLDR

FELM is a new, human‑annotated benchmark for evaluating factuality detectors on long LLM outputs. It contains segment‑level labels, error types, reasons, and reference links across five domains (world knowledge, science/tech, math, reasoning, writing/recommendation). Experiments show that off‑the‑shelf LLMs (ChatGPT, GPT-4, Vicuna) struggle to detect errors reliably. Retrieval helps (about +5.5 to +6.4 segment‑level F1 points), but overall detection remains far from production‑ready.
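To make the annotation structure concrete, the sketch below models one benchmark sample as plain Python dataclasses. The class names, field names, and domain identifiers are hypothetical and chosen only for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical identifiers for the five domains named in the summary.
DOMAINS = (
    "world_knowledge",
    "science_tech",
    "math",
    "reasoning",
    "writing_recommendation",
)


@dataclass
class Segment:
    """One annotated segment of a response (field names are assumptions)."""
    text: str
    is_factual: bool
    error_type: Optional[str] = None   # set only when is_factual is False
    reason: Optional[str] = None       # annotator's explanation of the error
    references: List[str] = field(default_factory=list)  # supporting links


@dataclass
class Sample:
    """One prompt plus its segmented, annotated response."""
    domain: str
    prompt: str
    segments: List[Segment]

    def __post_init__(self) -> None:
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")

    @property
    def has_error(self) -> bool:
        # Assumption: a response counts as erroneous if any segment is non-factual.
        return any(not s.is_factual for s in self.segments)
```

A detector can then be scored per segment while still reporting a response-level verdict through `has_error`.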

Problem Statement

Current factuality benchmarks focus on specific tasks (e.g., summarization) or domains (Wikipedia). We lack a meta‑evaluation dataset that (1) contains authentic LLM errors, (2) covers diverse domains beyond world knowledge, and (3) provides fine‑grained segment labels plus references so we can measure and improve factuality evaluators.

Main Contribution

FELM dataset: human‑annotated, segment‑level factuality labels with error types, reasons, and reference links.

Domain breadth: covers five domains—world knowledge, science/tech, math, reasoning, and writing/recommendation.

Evaluation suite: benchmarks LLM evaluators (Vicuna-33B, ChatGPT, GPT‑4) under vanilla, chain‑of‑thought, link‑augmented, and doc‑augmented settings.

Findings: retrieval/document augmentation consistently improves detection; still, even GPT‑4 is far from reliable across all domains.
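The four evaluation settings differ mainly in what extra context the evaluator sees. The sketch below shows one way to assemble such prompts. The instruction wording, the function `build_prompt`, and its parameters are assumptions for illustration and do not reproduce the paper's actual prompts.

```python
from typing import Optional, Sequence

# Hypothetical instruction text; not the benchmark's original prompt.
BASE_INSTRUCTION = (
    "You are given a question and a response split into numbered segments. "
    "List the numbers of the segments that contain factual errors, "
    "or answer ALL_CORRECT."
)

SETTINGS = {"vanilla", "cot", "link", "doc"}


def build_prompt(
    question: str,
    segments: Sequence[str],
    setting: str = "vanilla",
    links: Optional[Sequence[str]] = None,
    docs: Optional[Sequence[str]] = None,
) -> str:
    """Assemble an evaluator prompt for one of the four settings."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting}")
    parts = [BASE_INSTRUCTION]
    if setting == "cot":
        # Chain-of-thought: ask for reasoning before the verdict.
        parts.append("Reason step by step about each segment before answering.")
    elif setting == "link":
        # Link-augmented: only the reference URLs are shown.
        parts.append("Reference links:\n" + "\n".join(links or []))
    elif setting == "doc":
        # Doc-augmented: retrieved document text is shown.
        parts.append("Reference documents:\n" + "\n\n".join(docs or []))
    numbered = "\n".join(f"{i}. {seg}" for i, seg in enumerate(segments, start=1))
    parts.append(f"Question: {question}\nSegments:\n{numbered}")
    return "\n\n".join(parts)
```

Keeping the segment numbering identical across settings makes it straightforward to compare the same evaluator with and without added context.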

Key Findings

FELM covers five realistic domains and contains thousands of fine‑grained segments.

Numbers: 847 samples, 4,425 segments; average response length 89.1 tokens

A large share of ChatGPT outputs contain factual errors.

Numbers: Response‑level error rate 33.3%

Human labels are consistent and references are reliable.

Numbers: Annotator agreement 91.3%; reference reliability check 100% on 100 samples

Retrieval/document augmentation improves factuality detection.

Numbers: ChatGPT doc: +6.4 F1 (segment); GPT‑4 doc: +5.5 F1 (segment)

Off‑the‑shelf LLM evaluators still perform poorly overall.

Numbers: Only GPT‑4 achieves overall F1 >40 in some settings; many ChatGPT detectors fail without retrieval

Results

Response-level error rate

Value: 33.3%

Accuracy

Value: 67.1%

Baseline: Vanilla GPT-4 (segment level), 60.7%

Segment-level F1 increase from retrieval (ChatGPT)

Value: +6.4 F1 (avg)

Baseline: ChatGPT vanilla

Annotator agreement

Value: 91.3%
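For readers who want to reproduce numbers of this kind on their own detector, the sketch below computes segment-level precision, recall, and F1, plus a response-level error rate. It assumes that "contains a factual error" is the positive class and that a response counts as erroneous if any of its segments is. Both conventions are assumptions here and should be checked against the benchmark's own evaluation script.

```python
from typing import Dict, List


def segment_prf(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Precision/recall/F1 where True means 'segment contains an error'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def response_error_rate(responses: List[List[bool]]) -> float:
    """Share of responses with at least one erroneous segment."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(any(labels) for labels in responses) / len(responses)


if __name__ == "__main__":
    gold = [True, False, True, False]
    pred = [True, True, False, False]
    print(segment_prf(gold, pred))
    print(response_error_rate([[False, False], [True, False], [False]]))
```

An F1 lift from retrieval is then the difference between `segment_prf(...)["f1"]` for the augmented and vanilla runs on the same segments.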

What To Try In 7 Days

Run FELM's public dataset on your detector to get a domain‑diverse baseline.

Add simple retrieval (BM25 + source links) to your fact‑checker and measure F1 lift.

Switch long responses to segment/claim checking to localize and present errors to users.
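The retrieval and segment-checking steps above can be prototyped without external dependencies. The sketch below is a minimal, self-contained BM25 scorer (using the always-non-negative IDF variant popularized by Lucene) together with a naive sentence splitter. The class `BM25`, the helpers `tokenize` and `split_segments`, and the parameter defaults are illustrative assumptions, not part of FELM or of any particular library.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_segments(response: str) -> List[str]:
    """Naive sentence splitter standing in for a real segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = docs
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]
        self.lengths = [len(t) for t in tokens]
        self.avgdl = sum(self.lengths) / len(docs) if docs else 0.0
        n = len(docs)
        df = Counter(term for t in tokens for term in set(t))
        # IDF variant that stays non-negative even for very common terms.
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query: str, i: int) -> float:
        tf, dl = self.tf[i], self.lengths[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        """Return (doc index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    corpus = [
        "The Eiffel Tower is in Paris.",
        "Mount Everest is the highest mountain.",
        "Python is a programming language.",
    ]
    index = BM25(corpus)
    for seg in split_segments("The Eiffel Tower is in Rome. Python is a language."):
        print(seg, "->", index.top_k(seg, k=1))
```

Retrieving evidence per segment rather than per response keeps each query short and lets the checker report which segment a piece of evidence supports or contradicts.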

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • FELM responses were generated only by ChatGPT, so detectors may perform differently on outputs from other LLMs.
  • Dataset size is modest (847 samples across five domains) due to costly expert annotation.
  • Code generation and other application scenarios were not included and should be added later.

When Not To Use

  • As the sole validation for detectors intended to check code generation outputs.
  • To claim production readiness of an LLM‑only factuality detector without retrieval or human review.
  • For domains not represented in FELM (e.g., multimodal or long‑running dialogue states).

Failure Modes

  • Judge bias: models find it hard to detect errors they themselves produced (self‑detection gap).
  • Long responses: sparse errors in long outputs are harder to find, especially in the writing/recommendation domain.
  • Claim extraction limits: math and multi‑step reasoning samples resist atomic claim extraction, breaking claim‑based checks.

Core Entities

Models

  • ChatGPT
  • GPT-4
  • Vicuna-33B

Metrics

  • F1
  • Precision
  • Recall
  • Accuracy

Datasets

  • FELM
  • GSM8K
  • MATH
  • MMLU
  • TruthfulQA

Benchmarks

  • FELM
  • HaluEval
  • FEVER
  • SummEval

Context Entities

Models

  • text-davinci-003

Datasets

  • FEVER
  • FactCC
  • QAGS