FEWL: score and reduce LLM hallucination using other LLMs instead of human gold labels

February 16, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.8

Citation Count

8

Authors

Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu

Why It Matters For Business

FEWL lets teams detect and reduce hallucination cheaply when human gold labels are unavailable, cutting annotation cost and speeding up iteration on safety and quality.

Summary TLDR

The paper introduces FEWL, a practical metric that scores how much an LLM answer hallucinates when no human 'gold' answers exist. FEWL uses multiple off‑the‑shelf LLMs as reference judges. It estimates each reference's expertise by checking how strongly it disagrees with intentionally wrong answers (IW) and agrees with corrected versions (CO), and it penalizes 'lazy' answers that repeat similar responses across nearby questions. The method comes with a theoretical guarantee that it prefers the least-hallucinating model in expectation, improves automatic measurement accuracy on CHALE, Truthful-QA, and HaluEval over naive baselines, is cheap to run, and can guide in-context learning and label-free fine-tuning to reduce hallucination.

Problem Statement

Measuring LLM hallucination usually needs human-written gold answers. That is costly and error-prone. The goal: build an automatic, low-cost hallucination metric that works when no gold answers exist and can also help reduce hallucination.

Main Contribution

FEWL metric: weighs multiple off‑the‑shelf LLMs to score an answer's factualness without human gold answers.

Expertise estimation: infer each reference LLM's reliability by measuring disagreement with intentionally wrong answers and agreement with corrected variants.

Laziness penalty: penalize reference LLMs that give similar answers across topically similar questions, so the score does not rely on superficial patterns.

Theoretical guarantee: under mild assumptions FEWL tends to rank the least-hallucinating model highest in expectation.

Practical uses: FEWL improves automated measurement accuracy and can guide in-context learning and label-free supervised fine-tuning.
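The three ingredients above (reference weighting, expertise estimation, laziness penalty) can be sketched in a few lines. This is a minimal illustration and not the paper's exact formula: the word-overlap similarity stands in for an LLM judge, the softmax weighting is an assumed choice, and every function name here is hypothetical.

```python
import math

def token_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets (stand-in for an LLM judge)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def expertise_weights(ref_answers, wrong_answers, corrected_answers, sim=token_overlap):
    """One weight per reference answer.

    A reference looks more expert when it agrees with corrected (CO) answers
    and disagrees with intentionally wrong (IW) answers.
    Assumption: raw scores are normalized with a softmax.
    """
    raw = []
    for ref in ref_answers:
        agree_co = sum(sim(ref, c) for c in corrected_answers) / len(corrected_answers)
        agree_iw = sum(sim(ref, w) for w in wrong_answers) / len(wrong_answers)
        raw.append(agree_co - agree_iw)
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def fewl_like_score(answer, ref_answers, neighbor_ref_answers, weights, sim=token_overlap):
    """Weighted agreement with each reference, minus a laziness penalty.

    neighbor_ref_answers[i] holds reference i's answers to neighboring questions.
    An answer that matches those just as well is probably generic, so that
    average similarity is subtracted (one plausible reading of the penalty).
    """
    score = 0.0
    for ref, neighbors, weight in zip(ref_answers, neighbor_ref_answers, weights):
        agreement = sim(answer, ref)
        laziness = sum(sim(answer, n) for n in neighbors) / len(neighbors) if neighbors else 0.0
        score += weight * (agreement - laziness)
    return score
```

In use, the weights are computed once per question from that question's IW and CO answers, and each candidate answer is then scored against the same references; a higher score suggests less hallucination under these toy assumptions.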

Key Findings

FEWL gives more accurate hallucination scores than simple baselines on CHALE.

Numbers: FEWL 70.36% vs best baseline ~68.95% (Non-hallu vs Hallu on CHALE)

FEWL achieves near-perfect measurement accuracy on the HaluEval benchmark.

Numbers: HaluEval measurement accuracy, FEWL 98.15% (vs 94.33% baseline)

Label-free supervised fine-tuning guided by FEWL improves model outputs.

Numbers: SFT win rate (GPT-4 selected data), FEWL 71.58% vs baseline 66.67% (judged by GPT-4)

FEWL is much cheaper than hiring human annotators for hallucination evaluation.

Numbers: Cost, FEWL ≲ $0.3 per 1K samples with GPT-3.5 vs more than $16/hour for human annotators

FEWL has a theoretical guarantee that, under mild assumptions, it prefers the least-hallucinating model in expectation.
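The cost figures above use different units (per 1K samples vs per hour), so a direct comparison needs an annotator throughput. The sketch below does that arithmetic. The throughput of 60 samples per hour is purely a hypothetical assumption for illustration, and the two rates are the bounds quoted above rather than exact prices.

```python
def llm_cost(num_samples: int, cost_per_1k: float = 0.30) -> float:
    """Approximate FEWL cost in dollars, using the quoted upper bound per 1K samples."""
    return num_samples / 1000 * cost_per_1k

def human_cost(num_samples: int, samples_per_hour: float, hourly_rate: float = 16.0) -> float:
    """Approximate annotation cost in dollars.

    samples_per_hour is NOT from the source; it must be supplied by the reader.
    hourly_rate uses the quoted lower bound of $16/hour.
    """
    return num_samples / samples_per_hour * hourly_rate

# Example with a hypothetical throughput of 60 samples per hour.
n = 10_000
print(f"LLM:   ${llm_cost(n):.2f}")
print(f"Human: ${human_cost(n, samples_per_hour=60):.2f}")
```

Changing the assumed throughput shifts the human figure proportionally, so the comparison is only as good as that estimate.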

Results

Accuracy

Value: FEWL (GPT-4), 73.18% ± 0.20

Baseline: single + no penalty (GPT-4), 65.88% ± 0.12

CHALE: Non-hallu vs Hallu (FEWL with GPT-3.5)

Value: FEWL (GPT-3.5), 70.36% ± 0.33

Baseline: single + no penalty (GPT-3.5), 62.89% ± 0.28

HaluEval: Accuracy

Value: FEWL, 98.15% ± 0.14

Baseline: single + no penalty, 94.33% ± 0.14

Truthful-QA: Count of 'best' answers ranked highest

Value: FEWL (GPT-4), 202 ± 5.59 (higher is better)

Baseline: single + no penalty (varies), ~178 ± 2.68

SFT: Win rate (judged by GPT-4)

Value: FEWL (GPT-4-selected samples), 71.58%

Baseline: baseline selection (GPT-4), 66.67%
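For readers who want to reproduce a "Non-hallu vs Hallu" style number on their own data, here is a minimal sketch. It assumes accuracy means the fraction of questions where the metric scores the non-hallucinated answer strictly higher than the hallucinated one; that reading, the tie handling, and the function name are assumptions rather than the paper's stated protocol.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of pairs where the non-hallucinated answer outscores the hallucinated one.

    score_pairs: list of (score_non_hallucinated, score_hallucinated) tuples.
    Ties are counted as errors (an assumption; other conventions exist).
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for good, bad in score_pairs if good > bad)
    return correct / len(score_pairs)

# Hypothetical metric scores for four questions.
pairs = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.2), (0.5, 0.5)]
print(f"accuracy = {pairwise_accuracy(pairs):.2%}")
```

Running the same evaluation over several random seeds and reporting the spread is one way to obtain "±" figures like those above.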

Who Should Care

What To Try In 7 Days

Run FEWL on a problem area where you lack gold labels to spot common hallucination topics.

Use FEWL-selected examples as in-context prompts to re-run generation and compare outputs.

Build a small label-free SFT pipeline: fine-tune a small model on FEWL-selected best answers and evaluate improvements.
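The last two steps can start from the same helper: pick the highest-scoring answer per question, then reuse those picks as few-shot examples or as fine-tuning records. The sketch below assumes scores are already computed (by FEWL or any stand-in). The record fields "prompt" and "response", the Q/A prompt layout, and all function names are hypothetical and should be adapted to your training stack.

```python
import json

def select_best(candidates):
    """Keep the top-scored answer per question.

    candidates: dict mapping question -> list of (answer, score) tuples.
    Questions with no candidates are skipped.
    """
    records = []
    for question, scored in candidates.items():
        if not scored:
            continue
        best_answer, _ = max(scored, key=lambda pair: pair[1])
        records.append({"prompt": question, "response": best_answer})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, a common input format for SFT tooling."""
    return "\n".join(json.dumps(r) for r in records)

def build_icl_prompt(records, new_question, k=2):
    """Use the first k selected records as in-context examples for a new question."""
    parts = [f"Q: {r['prompt']}\nA: {r['response']}" for r in records[:k]]
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

# Hypothetical scored candidates.
scored = {"What is the capital of France?": [("Lyon", 0.12), ("Paris", 0.87)]}
best = select_best(scored)
print(to_jsonl(best))
print(build_icl_prompt(best, "What is the capital of Germany?"))
```

Comparing outputs generated with and without the selected examples gives a quick check on whether the selection helps before committing to fine-tuning.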

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Needs at least one reasonably expert reference LLM; if every reference lacks domain knowledge, FEWL fails.
  • Slower than directly comparing to gold answers because IW/CO generation and neighbor searches add queries.
  • Laziness penalty depends on good neighbor-question selection; small or noisy datasets reduce effectiveness.

When Not To Use

  • When no reference LLM has any expertise on the domain or topic.
  • When you need single-query, ultra-low-latency scoring in production.
  • When you can afford high-quality human gold labels for the target task.

Failure Modes

  • Reference LLM judge bias: a powerful but biased reference can systematically mis-rank answers.
  • Poor IW/CO generation: low-quality synthetic wrong answers or corrections can distort the estimated expertise weights.
  • Neighbor questions that are too similar or too dissimilar can make the laziness penalty misfire.
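One way to guard against the neighbor-selection failure mode is to keep only neighbors whose similarity to the target question falls inside a band, dropping near-duplicates and unrelated questions alike. The sketch below is an illustration, not the paper's procedure: the word-overlap similarity, the 0.2 and 0.8 thresholds, and the function names are all assumptions to tune on your own data.

```python
def jaccard(a: str, b: str) -> float:
    """Toy question similarity: Jaccard overlap of lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def select_neighbors(question, pool, low=0.2, high=0.8, k=3):
    """Return up to k questions from pool whose similarity lies in [low, high].

    Above `high` the question is treated as a near-duplicate; below `low` it
    is treated as off-topic. Both thresholds are hypothetical defaults.
    """
    scored = []
    for other in pool:
        if other == question:
            continue
        s = jaccard(question, other)
        if low <= s <= high:
            scored.append((s, other))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:k]]

pool = [
    "what is the capital of france",
    "what is the capital of germany",
    "what is the capital of france ?",
    "how do plants grow",
]
print(select_neighbors("what is the capital of france", pool))
```

In practice an embedding-based similarity would likely replace the word overlap, but the band idea carries over unchanged.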

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • Falcon-7B
  • Flan-t5-large
  • Flan-alpaca-base
  • Flan-alpaca-large
  • LLaMA
  • OPT-1.3B
  • Text-davinci-003
  • Flan-t5-base
  • GPT-35-turbo

Metrics

  • FEWL score
  • Accuracy
  • SFT win rate

Datasets

  • CHALE
  • Truthful-QA
  • HaluEval

Benchmarks

  • Truthful-QA multiple-choice
  • CHALE QA evaluation
  • HaluEval QA