FEWL: score and reduce LLM hallucination using other LLMs instead of human gold labels

February 16, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

FEWL is practically usable: it delivers consistent benchmark gains at low evaluation cost, but it relies on at least one reasonably competent reference LLM and needs extra queries to generate intentionally wrong (IW) and corrected (CO) answers and to search for neighboring questions.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 55%

Authors

Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu

Links

Abstract / PDF / Data

Why It Matters For Business

FEWL lets teams detect and reduce hallucination cheaply when human gold labels are unavailable, cutting annotation cost and speeding up iteration on safety and quality.

Who Should Care

Summary TLDR

The paper introduces FEWL, a practical metric that scores how much an LLM answer hallucinates when no human 'gold' answers exist. FEWL uses multiple off‑the‑shelf LLMs as reference judges, estimates each reference's expertise by checking how strongly it disagrees with intentionally wrong answers (IW) and agrees with corrected versions (CO), and penalizes 'lazy' answers that repeat similar responses across nearby questions. The method has a theoretical guarantee (in expectation) to prefer the least-hallucinating model, improves automatic measurement accuracy on CHALE/Truthful-QA/HaluEval versus naive baselines, is cheap to run, and can guide in-context learning and label-free fine-tuning to reduce hallucination.
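The 'lazy answer' penalty mentioned above can be pictured with a toy sketch. This is not the paper's formula: the token-overlap similarity `jaccard`, the function `laziness_penalty`, and the example strings are all hypothetical stand-ins chosen for illustration. The idea shown is only that an answer which looks almost the same as the answers given to neighboring questions receives a larger penalty than a specific one.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in for any text similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def laziness_penalty(answer: str, neighbor_answers: list[str]) -> float:
    """Mean similarity between an answer and the answers given to nearby questions.

    Hypothetical helper: a generic answer that fits every neighbor scores high.
    """
    if not neighbor_answers:
        return 0.0
    return sum(jaccard(answer, n) for n in neighbor_answers) / len(neighbor_answers)


# Answers produced for two neighboring questions (made-up examples).
neighbors = ["it depends on many factors", "it depends on several factors"]

lazy = laziness_penalty("it depends on many factors", neighbors)
specific = laziness_penalty("water boils at 100 degrees celsius at sea level", neighbors)

print(f"lazy answer penalty: {lazy:.3f}")
print(f"specific answer penalty: {specific:.3f}")
```

In a real setup the similarity would more plausibly come from an embedding model or an LLM judge; token overlap is used here only to keep the sketch dependency-free.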

Problem Statement

Measuring LLM hallucination usually needs human-written gold answers. That is costly and error-prone. The goal: build an automatic, low-cost hallucination metric that works when no gold answers exist and can also help reduce hallucination.

Main Contribution

FEWL metric: weighs multiple off‑the‑shelf LLMs to score an answer's factualness without human gold answers.

Expertise estimation: infer each reference LLM's reliability by measuring disagreement with intentionally wrong answers and agreement with corrected variants.
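The two contributions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the paper's actual estimator: the functions `expertise_weights` and `weighted_score`, the "CO agreement minus IW agreement" rule, and every number are hypothetical. It assumes each reference LLM has already been scored in [0, 1] on how much it agrees with intentionally wrong answers (lower is better) and with corrected answers (higher is better).

```python
def expertise_weights(iw_agreement: list[float], co_agreement: list[float]) -> list[float]:
    """Turn per-reference IW/CO agreement into normalized weights.

    Assumed rule: expertise = agreement with corrected answers
    minus agreement with intentionally wrong answers, floored at 0.
    """
    raw = [max(0.0, co - iw) for iw, co in zip(iw_agreement, co_agreement)]
    total = sum(raw)
    if total == 0:
        # No reference shows measurable expertise: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_score(similarities: list[float], weights: list[float], penalty: float = 0.0) -> float:
    """Expertise-weighted agreement with the references, minus an optional penalty."""
    return sum(w * s for w, s in zip(weights, similarities)) - penalty


# Reference 0 rejects wrong answers and accepts corrections; reference 1 barely separates them.
weights = expertise_weights(iw_agreement=[0.1, 0.6], co_agreement=[0.9, 0.7])

# Similarity of the candidate answer to each reference's own answer (made-up values).
score = weighted_score([0.9, 0.2], weights)

print("weights:", [round(w, 3) for w in weights])
print("score:", round(score, 3))
```

The design point is that the more discriminating reference dominates the score, so a candidate answer that agrees with it is rated higher than a plain average over the references would suggest.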

Key Findings

FEWL gives more accurate hallucination scores than simple baselines on CHALE.

Numbers: FEWL 70.36% vs best baseline ~68.95% (Non-hallu vs Hallu on CHALE)

Practical Use: Use FEWL instead of single-LLM similarity to better distinguish correct vs hallucinated answers when no gold labels exist.

Evidence Ref: Table 1 (CHALE, GPT-4/GPT-3.5 comparisons)

FEWL achieves near-perfect measurement accuracy on HaluEval, a large QA benchmark.

Numbers: HaluEval measurement accuracy: FEWL 98.15% (vs 94.33% baseline)

Practical Use: FEWL can reliably flag hallucinations on large QA datasets without human answers.

Evidence Ref: Appendix C.6, Table 10 (HaluEval)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | FEWL (GPT-4): 73.18% ±0.20 | single + no penalty (GPT-4): 65.88% ±0.12 | +7.3 pp | CHALE | Table 1 (main paper) | Table 1
CHALE: Non-hallu vs Hallu (FEWL with GPT-3.5) | FEWL (GPT-3.5): 70.36% ±0.33 | single + no penalty (GPT-3.5): 62.89% ±0.28 | +7.47 pp | CHALE | Table 1 (main paper) | Table 1

What To Try In 7 Days

Run FEWL on a problem area where you lack gold labels to spot common hallucination topics.

Use FEWL-selected examples as in-context prompts to re-run generation and compare outputs.

Build a small label-free SFT pipeline: fine-tune a small model on FEWL-selected best answers and evaluate improvements.
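The selection step behind the last two suggestions could look like the sketch below. It assumes candidate answers have already been scored by some FEWL-style scorer; the function `select_best`, the `min_score` threshold, the prompt/completion record layout, and the sample data are hypothetical and not taken from the paper.

```python
def select_best(
    candidates: dict[str, list[tuple[str, float]]],
    min_score: float = 0.0,
) -> list[dict[str, str]]:
    """Keep the highest-scoring answer per question, if it clears min_score.

    `candidates` maps each question to (answer, score) pairs, where a higher
    score is assumed to mean less hallucination.
    """
    selected = []
    for question, scored_answers in candidates.items():
        if not scored_answers:
            continue
        answer, score = max(scored_answers, key=lambda pair: pair[1])
        if score >= min_score:
            selected.append({"prompt": question, "completion": answer})
    return selected


# Made-up scored candidates for two questions.
scored = {
    "What is the boiling point of water at sea level?": [
        ("100 degrees Celsius", 0.82),
        ("It depends on many factors", 0.10),
    ],
    "Who wrote the report?": [
        ("Possibly someone famous", 0.05),
    ],
}

examples = select_best(scored, min_score=0.5)
for example in examples:
    print(example)
```

The resulting records can be pasted into a prompt as in-context demonstrations or written out as a small fine-tuning set; the threshold drops questions where even the best candidate scores poorly.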

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Needs at least one reasonably expert reference LLM; if all references lack domain knowledge, FEWL fails.

Slower than directly comparing to gold answers because IW/CO generation and neighbor searches add queries.

When Not To Use

When no reference LLM has any expertise on the domain or topic.

When you need single-query, ultra-low-latency scoring in production.

Failure Modes

Reference LLM judge bias: a powerful but biased reference can systematically mis-rank answers.

Poor IW/CO generation: low-quality synthetic wrong answers or corrections can skew the estimated expertise weights.

Core Entities

Models

GPT-3.5, GPT-4, Falcon-7B, Flan-t5-large, Flan-alpaca-base, Flan-alpaca-large, LLaMA, OPT-1.3B, Text-davinci-003, Flan-t5-base, GPT-35-turbo

Metrics

FEWL score, Accuracy, SFT

Datasets

CHALE, Truthful-QA, HaluEval

Benchmarks

Truthful-QA multiple-choice, CHALE QA evaluation, HaluEval QA