FEWL: score and reduce LLM hallucination using other LLMs instead of human gold labels

Overview

Decision Snapshot: Ready For Pilot

FEWL is practically usable: it delivers consistent gains on benchmarks at low evaluation cost, but it relies on at least one reasonably competent reference LLM and needs extra queries to produce intentionally wrong (IW) and corrected (CO) answers and to look up neighboring questions.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 55%

Authors

Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu

Links

Abstract / PDF / Data

Why It Matters For Business

FEWL lets teams detect and reduce hallucination cheaply when human gold labels are unavailable, cutting annotation cost and speeding up iteration on safety and quality.

Who Should Care

Product Manager, CTO, ML Engineer, Data Scientist, Founder

Summary TLDR

The paper introduces FEWL, a practical metric that scores how much an LLM answer hallucinates when no human 'gold' answers exist. FEWL uses multiple off‑the‑shelf LLMs as reference judges, estimates each reference's expertise by checking how much it disagrees with intentionally wrong answers (IW) and agrees with corrected versions (CO), and penalizes 'lazy' answers that repeat similar responses across nearby questions. The method comes with a theoretical guarantee that, in expectation, it prefers the least-hallucinating model. It improves automatic measurement accuracy on CHALE, Truthful-QA, and HaluEval over naive baselines, is cheap to run, and can guide in-context learning and label-free fine-tuning to reduce hallucination.
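The laziness penalty described above can be illustrated with a small sketch. This is not the paper's formula: `token_overlap`, `laziness_penalty`, `penalized_score`, and the `strength` parameter are hypothetical names, and similarity is approximated with a simple token-overlap measure. The idea shown is only that an answer which looks the same across neighboring questions loses score.

```python
def token_overlap(a: str, b: str) -> float:
    """Hypothetical similarity: Jaccard overlap of lowercase tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def laziness_penalty(answer: str, neighbor_answers: list[str]) -> float:
    """Mean similarity between an answer and the answers given to nearby questions."""
    if not neighbor_answers:
        return 0.0
    total = sum(token_overlap(answer, n) for n in neighbor_answers)
    return total / len(neighbor_answers)


def penalized_score(base_score: float, answer: str,
                    neighbor_answers: list[str], strength: float = 0.5) -> float:
    """Subtract the scaled penalty from a base score (strength is an assumed knob)."""
    return base_score - strength * laziness_penalty(answer, neighbor_answers)


# A generic answer repeated for every nearby question is penalized fully...
lazy = penalized_score(0.6, "i am not sure", ["i am not sure", "i am not sure"])
# ...while a question-specific answer is not penalized at all.
specific = penalized_score(0.6, "canberra", ["wellington", "ottawa"])
```

In this toy setup the specific answer keeps its base score while the repeated one drops, which is the behavior the penalty is meant to produce.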

Problem Statement

Measuring LLM hallucination usually needs human-written gold answers. That is costly and error-prone. The goal: build an automatic, low-cost hallucination metric that works when no gold answers exist and can also help reduce hallucination.

Main Contribution

FEWL metric: weighs multiple off‑the‑shelf LLMs to score an answer's factualness without human gold answers.

Expertise estimation: infer each reference LLM's reliability by measuring disagreement with intentionally wrong answers and agreement with corrected variants.
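The two contributions above can be sketched together. The code below is a minimal illustration under stated assumptions, not the paper's exact formulation: `agreement` is a hypothetical token-overlap similarity, each reference LLM is assumed to have one reference answer plus one IW and one CO answer, and expertise is taken as how much more a reference agrees with the corrected answer than with the intentionally wrong one.

```python
def agreement(a: str, b: str) -> float:
    """Hypothetical similarity: Jaccard overlap of lowercase tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def expertise_weights(ref_answers, iw_answers, co_answers):
    """Weight each reference by agreement with CO minus agreement with IW.

    Negative gaps are floored at 0; weights are normalized to sum to 1.
    Falls back to uniform weights if no reference shows any expertise.
    """
    raw = [
        max(agreement(ref, co) - agreement(ref, iw), 0.0)
        for ref, iw, co in zip(ref_answers, iw_answers, co_answers)
    ]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_score(candidate, ref_answers, weights):
    """Expertise-weighted agreement between a candidate answer and the references."""
    return sum(w * agreement(candidate, ref) for w, ref in zip(weights, ref_answers))


# Toy example: the first reference rejects the wrong answer, the second repeats it.
refs = ["paris is the capital of france", "lyon is the capital of france"]
iw = ["lyon is the capital of france"] * 2
co = ["paris is the capital of france"] * 2
weights = expertise_weights(refs, iw, co)
good = weighted_score("paris is the capital of france", refs, weights)
bad = weighted_score("lyon is the capital of france", refs, weights)
```

Here the reference that sides with the wrong answer receives zero weight, so the candidate matching the more expert reference scores higher.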

Key Findings

FEWL gives more accurate hallucination scores than simple baselines on CHALE.

Numbers: FEWL 70.36% vs. best baseline ~68.95% (Non-hallu vs. Hallu on CHALE)

Practical Use: Use FEWL instead of single-LLM similarity to better distinguish correct from hallucinated answers when no gold labels exist.

Evidence Ref: Table 1 (CHALE, GPT-4/GPT-3.5 comparisons)

FEWL achieves near-perfect measurement on a large benchmark.

Numbers: HaluEval measurement accuracy, FEWL 98.15% (vs. 94.33% baseline)

Practical Use: FEWL can reliably flag hallucinations on large QA datasets without human answers.

Evidence Ref: Appendix C.6, Table 10 (HaluEval)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	FEWL (GPT-4): 73.18% ±0.20	single + no penalty (GPT-4): 65.88% ±0.12	+7.3 pp	CHALE	Table 1 (main paper)	Table 1
CHALE: Non-hallu vs Hallu (FEWL with GPT-3.5)	FEWL (GPT-3.5): 70.36% ±0.33	single + no penalty (GPT-3.5): 62.89% ±0.28	+7.47 pp	CHALE	Table 1 (main paper)	Table 1
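The Delta column is the difference in percentage points between the FEWL accuracy and its baseline. A short check of that arithmetic, using only the values reported in the table (the helper name `delta_pp` is illustrative):

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


# (label, FEWL accuracy %, baseline accuracy %) as listed in the Results table.
rows = [
    ("FEWL (GPT-4)", 73.18, 65.88),
    ("FEWL (GPT-3.5)", 70.36, 62.89),
]

deltas = {label: delta_pp(value, baseline) for label, value, baseline in rows}
for label, delta in deltas.items():
    print(f"{label}: +{delta} pp")
```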

What To Try In 7 Days

Run FEWL on a problem area where you lack gold labels to spot common hallucination topics.

Use FEWL-selected examples as in-context prompts to re-run generation and compare outputs.

Build a small label-free SFT pipeline: fine-tune a small model on FEWL-selected best answers and evaluate improvements.
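The selection step shared by the last two suggestions can be sketched as follows. This assumes FEWL scores have already been computed elsewhere; the function names, the score values, and the `prompt`/`completion` record layout are hypothetical choices for illustration, not something the paper prescribes.

```python
def select_best(scored):
    """Keep the highest-scoring answer per question.

    `scored` is a list of (question, answer, score) tuples, where `score`
    is assumed to be a precomputed FEWL score (higher means less hallucination).
    """
    best = {}
    for question, answer, score in scored:
        if question not in best or score > best[question][1]:
            best[question] = (answer, score)
    return {q: a for q, (a, _) in best.items()}


def to_sft_records(best):
    """Turn selected answers into prompt/completion pairs (assumed record layout)."""
    return [{"prompt": q, "completion": a} for q, a in best.items()]


# Toy candidates with made-up scores.
scored = [
    ("Capital of Australia?", "Sydney", 0.21),
    ("Capital of Australia?", "Canberra", 0.83),
    ("Boiling point of water at sea level?", "100 degrees Celsius", 0.90),
]
records = to_sft_records(select_best(scored))
```

The same selected pairs could be placed in a prompt as in-context examples or written out as a fine-tuning set, depending on which experiment is being run.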

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/weijiaheng/CHALE

https://github.com/manyoso/haltt4llm

Risks & Boundaries

Limitations

Needs at least one reasonably expert reference LLM; if all references lack domain knowledge, FEWL fails.

Slower than directly comparing to gold answers because IW/CO generation and neighbor searches add queries.

When Not To Use

When no reference LLM has any expertise on the domain or topic.

When you need single-query, ultra-low-latency scoring in production.

Failure Modes

Reference LLM judge bias: a powerful but biased reference can systematically mis-rank answers.

Poor IW/CO generation: low-quality synthetic wrong answers or corrections can skew the estimated expertise weights.

Core Entities

Models

GPT-3.5, GPT-4, Falcon-7B, Flan-t5-large, Flan-alpaca-base, Flan-alpaca-large, LLaMA, OPT-1.3B, Text-davinci-003, Flan-t5-base, GPT-35-turbo

Metrics

FEWL score, Accuracy, SFT

Datasets

CHALE, Truthful-QA, HaluEval

Benchmarks

Truthful-QA multiple-choice, CHALE QA evaluation, HaluEval QA

