Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.8
Citation Count
8
Why It Matters For Business
FEWL lets teams detect and reduce hallucination cheaply when human gold labels are unavailable, cutting annotation cost and speeding up iteration on safety and quality.
Summary TLDR
The paper introduces FEWL, a practical metric that scores how much an LLM answer hallucinates when no human 'gold' answers exist. FEWL uses multiple off‑the‑shelf LLMs as reference judges. It estimates each reference's expertise by checking how strongly it disagrees with intentionally wrong (IW) answers and agrees with their corrected (CO) versions, and it penalizes 'lazy' answers that repeat similar responses across nearby questions. The method comes with a theoretical guarantee (in expectation) of preferring the least-hallucinating model, improves automatic measurement accuracy on CHALE, Truthful-QA, and HaluEval over naive baselines, is cheap to run, and can guide in-context learning and label-free fine-tuning to reduce hallucination.
Problem Statement
Measuring LLM hallucination usually needs human-written gold answers. That is costly and error-prone. The goal: build an automatic, low-cost hallucination metric that works when no gold answers exist and can also help reduce hallucination.
Main Contribution
FEWL metric: weighs multiple off‑the‑shelf LLMs to score an answer's factualness without human gold answers.
Expertise estimation: infer each reference LLM's reliability by measuring disagreement with intentionally wrong answers and agreement with corrected variants.
Laziness penalty: penalize reference LLMs that give similar answers across topic-nearby questions to avoid relying on superficial patterns.
Theoretical guarantee: under mild assumptions FEWL tends to rank the least-hallucinating model highest in expectation.
Practical uses: FEWL improves automated measurement accuracy and can guide in-context learning and label-free supervised fine-tuning.
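The contributions above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact formulation: the token-overlap similarity, the softmax weighting, the `penalty` coefficient, and every function name here are assumptions chosen for readability. It shows the general shape only: references are weighted by how much more they agree with corrected (CO) answers than with intentionally wrong (IW) ones, and a candidate answer loses score for resembling reference answers to neighboring questions.

```python
import math
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): token-set Jaccard overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def expertise_weights(judgments: List[Tuple[float, float]],
                      temperature: float = 1.0) -> List[float]:
    """Turn (CO agreement, IW agreement) pairs into normalized weights.

    Each pair holds one reference LLM's agreement rate with corrected
    answers and with intentionally wrong answers. The softmax over
    (co - iw) is a hypothetical weighting rule used for illustration.
    """
    raw = [co - iw for co, iw in judgments]
    exps = [math.exp(r / temperature) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def fewl_like_score(answer: str,
                    ref_answers: List[str],
                    weights: List[float],
                    neighbor_ref_answers: List[List[str]],
                    penalty: float = 0.5) -> float:
    """Weighted agreement with references minus a laziness penalty.

    neighbor_ref_answers[i] holds reference i's answers to nearby
    questions; resembling those suggests a generic, 'lazy' answer.
    """
    agreement = sum(w * jaccard(answer, r)
                    for w, r in zip(weights, ref_answers))
    laziness = 0.0
    for w, neigh in zip(weights, neighbor_ref_answers):
        if neigh:
            mean_sim = sum(jaccard(answer, n) for n in neigh) / len(neigh)
            laziness += w * mean_sim
    return agreement - penalty * laziness
```

In practice the similarity function would likely be a semantic judge (for example, an LLM or embedding comparison) rather than token overlap; the structure of the score stays the same.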
Key Findings
FEWL gives more accurate hallucination scores than simple baselines on CHALE.
FEWL achieves near-perfect measurement on a large benchmark.
Label-free supervised fine-tuning guided by FEWL improves model outputs.
FEWL is much cheaper than hiring human annotators for hallucination evaluation.
FEWL has a theoretical guarantee to prefer the best model in expectation.
Results
Accuracy: CHALE, non-hallucinated vs. hallucinated answers (FEWL with GPT-3.5)
Accuracy: Truthful-QA, count of 'best' answers ranked highest
SFT
Who Should Care
What To Try In 7 Days
Run FEWL on a problem area where you lack gold labels to spot common hallucination topics.
Use FEWL-selected examples as in-context prompts to re-run generation and compare outputs.
Build a small label-free SFT pipeline: fine-tune a small model on FEWL-selected best answers and evaluate improvements.
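The second and third steps above both reduce to picking the top-scored answer per question and reusing it. The sketch below assumes a scoring callable `score_fn(question, answer)` that returns a FEWL-style score; that callable, the prompt template, and the `prompt`/`completion` record layout are all hypothetical placeholders, not something the paper prescribes.

```python
from typing import Callable, Dict, List


def select_best_answers(candidates: Dict[str, List[str]],
                        score_fn: Callable[[str, str], float]) -> Dict[str, str]:
    """Keep the highest-scoring candidate answer for each question."""
    return {
        question: max(answers, key=lambda a: score_fn(question, a))
        for question, answers in candidates.items()
        if answers  # skip questions with no candidates
    }


def build_icl_prompt(selected: Dict[str, str], new_question: str, k: int = 3) -> str:
    """Format up to k selected Q/A pairs as in-context examples."""
    shots = list(selected.items())[:k]
    blocks = [f"Q: {q}\nA: {a}" for q, a in shots]
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


def to_sft_records(selected: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert selected pairs into simple prompt/completion records."""
    return [{"prompt": q, "completion": a} for q, a in selected.items()]
```

Comparing generations with and without the selected in-context examples, or before and after fine-tuning on the records, gives a quick signal on whether the selection helps.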
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Needs at least one reasonably expert reference LLM; if all references lack domain knowledge FEWL fails.
- Slower than directly comparing to gold answers because IW/CO generation and neighbor searches add queries.
- Laziness penalty depends on good neighbor-question selection; small or noisy datasets reduce effectiveness.
When Not To Use
- When no reference LLM has any expertise on the domain or topic.
- When you need single-query, ultra-low-latency scoring in production.
- When you can afford high-quality human gold labels for the target task.
Failure Modes
- Reference LLM judge bias: a powerful but biased reference can systematically mis-rank answers.
- Poor IW/CO generation: synthetic wrongs or corrections that are low-quality can misestimate expertise weights.
- Neighbor questions that are too similar or too dissimilar can make the laziness penalty misfire.
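One way to guard against the last failure mode is to keep only neighbor questions whose similarity to the target falls inside a band. The sketch below is an assumption-laden illustration: the token-overlap similarity and the `low`/`high` thresholds are placeholders to be tuned per dataset, and the paper may select neighbors differently.

```python
from typing import Callable, List


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): token-set Jaccard overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_neighbors(question: str,
                     pool: List[str],
                     sim: Callable[[str, str], float] = token_jaccard,
                     low: float = 0.2,
                     high: float = 0.8,
                     k: int = 3) -> List[str]:
    """Return up to k pool questions with similarity in [low, high].

    Near-duplicates (above high) and unrelated questions (below low)
    are dropped so the laziness penalty compares against questions
    that are related but not the same.
    """
    scored = [(sim(question, q), q) for q in pool if q != question]
    in_band = [(s, q) for s, q in scored if low <= s <= high]
    in_band.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in in_band[:k]]
```

If too few questions survive the band, that is itself a hint the dataset is too small or too noisy for the penalty to be reliable.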
Core Entities
Models
- GPT-3.5
- GPT-4
- Falcon-7B
- Flan-t5-large
- Flan-alpaca-base
- Flan-alpaca-large
- LLaMA
- OPT-1.3B
- Text-davinci-003
- Flan-t5-base
- GPT-35-turbo
Metrics
- FEWL score
- Accuracy
- SFT
Datasets
- CHALE
- Truthful-QA
- HaluEval
Benchmarks
- Truthful-QA multiple-choice
- CHALE QA evaluation
- HaluEval QA

