Train detectors by teaching models with high-quality fake answers

May 23, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is practical: small models gain large accuracy improvements when naive negatives are swapped for vetted hallucinations and training follows a staged curriculum. Evidence comes from multi-benchmark and ablation results, but it is centered on QA tasks and depends on MiniCheck.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding


Why It Matters For Business

You can build smaller, cheaper detectors that match or beat larger models by training with vetted synthetic hallucinations and a difficulty curriculum, reducing deployment cost and improving safety oversight.

Summary TLDR

This paper trains small hallucination detectors by using deliberately crafted, high-quality hallucinated answers as negative examples in Direct Preference Optimization (DPO). Negatives are ranked by a fact-checker (MiniCheck) and fed in an easy-to-hard curriculum. HaluCheck detectors (Llama-3.2 1B/3B + LoRA, DPO) beat same-size and many larger baselines on MedHallu and HaluEval (up to +24% relative gains on hard benchmarks) and show consistent zero-shot gains on external QA sets. Key caveats: the method depends on MiniCheck scores and is evaluated mainly on QA-style hallucinations.
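
For readers new to DPO, the per-pair loss can be sketched as follows. This is the standard DPO objective from the general literature, not code from the paper; the function name `dpo_pair_loss`, the `beta` default, and the log-probability inputs are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed token log-probability of a response under
    either the trainable policy or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as a numerically stable softplus(-logits).
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference on both responses, the loss equals log 2. It falls as the policy shifts probability toward the chosen response and away from the rejected one.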

Problem Statement

Detecting high-quality hallucinations is hard because naive negative examples (random failed outputs) are too easy and do not teach subtler falsehoods. The authors ask: can we train small detectors more effectively by replacing low-quality negatives with vetted synthetic hallucinations and presenting them in increasing difficulty during DPO alignment?

Main Contribution

Curriculum DPO: apply Direct Preference Optimization with a curriculum that moves from easier to harder hallucinated negatives ranked by MiniCheck.

HaluCheck detectors: two lightweight detectors (Llama-3.2 1B and 3B with LoRA) trained with curated hallucinations and DPO.
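
The easy-to-hard ordering can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Negative` record, the `build_curriculum` helper, and the number of stages are hypothetical. It also assumes that a hallucinated answer the verifier scores as more grounded is harder to detect, so stages are ordered by ascending score.

```python
from dataclasses import dataclass


@dataclass
class Negative:
    """Hypothetical record for one hallucinated negative."""
    question: str
    hallucinated_answer: str
    minicheck_score: float  # assumed: verifier grounding probability in [0, 1]


def build_curriculum(negatives, num_stages=3):
    """Split negatives into stages ordered from easy to hard.

    Assumption: a higher verifier score means a more convincing (harder)
    hallucination, so low-scoring negatives come first.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(negatives, key=lambda n: n.minicheck_score)
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size]
            for i in range(0, len(ordered), stage_size)]
```

Training would then iterate over the returned stages in order, so the detector sees the most obvious hallucinations first and the most convincing ones last.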

Key Findings

Curriculum DPO with curated hallucinated negatives lifts MedHallu F1 to 0.759 for HaluCheck 3B.

Numbers: MedHallu F1 = 0.759 (HaluCheck 3B)

Practical Use: Use difficulty-ranked hallucinations and DPO to substantially improve a small (3B) detector for in-domain medical QA.

Evidence Ref: Table 1, Sec. 5.1

HaluCheck 3B achieves 0.753 F1 on HaluEval and outperforms the Llama-3.2 3B baseline.

Numbers: HaluEval F1 = 0.753 (HaluCheck 3B)

Practical Use: The method generalizes to another benchmark and improves cross-benchmark detection accuracy.

Evidence Ref: Table 1, Sec. 5.1
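
Both findings are reported as F1. For reference, F1 for a binary detector can be computed as in the sketch below. The helper `binary_prf1` is hypothetical, and treating "hallucinated" as the positive class (label 1) is an assumption; this summary does not state the paper's exact averaging convention.

```python
def binary_prf1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class.

    Assumption: label 1 marks a hallucinated answer.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```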

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MedHallu F1 (HaluCheck 3B) | 0.759 | Llama-3.2 3B | reported +26% relative vs baseline in-text | MedHallu | Table 1, Sec. 5.1 | Table 1
HaluEval F1 (HaluCheck 3B) | 0.753 | Llama-3.2 3B (0.726) | absolute +0.027 F1 | HaluEval | Table 1, Sec. 5.1 | Table 1
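
The two rows report deltas in different units (relative vs absolute). As a quick check, the HaluEval row can be put on both scales using only the values listed above. The helper name `gains` is illustrative, and the MedHallu row is not recomputed because its baseline F1 is not given here.

```python
def gains(new, baseline):
    """Return (absolute, relative) improvement of `new` over `baseline`."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative


# HaluEval F1: HaluCheck 3B (0.753) vs Llama-3.2 3B baseline (0.726)
abs_gain, rel_gain = gains(0.753, 0.726)
print(f"absolute: {abs_gain:+.3f}, relative: {rel_gain:+.1%}")
# prints "absolute: +0.027, relative: +3.7%"
```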

What To Try In 7 Days

Run MiniCheck on your QA outputs to score grounding.

Collect high-quality hallucinated negatives from your datasets (keep MiniCheck score > 0.25).

Fine-tune a small backbone (e.g., Llama-3.2 3B) with LoRA under DPO using preference pairs (gold vs hallucination).
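
The data-preparation part of these steps can be sketched as below, under stated assumptions. `score_fn` is a hypothetical stand-in for whatever wrapper you put around MiniCheck (its real API is not shown here), and the record fields and prompt template are illustrative. The `prompt`/`chosen`/`rejected` keys follow a common convention in preference-tuning toolkits and are an assumption, not the authors' format.

```python
def build_preference_pairs(records, score_fn, min_score=0.25):
    """Build DPO-style preference pairs (gold vs hallucination).

    records:  dicts with "question", "context", "gold_answer",
              "hallucinated_answer" (hypothetical field names).
    score_fn: callable (context, answer) -> grounding score in [0, 1];
              a stand-in for a MiniCheck wrapper, not its real API.
    Only negatives scoring above `min_score` are kept, matching the
    "> 0.25" threshold in the steps above.
    """
    pairs = []
    for rec in records:
        score = score_fn(rec["context"], rec["hallucinated_answer"])
        if score <= min_score:
            continue  # too obviously ungrounded to be a useful negative
        pairs.append({
            # Illustrative prompt template, not taken from the paper.
            "prompt": f"Context: {rec['context']}\nQuestion: {rec['question']}",
            "chosen": rec["gold_answer"],
            "rejected": rec["hallucinated_answer"],
            "score": score,
        })
    return pairs
```

Keeping the score on each pair makes it straightforward to sort the pairs by difficulty later if you also want a curriculum.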

Optimization Features

Infra Optimization: 8-bit optimizer states and mixed-precision (FP16) during training
Model Optimization: LoRA
System Optimization: vLLM batching
Training Optimization: Curriculum sampling (easy→hard negatives); DPO objective for preference optimization

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on MiniCheck fact-verifier; verifier biases/errors can propagate into the detector.

Evaluations focus on QA-style hallucinations; results may not transfer to dialog or summarization.

When Not To Use

When no reliable external fact verifier exists for your domain.

When you need span-level or graded hallucination localization rather than binary labels.

Failure Modes

False positives caused by verifier labeling errors or cultural/domain knowledge gaps.

Detector overfits to dataset-specific adversarial styles and loses generality.

Core Entities

Models

LoRA, Llama-3.2 1B, Llama-3.2 3B, GPT-3.5-Turbo, GPT-4o, Qwen-2.5 (various sizes)

Metrics

F1, Precision, Accuracy, Recall, MiniCheck grounded factuality (probability score)

Datasets

MedHallu, HaluEval, DROP, CovidQA, PubMedQA, HaluBench

Benchmarks

MedHallu, HaluEval, HaluBench zero-shot QA sets