Train detectors by teaching models with high-quality fake answers

May 23, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Shrey Pandit, Ashwin Vinod, Liu Leqi, Ying Ding

Links

Abstract / PDF

Why It Matters For Business

You can build smaller, cheaper detectors that match or beat larger models by training with vetted synthetic hallucinations and a difficulty curriculum, reducing deployment cost and improving safety oversight.

Summary TLDR

This paper trains small hallucination detectors by using deliberately crafted, high-quality hallucinated answers as negative examples in Direct Preference Optimization (DPO). Negatives are ranked by a fact-checker (MiniCheck) and fed in an easy-to-hard curriculum. HaluCheck detectors (Llama-3.2 1B/3B + LoRA, DPO) beat same-size and many larger baselines on MedHallu and HaluEval (up to +24% relative gains on hard benchmarks) and show consistent zero-shot gains on external QA sets. Key caveats: the method depends on MiniCheck scores and is evaluated mainly on QA-style hallucinations.

Problem Statement

High-quality hallucinations are hard to detect, and detectors trained on naive negative examples (random failed outputs) learn little about subtler falsehoods because those negatives are too easy. The authors ask: can we train small detectors more effectively by replacing low-quality negatives with vetted synthetic hallucinations and presenting them in order of increasing difficulty during DPO alignment?

Main Contribution

Curriculum DPO: apply Direct Preference Optimization with a curriculum that moves from easier to harder hallucinated negatives ranked by MiniCheck.

HaluCheck detectors: two lightweight detectors (Llama-3.2 1B and 3B with LoRA) trained with curated hallucinations and DPO.

Empirical validation: show curated negatives + curriculum improve detection, transfer to zero-shot QA benchmarks, and beat larger baselines on MedHallu and HaluEval.
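The easy-to-hard ordering described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Negative` class, the `curriculum_stages` helper, the three-stage split, and the toy scores are all hypothetical. It also assumes that a higher verifier score means a more plausible, and therefore harder, hallucination.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Negative:
    question: str
    hallucinated_answer: str
    grounding_score: float  # verifier probability that the answer is supported


def curriculum_stages(negatives: List[Negative], num_stages: int = 3) -> List[List[Negative]]:
    """Sort negatives from easy to hard and split them into roughly equal stages.

    Assumption: a higher grounding score means the hallucination looks more
    plausible to the verifier and is therefore harder to detect.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(negatives, key=lambda n: n.grounding_score)
    stages: List[List[Negative]] = [[] for _ in range(num_stages)]
    for i, neg in enumerate(ordered):
        # Map position i in the sorted list to a stage index.
        stage = min(i * num_stages // max(len(ordered), 1), num_stages - 1)
        stages[stage].append(neg)
    return stages


# Toy data with made-up scores, for illustration only.
toy = [
    Negative("q1", "a1", 0.41),
    Negative("q2", "a2", 0.05),
    Negative("q3", "a3", 0.27),
    Negative("q4", "a4", 0.12),
    Negative("q5", "a5", 0.33),
    Negative("q6", "a6", 0.19),
]
for stage_idx, stage in enumerate(curriculum_stages(toy)):
    print(stage_idx, [n.grounding_score for n in stage])
```

Training would then consume stage 0 first and the last stage at the end. How many stages to use and how long to stay in each are tuning choices this sketch does not address.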

Key Findings

Curriculum DPO with curated hallucinated negatives lifts MedHallu F1 to 0.759 for HaluCheck 3B.

Numbers: MedHallu F1 = 0.759 (HaluCheck 3B)

HaluCheck 3B achieves 0.753 F1 on HaluEval and outperforms the Llama-3.2 3B baseline.

Numbers: HaluEval F1 = 0.753 (HaluCheck 3B)

Zero-shot accuracy improves on external QA sets: HaluCheck 3B avg 59.16% vs Llama-3.2 3B 54.6%.

Numbers: Zero-shot Avg Acc = 59.16% vs 54.6%

Curriculum ordering beats random negative sampling: HaluCheck 3B F1 75.9 vs random 69.4 (absolute +6.5 points).

Numbers: F1 = 75.9 (curriculum) vs 69.4 (random)

Curated hallucinated negatives are more grounded (harder to detect) than standard negatives: hard-tier mean MiniCheck 0.391 vs 0.248.

Numbers: MiniCheck mean (hard tier) = 0.391 vs 0.248

Results

MedHallu F1 (HaluCheck 3B)

Value: 0.759

Baseline: Llama-3.2 3B

HaluEval F1 (HaluCheck 3B)

Value: 0.753

Baseline: Llama-3.2 3B (0.726)

Zero-shot average accuracy (HaluCheck 3B)

Value: 59.16%

Baseline: Llama-3.2 3B (54.6%)

Curriculum vs Random (F1, HaluCheck 3B)

Value: 75.9 (curriculum)

Baseline: 69.4 (random)

Grounded factuality mean (hard tier)

Value: 0.391 (curated hallucinated negatives)

Baseline: 0.248 (standard negatives)
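The results mix absolute point differences with relative gains, so it helps to compute both from the same pair of numbers. The short script below does that for the three comparisons that report a baseline value; the `gains` helper is an illustrative name, and the inputs are the figures quoted in this summary.

```python
from typing import Tuple


def gains(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of new over old."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


# (HaluCheck 3B value, baseline value) as reported in this summary.
reported = {
    "HaluEval F1": (0.753, 0.726),
    "Zero-shot avg accuracy (%)": (59.16, 54.6),
    "Curriculum vs random F1": (75.9, 69.4),
}

for name, (new, old) in reported.items():
    absolute, relative = gains(new, old)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

For example, the 6.5-point curriculum gain over random sampling corresponds to roughly a 9.4% relative improvement.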

Who Should Care

What To Try In 7 Days

Run MiniCheck on your QA outputs to score grounding.

Collect high-quality hallucinated negatives from your datasets (keep MiniCheck score > 0.25).

Fine-tune a small backbone (e.g., Llama-3.2 3B) with LoRA under DPO using preference pairs (gold vs hallucination).
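The first two steps amount to scoring candidate negatives, filtering by a threshold, and emitting preference pairs. The sketch below shows that data preparation only. Everything in it is an assumption for illustration: `toy_overlap_score` is a crude stand-in for a real fact-checker such as MiniCheck (whose API is not shown here), the record fields are hypothetical, and the `prompt` / `chosen` / `rejected` keys follow a layout commonly used by DPO training tools, so check what your trainer expects.

```python
from typing import Callable, Dict, List


def toy_overlap_score(context: str, answer: str) -> float:
    """Hypothetical stand-in for a fact-checker: share of answer words found in the context."""
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    context_words = set(context.lower().split())
    return sum(w in context_words for w in answer_words) / len(answer_words)


def build_preference_pairs(
    records: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.25,
) -> List[Dict[str, object]]:
    """Keep hallucinations scoring above min_score and pair each with its gold answer."""
    pairs: List[Dict[str, object]] = []
    for rec in records:
        score = score_fn(rec["context"], rec["hallucinated_answer"])
        if score <= min_score:
            continue  # too obviously unsupported to be a useful negative
        pairs.append({
            "prompt": f"Context: {rec['context']}\nQuestion: {rec['question']}",
            "chosen": rec["gold_answer"],
            "rejected": rec["hallucinated_answer"],
            "grounding_score": score,
        })
    return pairs


# Toy records, invented for illustration.
records = [
    {
        "context": "Aspirin reduces fever and relieves mild pain.",
        "question": "What does aspirin do?",
        "gold_answer": "It reduces fever and relieves mild pain.",
        "hallucinated_answer": "Aspirin reduces fever and cures infections.",
    },
    {
        "context": "The Nile flows north into the Mediterranean Sea.",
        "question": "Where does the Nile flow?",
        "gold_answer": "North, into the Mediterranean Sea.",
        "hallucinated_answer": "Bananas grow underground.",
    },
]

pairs = build_preference_pairs(records, toy_overlap_score)
print(len(pairs), "pair(s) kept")
```

Keeping `grounding_score` on each pair makes it possible to order the pairs by difficulty later without re-scoring them.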

Optimization Features

Infra Optimization

  • 8-bit optimizer states and mixed-precision (FP16) during training

Model Optimization

  • LoRA
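To see why LoRA keeps fine-tuning cheap, count the parameters it adds. For a weight matrix of shape d_out × d_in, a rank-r adapter trains two small matrices, A (r × d_in) and B (d_out × r), instead of the full matrix. The layer size and rank below are hypothetical and are not taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the full weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only to illustrate the ratio.
d_in, d_out, rank = 2048, 2048, 8
adapter = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)
print(f"adapter params: {adapter}, frozen params: {frozen}, ratio: {adapter / frozen:.4%}")
```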

System Optimization

  • vLLM batching

Training Optimization

  • Curriculum sampling (easy→hard negatives)
  • DPO objective for preference optimization
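For reference, the standard per-pair DPO loss can be written out directly. The sketch below follows the usual formulation, in which the loss is the negative log-sigmoid of a scaled difference between how much the policy and a frozen reference model favor the chosen answer over the rejected one. The function name, the scalar inputs, and the `beta=0.1` default are illustrative assumptions and not the paper's settings.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Favoring the chosen answer more than the reference does lowers the loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

In a curriculum setup, the same loss is applied throughout; only the order in which pairs are fed to the trainer changes.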

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on MiniCheck fact-verifier; verifier biases/errors can propagate into the detector.
  • Evaluations focus on QA-style hallucinations; results may not transfer to dialog or summarization.
  • Treats hallucination detection as binary, so it misses partial/span-level hallucinations.
  • Risk of overfitting to adversarial patterns from MedHallu/HaluEval when training on one dataset alone.

When Not To Use

  • When no reliable external fact verifier exists for your domain.
  • When you need span-level or graded hallucination localization rather than binary labels.
  • For generation tasks (dialogue, summarization) without QA-style context grounding.

Failure Modes

  • False positives caused by verifier labeling errors or cultural/domain knowledge gaps.
  • Detector overfits to dataset-specific adversarial styles and loses generality.
  • Misses subtle partial hallucinations because labels are coarse.

Core Entities

Models

  • LoRA (adapter method applied to the Llama-3.2 backbones)
  • Llama-3.2 1B
  • Llama-3.2 3B
  • GPT-3.5-Turbo
  • GPT-4o
  • Qwen-2.5 (various sizes)

Metrics

  • F1
  • Precision
  • Accuracy
  • Recall
  • MiniCheck grounded factuality (probability score)
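Because detection is treated as a binary task, the first four metrics all follow from the confusion counts. The sketch below computes them in plain Python, taking label 1 to mean "hallucinated"; that label convention and the toy predictions are assumptions for illustration. The MiniCheck score is produced by the verifier model and is not computed here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1, and accuracy, with label 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels and predictions, invented for illustration (1 = hallucinated).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```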

Datasets

  • MedHallu
  • HaluEval
  • DROP
  • CovidQA
  • PubMedQA
  • HaluBench

Benchmarks

  • MedHallu
  • HaluEval
  • HaluBench zero-shot QA sets