Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications

December 25, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

4

Authors

Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi

Links

Abstract / PDF

Why It Matters For Business

ICD is a low-risk intervention to reduce factual errors at runtime without retraining the whole model; it can improve user trust in QA and content generation pipelines while requiring modest extra compute.

Summary TLDR

The paper introduces Induce-then-Contrast Decoding (ICD). The authors first build a deliberately ‘weak’ model that tends to fabricate facts, by fine-tuning on synthetic or real non-factual examples. At decoding time, they contrast the original model with this weak model and down-weight the tokens the weak model favors. ICD improves truthfulness on TruthfulQA and factual precision on FACTSCORE across multiple open LLMs and model sizes, at the cost of roughly 1.6x generation latency. Code and data are provided.
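To make the contrast step concrete, here is a minimal sketch of one decoding step over toy logits. It follows the general contrastive-decoding recipe rather than the paper's exact formulation: the weight `beta`, the plausibility cutoff `alpha`, the `(1 + beta)` scoring form, and the function names are all assumptions made for illustration, and the logits are invented.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_scores(base_logits, weak_logits, beta=1.0, alpha=0.1):
    """Score each token by boosting the base model and penalizing the weak one.

    beta (assumed): strength of the penalty from the hallucination-prone model.
    alpha (assumed): tokens whose base probability is below alpha times the
    top base probability are masked, so the penalty cannot promote tokens
    the base model itself considers implausible.
    """
    base_lp = log_softmax(base_logits)
    weak_lp = log_softmax(weak_logits)
    cutoff = max(base_lp) + math.log(alpha)
    scores = []
    for b, w in zip(base_lp, weak_lp):
        if b < cutoff:
            scores.append(float("-inf"))  # implausible under the base model
        else:
            scores.append((1.0 + beta) * b - beta * w)
    return scores

def pick_token(base_logits, weak_logits, beta=1.0, alpha=0.1):
    """Greedy choice over the contrasted scores."""
    scores = contrast_scores(base_logits, weak_logits, beta, alpha)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary of three tokens (made-up numbers).
base = [2.0, 1.9, -5.0]   # base model slightly prefers token 0
weak = [3.0, 0.0, 3.0]    # weak model strongly prefers tokens 0 and 2
chosen = pick_token(base, weak)
```

In this toy case the base model alone would pick token 0, but because the weak model also favors token 0, the contrasted score shifts the choice to token 1. Token 2 is masked by the plausibility cutoff even though penalizing the weak model would otherwise make it look attractive.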

Problem Statement

Large language models sometimes generate false facts (hallucinations). Changing pretraining or supervised fine-tuning is costly and can backfire. The paper asks: can a decoding-time method reduce factual errors by constructing a model that hallucinates and using it as a penalty during generation?

Main Contribution

Propose ICD: build a factually weak model (induced hallucinations) and apply contrastive decoding to penalize hallucinated token probabilities.

Show ICD improves truthfulness on discriminative QA (TruthfulQA) and factual precision on generation (FACTSCORE) across Llama2, Baichuan2, and Mistral.

Study induction methods: fine-tuning on 10k synthetic hallucinations works best; prompt-based induction and small sets of real failures help less.

Measure costs and limits: ICD increases latency (~1.6x) but preserves core task accuracy (MMLU, ARC, AlpacaEval2.0).

Key Findings

ICD (finetuning-based induction) raises Llama2-7B-Chat TruthfulQA MC1 by +8.70, MC2 by +14.48, MC3 by +13.13

Numbers: MC1 +8.70; MC2 +14.48; MC3 +13.13 (Table 1)

ICD improves factual precision on FACTSCORE from 63.8 to 66.3 for Llama2-7B-Chat (+2.5 points)

Numbers: FACTSCORE score +2.5 (63.8 → 66.3) (Table 2)

ICD benefits grow with model scale; Llama2-70B-Chat saw MC1 +13.34, MC2 +16.02, MC3 +16.75

Numbers: 70B: MC1 +13.34; MC2 +16.02; MC3 +16.75 (Table 5)

Fine-tuning induction (10k synthetic hallucinations) outperforms prompt-based induction and small real failure sets; 294 real samples beat 1k synthetic but not 10k synthetic

Numbers: Prompt-based: MC1 37.87 vs baseline 37.62; Real (294): MC1 39.22; Synthetic (10k): MC1 46.32 (Tables 1, 6)

ICD raises runtime cost: contrastive decoding requires two forward passes and increases latency by ~1.6x

Numbers: Latency increase ≈ 1.6x (Limitations)

Results

TruthfulQA (Llama2-7B-Chat) MC1

Value: 37.62 → 46.32 (+8.70)

Baseline: greedy decoding

TruthfulQA (Llama2-7B-Chat) MC2

Value: 54.60 → 69.08 (+14.48)

Baseline: greedy decoding

FACTSCORE (Llama2-7B-Chat) score

Value: 63.8 → 66.3 (+2.5)

Baseline: greedy decoding

TruthfulQA (Mistral-7B-Instruct) MC1

Value: 39.09 → 58.53 (+19.44)

Baseline: greedy decoding

Latency increase

Value: ≈1.6x

Baseline: greedy decoding
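As a quick sanity check on the figures above, the short script below recomputes each absolute gain from the reported before/after values and adds the relative gain over the greedy baseline. The dictionary simply restates the numbers listed in this section; the helper name `gains` is an illustrative choice.

```python
# Before/after values copied from the Results entries above.
RESULTS = {
    "TruthfulQA MC1 (Llama2-7B-Chat)": (37.62, 46.32),
    "TruthfulQA MC2 (Llama2-7B-Chat)": (54.60, 69.08),
    "FACTSCORE score (Llama2-7B-Chat)": (63.8, 66.3),
    "TruthfulQA MC1 (Mistral-7B-Instruct)": (39.09, 58.53),
}

def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 2)
    relative_pct = round(100.0 * (after - before) / before, 1)
    return absolute, relative_pct

for name, (before, after) in RESULTS.items():
    absolute, relative_pct = gains(before, after)
    print(f"{name}: +{absolute} points ({relative_pct}% over baseline)")
```

The absolute gains reproduce the deltas quoted in parentheses, and the relative figures make it easier to compare a small shift on FACTSCORE against the larger shifts on TruthfulQA.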

Who Should Care

What To Try In 7 Days

Fine-tune a small 'anti-expert' with LoRA on 1–10k synthetic hallucinated samples and run ICD at inference.

Measure truthfulness on a small domain set (TruthfulQA subset or internal QA checks) before/after.

If latency matters, contrast the main model with a smaller anti-model to cut cost and re-check quality/latency tradeoffs.
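For the before/after measurement, an MC1-style accuracy is simple to compute once each answer choice has a score (for example, a log-likelihood under the decoding setup being tested). The sketch below assumes a hypothetical input format of `(choice_scores, correct_index)` pairs; the function name and the sample numbers are made up for illustration.

```python
def mc1_accuracy(items):
    """Percentage of questions where the top-scored choice is the correct one.

    items: list of (choice_scores, correct_index) pairs, where choice_scores
    holds one score per answer choice (higher means more preferred).
    """
    if not items:
        raise ValueError("need at least one question")
    hits = 0
    for scores, correct_index in items:
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == correct_index)
    return 100.0 * hits / len(items)

# Invented scores for two questions, scored with and without the contrast step.
baseline_items = [([-1.0, -2.0], 0), ([-3.0, -1.0], 0)]
icd_items = [([-1.0, -2.0], 0), ([-0.5, -1.0], 0)]

baseline_mc1 = mc1_accuracy(baseline_items)
icd_mc1 = mc1_accuracy(icd_items)
print(f"MC1 baseline={baseline_mc1:.1f} ICD={icd_mc1:.1f} delta={icd_mc1 - baseline_mc1:+.1f}")
```

Running the same question set through both setups and comparing the two numbers gives a like-for-like view of whether the anti-expert is helping on your own domain.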

Optimization Features

Training Optimization

  • LoRA

Inference Optimization

  • Contrastive decoding doubles forward passes; can use smaller anti-model

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Adds inference cost: contrastive decoding runs two forward passes (~1.6x latency).
  • Evaluated on two domains (TruthfulQA, FACTSCORE); generality to all tasks is unproven.
  • Best results require thousands of induced hallucination samples; smaller real failure sets help but may need scale.

When Not To Use

  • When strict low-latency requirements rule out extra forward passes.
  • When you cannot obtain any model logits (black-box API only).
  • When you lack budget or tooling to generate or curate thousands of hallucination examples.

Failure Modes

  • If the hallucination model is poorly matched, ICD can penalize useful tokens and harm quality.
  • Direct fine-tuning on factual data without contrast can increase hallucinations and response ratio.
  • Reversed-contrast setups can produce fluent but fully fabricated text.

Core Entities

Models

  • Llama2-7B-Chat
  • Llama2-7B-Base
  • Llama2-13B-Chat
  • Llama2-70B-Chat
  • Mistral-7B-Instruct
  • Baichuan2-7B-Chat
  • ChatGPT
  • GPT4

Metrics

  • MC1
  • MC2
  • MC3
  • FACTSCORE (%response, #facts, score)
  • MMLU 5-shot
  • ARC 5-shot
  • AlpacaEval2.0 win rate vs GPT-4-turbo

Datasets

  • TruthfulQA
  • FACTSCORE
  • HaluEval
  • Wikipedia (for BIOs)

Benchmarks

  • TruthfulQA
  • FACTSCORE
  • MMLU
  • ARC
  • AlpacaEval2.0