Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications

Overview

Decision Snapshot: Needs Validation

ICD is a practical decoding-time method that improves factuality on the tested benchmarks; gains are well supported but depend on induced-data scale and add inference cost.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ICD reduces factual errors at decoding time without retraining the main model. It can improve user trust in QA and content-generation pipelines, at the cost of a second forward pass per token (about 1.6x generation latency as reported).

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper introduces Induce-then-Contrast Decoding (ICD). First, the authors build a deliberately ‘weak’ model that tends to fabricate facts, by fine-tuning on synthetic or real non-factual examples. Then, during decoding, they contrast the original model with this weak model and down-weight tokens that the hallucination-prone model favors. ICD improves truthfulness on TruthfulQA and factual precision on FACTSCORE across multiple open LLMs and sizes, at the cost of ~1.6x generation latency. Code and data are provided.

Problem Statement

Large language models sometimes generate false facts (hallucinations). Changing pretraining or supervised fine-tuning is costly and can backfire. The paper asks: can a decoding-time method reduce factual errors by constructing a model that hallucinates and using it as a penalty during generation?

Main Contribution

Propose ICD: build a factually weak model (induced hallucinations) and apply contrastive decoding to penalize hallucinated token probabilities.

Show ICD improves truthfulness on discriminative QA (TruthfulQA) and factual precision on generation (FACTSCORE) across Llama2, Baichuan2, and Mistral.
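The contrast step described above can be sketched for a single decoding position as follows. This is a minimal illustration of generic contrastive decoding, not the authors' implementation: the function names, the `alpha` weight, the `plausibility` cutoff, and the exact scoring rule are assumptions made for this sketch, and the paper's formulation may differ.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_scores(base_logits, weak_logits, alpha=1.0, plausibility=0.1):
    """Score tokens by amplifying the base model and penalizing the weak model.

    alpha and plausibility are hypothetical knobs for this sketch.
    """
    base_lp = log_softmax(base_logits)
    weak_lp = log_softmax(weak_logits)
    # Only keep tokens the base model itself finds reasonably likely, so that
    # the penalty cannot promote tokens the base model considers implausible.
    cutoff = max(base_lp) + math.log(plausibility)
    scores = []
    for b, w in zip(base_lp, weak_lp):
        if b < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1.0 + alpha) * b - alpha * w)
    return scores

def pick_token(base_logits, weak_logits, alpha=1.0, plausibility=0.1):
    """Greedy choice over the contrasted scores."""
    scores = contrast_scores(base_logits, weak_logits, alpha, plausibility)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary of three tokens: the base model slightly prefers token 0,
# but the hallucination-prone model strongly prefers it, so token 1 wins.
base = [2.0, 1.9, -3.0]
weak = [3.0, 0.0, 0.0]
chosen = pick_token(base, weak)
```

In practice both logit vectors would come from running the original model and the induced-hallucination model on the same prefix, which is where the second forward pass comes from.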

Key Findings

ICD (fine-tuning-based induction) raises Llama2-7B-Chat TruthfulQA MC1 by 8.70 points, MC2 by 14.48, and MC3 by 13.13

Numbers: MC1 +8.70; MC2 +14.48; MC3 +13.13 (Table 1)

Practical Use: If you add a 7B hallucination model and apply ICD, expect substantial QA truthfulness gains on the evaluated benchmarks.

Evidence Ref: Table 1, §4.2

ICD improves factual precision on FACTSCORE from 63.8 to 66.3 for Llama2-7B-Chat (+2.5 points)

Numbers: FACTSCORE score +2.5 (63.8 → 66.3) (Table 2)

Practical Use: For biography-like generation, ICD increases factual accuracy without increasing response rate or shrinking fact counts.

Evidence Ref: Table 2, §4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TruthfulQA (Llama2-7B-Chat) MC1	37.62 → 46.32 (+8.70)	greedy decoding	+8.70	TruthfulQA	Table 1 main results	Table 1, §4.2
TruthfulQA (Llama2-7B-Chat) MC2	54.60 → 69.08 (+14.48)	greedy decoding	+14.48	TruthfulQA	Table 1 main results	Table 1, §4.2

What To Try In 7 Days

Fine-tune a small 'anti-expert' with LoRA on 1–10k synthetic hallucinated samples and run ICD at inference.

Measure truthfulness on a small domain set (TruthfulQA subset or internal QA checks) before/after.

If latency matters, contrast the main model with a smaller anti-model to cut cost and re-check quality/latency tradeoffs.
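The before/after measurement in the steps above can be sketched as a small MC1-style check. It assumes you have already scored every answer choice (for example, with a summed log-likelihood) under both plain decoding and ICD; the function names and the data layout are hypothetical and chosen for illustration only.

```python
def mc1_accuracy(items):
    """Percentage of questions where the top-scored choice is the correct one.

    items: list of (scores, correct_index) pairs, where scores holds one
    number per answer choice (higher means the model prefers that choice).
    """
    if not items:
        raise ValueError("need at least one scored question")
    hits = 0
    for scores, correct_index in items:
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == correct_index)
    return 100.0 * hits / len(items)

def compare(baseline_items, icd_items):
    """Return (before, after, delta) in percentage points."""
    before = mc1_accuracy(baseline_items)
    after = mc1_accuracy(icd_items)
    return before, after, after - before

# Toy scores for four questions; real values would come from your models.
baseline_items = [([-1.0, -2.0], 0), ([-1.0, -0.5], 0),
                  ([-3.0, -1.0], 1), ([-0.2, -0.1], 0)]
icd_items = [([-1.0, -2.0], 0), ([-0.4, -0.5], 0),
             ([-3.0, -1.0], 1), ([-0.2, -0.1], 0)]
before, after, delta = compare(baseline_items, icd_items)
```

Running the same comparison on an internal QA set, alongside a latency measurement, gives a quick read on whether the gain justifies the extra forward pass.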

Optimization Features

Training Optimization

LoRA

Inference Optimization

Contrastive decoding doubles forward passes; can use smaller anti-model
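To see why a smaller anti-model helps, here is a rough back-of-the-envelope estimate. It assumes per-token forward compute scales linearly with parameter count, which is a common approximation and not a figure from the paper; it estimates compute, not wall-clock latency, so it will not reproduce the ~1.6x latency reported. The function name and the model pairings are illustrative.

```python
def relative_compute(main_params_b, anti_params_b):
    """Per-token compute of main + anti-model, relative to the main model alone.

    Assumes compute per forward pass is proportional to parameter count
    (in billions); this is an approximation for illustration only.
    """
    if main_params_b <= 0 or anti_params_b < 0:
        raise ValueError("parameter counts must be positive")
    return (main_params_b + anti_params_b) / main_params_b

same_size = relative_compute(7, 7)     # 7B main model with a 7B anti-model
small_anti = relative_compute(70, 7)   # 70B main model with a 7B anti-model
```

Under this assumption, pairing a large main model with a much smaller anti-model adds only a small fraction of extra compute, though quality should be re-checked because a mismatched anti-model may penalize the wrong tokens.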

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hillzhang1999/ICD

Data URLs

https://github.com/hillzhang1999/ICD

HaluEval (Li et al. 2023) and TruthfulQA/FACTSCORE (public benchmarks referenced)

Risks & Boundaries

Limitations

Adds inference cost: contrastive decoding runs two forward passes (~1.6x latency).

Evaluated on two domains (TruthfulQA, FACTSCORE); generality to all tasks is unproven.

When Not To Use

When strict low-latency requirements rule out extra forward passes.

When you cannot obtain any model logits (black-box API only).

Failure Modes

If the hallucination model is poorly matched, ICD can penalize useful tokens and harm quality.

Direct fine-tuning on factual data without contrast can increase hallucinations and response ratio.

Core Entities

Models

Llama2-7B-Chat, Llama2-7B-Base, Llama2-13B-Chat, Llama2-70B-Chat, Mistral-7B-Instruct, Baichuan2-7B-Chat, ChatGPT, GPT4

Metrics

MC1, MC2, MC3, FACTSCORE (%response, #facts, score), MMLU 5-shot, ARC 5-shot, AlpacaEval2.0 win rate vs GPT-4-turbo

Datasets

TruthfulQA, FACTSCORE, HaluEval, Wikipedia (for BIOs)

Benchmarks

TruthfulQA, FACTSCORE, MMLU, ARC, AlpacaEval2.0
