Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
4
Why It Matters For Business
ICD is a low-risk intervention to reduce factual errors at runtime without retraining the whole model; it can improve user trust in QA and content generation pipelines while requiring modest extra compute.
Summary TLDR
The paper introduces Induce-then-Contrast Decoding (ICD). First, the authors build a deliberately ‘weak’ model that tends to fabricate facts, by fine-tuning on synthetic or real non-factual examples. Then, during decoding, they contrast the original model with this weak model and down-weight the tokens that the weak (hallucination-prone) model favors. ICD improves truthfulness on TruthfulQA and factual precision on FACTSCORE across multiple open LLMs and sizes, at the cost of roughly 1.6x generation latency. Code and data are provided.
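The contrast step described above can be sketched for a single decoding step as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the `alpha` contrast weight, and the `plausibility` cutoff are assumptions chosen for the example, and the logits are toy values rather than real model outputs.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_scores(base_logits, weak_logits, alpha=1.0, plausibility=0.1):
    """Score each token by rewarding the base model and penalizing the weak model.

    alpha and plausibility are hypothetical knobs for this sketch:
    - alpha controls how strongly tokens favored by the weak model are penalized
    - plausibility masks tokens the base model itself finds very unlikely, so the
      contrast cannot promote implausible tokens
    """
    base_lp = log_softmax(base_logits)
    weak_lp = log_softmax(weak_logits)
    cutoff = max(base_lp) + math.log(plausibility)
    scores = []
    for b, w in zip(base_lp, weak_lp):
        if b < cutoff:
            scores.append(float("-inf"))  # masked: base model finds it implausible
        else:
            scores.append((1 + alpha) * b - alpha * w)
    return scores

def pick_token(base_logits, weak_logits, **kwargs):
    """Greedy choice: index of the highest contrastive score."""
    scores = contrastive_scores(base_logits, weak_logits, **kwargs)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary of three tokens. The base model slightly prefers token 0,
# but the weak (hallucination-prone) model strongly prefers it too.
base = [2.0, 1.9, -5.0]
weak = [3.0, 0.0, 0.0]
chosen = pick_token(base, weak)
```

With `alpha=0` the scores reduce to the base model's own log-probabilities, so the sketch falls back to ordinary greedy decoding. In a real setup both logit vectors would come from one forward pass each through the original and the weak model, which is where the extra latency comes from.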
Problem Statement
Large language models sometimes generate false facts (hallucinations). Changing pretraining or supervised fine-tuning is costly and can backfire. The paper asks: can a decoding-time method reduce factual errors by constructing a model that hallucinates and using it as a penalty during generation?
Main Contribution
Propose ICD: build a factually weak model (induced hallucinations) and apply contrastive decoding to penalize hallucinated token probabilities.
Show ICD improves truthfulness on discriminative QA (TruthfulQA) and factual precision on generation (FACTSCORE) across Llama2, Baichuan2, and Mistral.
Study induction methods: fine-tuning on synthetic hallucinations (10k samples) works best; prompt-based induction and small sets of real failures help less.
Measure costs and limits: ICD increases latency (~1.6x) but preserves core task accuracy (MMLU, ARC, AlpacaEval2.0).
Key Findings
ICD (fine-tuning-based induction) raises Llama2-7B-Chat TruthfulQA scores: MC1 by +8.70, MC2 by +14.48, MC3 by +13.13
ICD improves factual precision on FACTSCORE from 63.8 to 66.3 for Llama2-7B-Chat (+2.5 points)
ICD benefits grow with model scale; Llama2-70B-Chat saw MC1 +13.34, MC2 +16.02, MC3 +16.75
Fine-tuning induction (10k synthetic hallucinations) outperforms prompt-based induction and small real failure sets; 294 real samples beat 1k synthetic but not 10k synthetic
ICD raises runtime cost: contrastive decoding requires two forward passes and increases latency by ~1.6x
Results
TruthfulQA (Llama2-7B-Chat) MC1
TruthfulQA (Llama2-7B-Chat) MC2
FACTSCORE (Llama2-7B-Chat) score
TruthfulQA (Mistral-7B-Instruct) MC1
Latency increase
Who Should Care
What To Try In 7 Days
Fine-tune a small 'anti-expert' with LoRA on 1–10k synthetic hallucinated samples and run ICD at inference.
Measure truthfulness on a small domain set (TruthfulQA subset or internal QA checks) before/after.
If latency matters, contrast the main model with a smaller anti-model to cut cost and re-check quality/latency tradeoffs.
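The before/after measurement in the steps above can be sketched as a small scoring helper. It assumes an MC1-style setup where each question has exactly one correct choice and the model assigns a score (for example a log-probability) to every choice. The function names and the data layout are hypothetical, and the numbers are toy values, not results from the paper.

```python
def mc1_accuracy(items):
    """Fraction of questions where the top-scored choice is the correct one.

    items: list of (scores, correct_index) pairs, where scores holds one
    model score per answer choice (hypothetical layout for this sketch).
    """
    if not items:
        raise ValueError("need at least one scored question")
    hits = 0
    for scores, correct_index in items:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == correct_index:
            hits += 1
    return hits / len(items)

def compare_runs(before, after):
    """Report accuracy without and with ICD on the same question set."""
    b = mc1_accuracy(before)
    a = mc1_accuracy(after)
    return {"before": b, "after": a, "delta": a - b}

# Toy scores for four questions: baseline decoding vs. contrastive decoding.
baseline_scores = [([-1.0, -2.0], 0), ([-1.0, -0.5], 0), ([-3.0, -1.0], 1), ([-0.2, -0.9], 1)]
icd_scores = [([-1.0, -2.0], 0), ([-0.4, -0.5], 0), ([-3.0, -1.0], 1), ([-0.2, -0.9], 1)]
report = compare_runs(baseline_scores, icd_scores)
```

Running the same helper on an internal QA set, alongside a wall-clock latency measurement for both decoding modes, gives a simple view of the quality/latency tradeoff before committing to a smaller anti-model.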
Optimization Features
Training Optimization
- LoRA
Inference Optimization
- Contrastive decoding runs two forward passes per step (main model and anti-model); a smaller anti-model can reduce the overhead
Reproducibility
Code Urls
- https://github.com/hillzhang1999/ICD
Data Urls
- HaluEval (Li et al. 2023) and TruthfulQA/FACTSCORE (public benchmarks referenced)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Adds inference cost: contrastive decoding runs two forward passes (~1.6x latency).
- Evaluated on two benchmarks (TruthfulQA, FACTSCORE); generality to other tasks and domains is unproven.
- Best results require thousands of induced hallucination samples (10k synthetic); a small real failure set (294 samples) helps but does not match that scale.
When Not To Use
- When strict low-latency requirements rule out extra forward passes.
- When you cannot obtain any model logits (black-box API only).
- When you lack budget or tooling to generate or curate thousands of hallucination examples.
Failure Modes
- If the hallucination model is poorly matched, ICD can penalize useful tokens and harm quality.
- Direct fine-tuning on factual data without contrast can increase hallucinations and response ratio.
- Reversed-contrast setups can produce fluent but fully fabricated text.
Core Entities
Models
- Llama2-7B-Chat
- Llama2-7B-Base
- Llama2-13B-Chat
- Llama2-70B-Chat
- Mistral-7B-Instruct
- Baichuan2-7B-Chat
- ChatGPT
- GPT4
Metrics
- MC1
- MC2
- MC3
- FACTSCORE (%response, #facts, score)
- MMLU 5-shot
- ARC 5-shot
- AlpacaEval2.0 win rate vs GPT-4-turbo
Datasets
- TruthfulQA
- FACTSCORE
- HaluEval
- Wikipedia (for BIOs)
Benchmarks
- TruthfulQA
- FACTSCORE
- MMLU
- ARC
- AlpacaEval2.0

