Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications

December 25, 2023 | 7 min

Overview

Decision Snapshot: Needs Validation

ICD is a practical decoding-time method that improves factuality on the tested benchmarks; gains are well supported but depend on induced-data scale and add inference cost.

Citations: 4

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 60%

Authors

Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ICD is a relatively low-risk way to reduce factual errors at runtime without retraining the whole model, which can improve user trust in QA and content generation pipelines. The trade-off is extra inference compute: about 1.6x generation latency, because a second model runs during decoding.

Who Should Care

Summary TLDR

The paper introduces Induce-then-Contrast Decoding (ICD). The authors first fine-tune a model on synthetic or real non-factual examples so that it becomes deliberately ‘weak’ and tends to fabricate facts. During decoding, they contrast the original model with this weak model and down-weight the tokens the weak model favors. ICD improves truthfulness on TruthfulQA and factual precision on FACTSCORE across multiple open LLMs and sizes, at the cost of about 1.6x generation latency. Code and data are provided.
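The contrast step described above can be illustrated with a minimal sketch. It assumes a generic contrastive formulation in which the base model's log-probabilities are amplified and the weak model's are subtracted; the function names and the `alpha` weight are hypothetical and may not match the paper's exact parameterization.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def contrast_scores(base_logits, weak_logits, alpha=1.0):
    """Amplify the base model and subtract the hallucination-prone model.

    alpha is an assumed contrast strength; alpha=0 recovers the base model.
    """
    base_lp = log_softmax(np.asarray(base_logits, dtype=float))
    weak_lp = log_softmax(np.asarray(weak_logits, dtype=float))
    return (1.0 + alpha) * base_lp - alpha * weak_lp

# Toy 3-token vocabulary (made-up logits for one decoding step).
base = [2.0, 1.9, 0.0]   # base model slightly prefers token 0
weak = [3.0, 0.5, 0.0]   # weak model strongly prefers token 0
next_token = int(np.argmax(contrast_scores(base, weak)))
```

In this toy step the base model alone would pick token 0, but because the weak model favors token 0 even more strongly, the contrasted scores select token 1 instead.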

Problem Statement

Large language models sometimes generate false facts (hallucinations). Changing pretraining or supervised fine-tuning is costly and can backfire. The paper asks: can a decoding-time method reduce factual errors by constructing a model that hallucinates and using it as a penalty during generation?

Main Contribution

Propose ICD: build a factually weak model (induced hallucinations) and apply contrastive decoding to penalize hallucinated token probabilities.

Show ICD improves truthfulness on discriminative QA (TruthfulQA) and factual precision on generation (FACTSCORE) across Llama2, Baichuan2, and Mistral.

Key Findings

ICD (finetuning-based induction) raises Llama2-7B-Chat TruthfulQA MC1 by +8.70, MC2 by +14.48, MC3 by +13.13

Numbers: MC1 +8.70; MC2 +14.48; MC3 +13.13 (Table 1)

Practical Use: If you add a 7B hallucination model and apply ICD, expect substantial QA truthfulness gains on evaluated benchmarks.

Evidence Ref: Table 1, §4.2

ICD improves factual precision on FACTSCORE from 63.8 to 66.3 for Llama2-7B-Chat (+2.5 points)

Numbers: FACTSCORE score +2.5 (63.8 → 66.3) (Table 2)

Practical Use: For biography-like generation, ICD increases factual accuracy without increasing response rate or shrinking fact counts.

Evidence Ref: Table 2, §4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
TruthfulQA (Llama2-7B-Chat) MC1 | 37.62 → 46.32 (+8.70) | greedy decoding | +8.70 | TruthfulQA | Table 1 main results | Table 1, §4.2
TruthfulQA (Llama2-7B-Chat) MC2 | 54.60 → 69.08 (+14.48) | greedy decoding | +14.48 | TruthfulQA | Table 1 main results | Table 1, §4.2
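As a quick arithmetic check, the reported deltas can be recomputed from the baseline and ICD values quoted in this summary. The sketch below uses only those figures; the helper names are illustrative, and the relative gain is an added convenience rather than a number reported in the paper.

```python
# (baseline, with ICD) pairs copied from the figures quoted in this summary.
results = {
    "TruthfulQA MC1": (37.62, 46.32),
    "TruthfulQA MC2": (54.60, 69.08),
    "FACTSCORE score": (63.8, 66.3),
}

def delta(baseline, improved):
    """Absolute gain in points, rounded to two decimals."""
    return round(improved - baseline, 2)

def relative_gain(baseline, improved):
    """Gain as a percentage of the baseline value."""
    return 100.0 * (improved - baseline) / baseline

deltas = {name: delta(*pair) for name, pair in results.items()}

for name, (baseline, improved) in results.items():
    print(f"{name}: {deltas[name]:+.2f} points, "
          f"{relative_gain(baseline, improved):.1f}% relative")
```

The absolute deltas match the stated +8.70, +14.48, and +2.5, which confirms the baseline and ICD values are read in the right order.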

What To Try In 7 Days

Fine-tune a small 'anti-expert' with LoRA on 1–10k synthetic hallucinated samples and run ICD at inference.

Measure truthfulness on a small domain set (TruthfulQA subset or internal QA checks) before/after.

If latency matters, contrast the main model with a smaller anti-model to cut cost and re-check quality/latency tradeoffs.
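The before/after measurement in the steps above can be sketched as a simplified MC1-style check: for each question, the choice with the highest score counts as the model's answer. The per-choice log-probabilities below are made-up toy values, and `contrast`, `alpha`, and the data layout are assumptions for illustration, not the paper's evaluation code.

```python
def mc1_accuracy(choice_scores, correct_index):
    """Fraction of questions where the top-scored choice is the correct one.

    choice_scores: one list of per-choice scores per question.
    correct_index: index of the correct choice for each question.
    """
    hits = 0
    for scores, gold in zip(choice_scores, correct_index):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == gold)
    return hits / len(correct_index)

def contrast(base_scores, weak_scores, alpha=1.0):
    """Contrast per-choice log-probabilities from the base and weak models."""
    return [
        [(1 + alpha) * b - alpha * w for b, w in zip(bs, ws)]
        for bs, ws in zip(base_scores, weak_scores)
    ]

# Toy log-probabilities: 3 questions, 3 answer choices each (invented values).
base = [[-1.0, -1.2, -3.0], [-0.5, -2.0, -2.5], [-1.5, -1.4, -2.0]]
weak = [[-0.3, -2.5, -3.0], [-1.0, -1.5, -2.0], [-2.5, -0.4, -2.2]]
gold = [1, 0, 0]

before = mc1_accuracy(base, gold)
after = mc1_accuracy(contrast(base, weak), gold)
print(f"MC1 before: {before:.2f}, after: {after:.2f}")
```

The toy data is constructed so the contrast helps; on a real domain set the same harness would show whether it helps, hurts, or makes no difference.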

Optimization Features

Training Optimization
LoRA
Inference Optimization
Contrastive decoding needs a forward pass through both models at each step; a smaller anti-model can reduce the added cost

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://github.com/hillzhang1999/ICD

HaluEval (Li et al. 2023) and TruthfulQA/FACTSCORE (public benchmarks referenced)

Risks & Boundaries

Limitations

Adds inference cost: contrastive decoding runs two forward passes (~1.6x latency).

Evaluated on two benchmarks (TruthfulQA, FACTSCORE); generality to other tasks and domains is unproven.

When Not To Use

When strict low-latency requirements rule out extra forward passes.

When you cannot obtain any model logits (black-box API only).
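Whether the latency constraint rules ICD out can be estimated with simple arithmetic before any implementation work. The sketch below assumes the roughly 1.6x multiplier quoted in this summary carries over to your setup, which should be verified by measurement; the function names and the example budget are hypothetical.

```python
def icd_latency_ms(baseline_ms, overhead=1.6):
    """Estimated generation latency with ICD enabled.

    overhead=1.6 is the approximate multiplier quoted in this summary;
    the real value depends on hardware, batching, and anti-model size.
    """
    return baseline_ms * overhead

def fits_budget(baseline_ms, budget_ms, overhead=1.6):
    """True if the estimated ICD latency stays within the latency budget."""
    return icd_latency_ms(baseline_ms, overhead) <= budget_ms

# Example: a 500 ms baseline checked against two hypothetical budgets.
print(fits_budget(500, 1000))  # enough headroom for the extra passes
print(fits_budget(500, 700))   # budget too tight at a 1.6x multiplier
```

A smaller anti-model would lower the effective multiplier, so the `overhead` argument is the number to re-measure when trying that variant.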

Failure Modes

If the hallucination model is poorly matched, ICD can penalize useful tokens and harm quality.

Direct fine-tuning on factual data without contrast can increase hallucinations and response ratio.
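One common guard against penalizing useful tokens, taken from the general contrastive decoding literature, is a plausibility cutoff: only tokens the base model already considers reasonably likely are eligible for contrast. This summary does not state whether or how ICD applies such a guard, so the sketch below is an assumption-labeled illustration with hypothetical names and a hypothetical `cutoff` value.

```python
import numpy as np

def plausible_mask(base_probs, cutoff=0.1):
    """True for tokens whose base probability is at least cutoff * max prob."""
    base_probs = np.asarray(base_probs, dtype=float)
    return base_probs >= cutoff * base_probs.max()

def guarded_contrast(base_probs, weak_probs, alpha=1.0, cutoff=0.1):
    """Contrast scores, with implausible tokens masked to -inf."""
    base_probs = np.asarray(base_probs, dtype=float)
    weak_probs = np.asarray(weak_probs, dtype=float)
    scores = (1 + alpha) * np.log(base_probs) - alpha * np.log(weak_probs)
    return np.where(plausible_mask(base_probs, cutoff), scores, -np.inf)

# Toy distributions (invented): token 2 is unlikely under both models,
# but even less likely under the weak model, so raw contrast inflates it.
base = [0.60, 0.39, 0.01]
weak = [0.80, 0.1999, 0.0001]

unguarded = int(np.argmax(guarded_contrast(base, weak, cutoff=0.0)))
guarded = int(np.argmax(guarded_contrast(base, weak, cutoff=0.1)))
```

Without the cutoff, the contrast promotes the near-impossible token 2; with it, token 2 is excluded and the contrast chooses between the tokens the base model finds plausible.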

Core Entities

Models

Llama2-7B-Chat, Llama2-7B-Base, Llama2-13B-Chat, Llama2-70B-Chat, Mistral-7B-Instruct, Baichuan2-7B-Chat, ChatGPT, GPT4

Metrics

MC1; MC2; MC3; FACTSCORE (%response, #facts, score); MMLU 5-shot; ARC 5-shot; AlpacaEval2.0 win rate vs GPT-4-turbo

Datasets

TruthfulQA, FACTSCORE, HaluEval, Wikipedia (for BIOs)

Benchmarks

TruthfulQA, FACTSCORE, MMLU, ARC, AlpacaEval2.0