Overview
ITI is a pragmatic, low-cost inference hack that boosts truthfulness on specific benchmarks, but it requires activation access and careful tuning to avoid harming helpfulness.
Citations: 39
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 8/8
Findings with evidence refs: 8/8
Results with explicit delta: 5/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
ITI is a low-cost way to reduce factual errors without heavy finetuning. It can be added quickly to deployed models that expose their activations, improving trustworthiness.
Summary TLDR
The paper introduces Inference-Time Intervention (ITI): find a small set of attention-head directions correlated with truthfulness, then add a scaled shift to those heads' activations during generation. With lightweight supervision (hundreds of QA examples) and near-zero compute overhead, ITI raises truthfulness on the adversarial TruthfulQA benchmark (e.g., Alpaca's true*informative score rises from 32.5% to 65.1%) and gives modest gains on other QA datasets. There is a clear trade-off: stronger intervention increases truthfulness but can reduce helpfulness or push the model toward non-answers. ITI requires access to internal activations and careful tuning of two hyperparameters: the number of intervened heads K and the intervention strength α.
Problem Statement
Large language models sometimes 'know' correct facts internally but still generate false or misleading answers. Can we steer a pretrained model at inference time, without expensive RL or full finetuning, so that it outputs more truthful answers using only a small labeled dataset?
Main Contribution
Define Inference-Time Intervention (ITI): during autoregressive decoding, add a scaled vector shift along 'truthful' directions in a small set of attention heads.
Show that probes reveal a gap between the accuracy of intermediate representations and the accuracy of generated text; ITI closes part of that gap and substantially raises TruthfulQA scores.
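The shift described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `apply_iti_shift`, the dictionary layout keyed by (layer, head), and the scaling of each shift by a per-head standard deviation are assumptions made for the example. Plain Python lists stand in for activation tensors so the sketch has no dependencies.

```python
from typing import Dict, List, Tuple

Vector = List[float]
HeadKey = Tuple[int, int]  # (layer index, head index)


def apply_iti_shift(
    head_outputs: Dict[HeadKey, Vector],
    directions: Dict[HeadKey, Vector],
    stds: Dict[HeadKey, float],
    alpha: float,
) -> Dict[HeadKey, Vector]:
    """Return a copy of head_outputs where each selected head is shifted.

    Only heads that appear in `directions` (the chosen top-K heads) are
    changed. Each is moved by alpha * std * direction; the std factor is an
    assumption of this sketch, not something the summary specifies.
    """
    shifted: Dict[HeadKey, Vector] = {}
    for key, activation in head_outputs.items():
        if key in directions:
            scale = alpha * stds[key]
            shifted[key] = [a + scale * d for a, d in zip(activation, directions[key])]
        else:
            shifted[key] = list(activation)  # untouched head, copied as-is
    return shifted


# Toy example: two heads, only head (0, 1) is selected for intervention.
head_outputs = {(0, 0): [1.0, 2.0], (0, 1): [0.5, -0.5]}
directions = {(0, 1): [0.0, 1.0]}
stds = {(0, 1): 2.0}
shifted = apply_iti_shift(head_outputs, directions, stds, alpha=1.5)
```

In a real model the same update would be applied to the selected heads' outputs at every decoding step, which is why the method needs write access to activations.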
Key Findings
ITI substantially increases truthfulness on TruthfulQA for instruction-tuned models (e.g., Alpaca true*informative 32.5% to 65.1%).
For base LLaMA-7B, probing shows internal activations contain truth signals not shown in outputs.
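One way to look for such internal truth signals is to fit a small linear probe on each head's activations and rank heads by held-out accuracy. The sketch below assumes a logistic-regression probe trained by plain gradient descent; the probe type, the helper names (`train_probe`, `probe_accuracy`, `rank_heads`), and the toy activations are all illustrative assumptions, not details taken from the summary.

```python
import math
from typing import Dict, List, Tuple

HeadKey = Tuple[int, int]  # (layer index, head index)
Dataset = Tuple[List[List[float]], List[int]]  # (activations, 0/1 truth labels)


def _sigmoid(z: float) -> float:
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(xs: List[List[float]], ys: List[int], lr: float = 0.5, epochs: int = 200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: List[float], b: float, xs: List[List[float]], ys: List[int]) -> float:
    """Fraction of examples where the probe's sign matches the label."""
    hits = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)


def rank_heads(train: Dict[HeadKey, Dataset], val: Dict[HeadKey, Dataset]):
    """Return (head, validation accuracy) pairs, best head first."""
    scores = []
    for key, (xs, ys) in train.items():
        w, b = train_probe(xs, ys)
        val_xs, val_ys = val[key]
        scores.append((key, probe_accuracy(w, b, val_xs, val_ys)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy activations (made up): head (0, 0) separates true from false answers,
# head (0, 1) emits the same vector regardless of label.
train = {
    (0, 0): ([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.1]], [1, 1, 0, 0]),
    (0, 1): ([[0.3, 0.3]] * 4, [1, 1, 0, 0]),
}
val = {
    (0, 0): ([[0.7, 0.0], [-0.6, 0.0]], [1, 0]),
    (0, 1): ([[0.3, 0.3], [0.3, 0.3]], [1, 0]),
}
ranked = rank_heads(train, val)
```

Heads whose probes score well on held-out data are the candidates for intervention; a head near chance accuracy carries little usable truth signal.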
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| True*Informative (LLaMA-7B baseline) | 30.5% | — | — | TruthfulQA generation | Table 1 baseline | Table 1 |
| True*Informative (LLaMA-7B + ITI) | 43.5% | 30.5% | +13.0 pp | TruthfulQA generation | ITI applied on LLaMA-7B using 5% training | Table 1 |
What To Try In 7 Days
Run probes on a small labeled set (50–500 QA pairs) to rank truth-related heads.
Implement mass-mean shift on the top K heads; sweep K and α on a dev holdout while tracking true*informative, CE, and KL.
Evaluate with a small human annotation pass on high-risk question categories to confirm reduced harmful errors.
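The direction-finding and sweep steps above can be prototyped as follows. This is a hedged sketch: it takes the mass-mean direction to be the difference between the mean activation on true answers and the mean on false answers, and it picks the (K, α) pair with the best true*informative score among those under a KL budget. The names `mass_mean_direction` and `sweep`, the `max_kl` threshold, and the `evaluate` callback are hypothetical, and the numbers in `toy_results` are invented placeholders, not results from the paper.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def mass_mean_direction(true_acts: Sequence[Vector], false_acts: Sequence[Vector]) -> Vector:
    """Difference between the mean 'true' and mean 'false' activation."""
    dim = len(true_acts[0])
    mean_true = [sum(v[i] for v in true_acts) / len(true_acts) for i in range(dim)]
    mean_false = [sum(v[i] for v in false_acts) / len(false_acts) for i in range(dim)]
    return [t - f for t, f in zip(mean_true, mean_false)]


def sweep(
    ks: Sequence[int],
    alphas: Sequence[float],
    evaluate: Callable[[int, float], Tuple[float, float]],
    max_kl: float,
) -> Optional[Tuple[int, float, float]]:
    """Grid-search K and alpha on a dev holdout.

    `evaluate(k, alpha)` is a user-supplied (hypothetical) function that
    returns (true*informative score, KL divergence from the original model).
    Settings whose KL exceeds `max_kl` are rejected; the best remaining
    score wins. Returns (k, alpha, score), or None if nothing qualifies.
    """
    best: Optional[Tuple[int, float, float]] = None
    for k, alpha in product(ks, alphas):
        score, kl = evaluate(k, alpha)
        if kl > max_kl:
            continue  # drifts too far from the original model's behaviour
        if best is None or score > best[2]:
            best = (k, alpha, score)
    return best


# Toy direction from made-up activations for a single head.
direction = mass_mean_direction(
    true_acts=[[1.0, 2.0], [3.0, 4.0]],
    false_acts=[[0.0, 0.0], [2.0, 2.0]],
)

# Placeholder dev-set numbers, invented purely to exercise the sweep.
toy_results = {
    (16, 5.0): (0.40, 0.05),
    (16, 15.0): (0.48, 0.20),
    (48, 5.0): (0.45, 0.10),
    (48, 15.0): (0.60, 0.90),  # highest score, but over the KL budget
}
best = sweep([16, 48], [5.0, 15.0], lambda k, a: toy_results[(k, a)], max_kl=0.5)
```

Cross-entropy (CE) could be gated the same way as KL by having `evaluate` return it as a third value; the single-threshold form is kept here only for brevity.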
Optimization Features
Inference Optimization
Risks & Boundaries
Limitations
Requires access to internal activations and the ability to modify attention outputs at inference.
Evaluated primarily on TruthfulQA; gains on other datasets are smaller and uneven.
When Not To Use
You cannot read or modify model activations at inference (closed API).
Conversational helpfulness and diversity are a higher priority than conservative truthfulness.
Failure Modes
Overcorrection: model gives less informative or evasive replies ('I have no comment').
Miscalibration: in some categories, previously correct answers flip to incorrect.

