Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
39
Why It Matters For Business
ITI is a low-cost way to reduce factual errors without heavy finetuning. It can be added quickly to any deployed model that exposes its activations, improving trustworthiness without retraining.
Summary TLDR
The paper introduces Inference-Time Intervention (ITI): identify a small set of attention heads whose activations contain directions correlated with truthfulness, then add a scaled shift along those directions to the heads' activations during generation. With lightweight supervision (a few hundred QA examples) and near-zero compute overhead, ITI raises truthfulness on the adversarial TruthfulQA benchmark (e.g., Alpaca True*Informative rises from 32.5% to 65.1%) and gives modest gains on other QA datasets. There is a clear trade-off: stronger intervention increases truthfulness but can reduce helpfulness or push the model toward non-answers. ITI requires access to internal activations and careful tuning of two hyperparameters: the number of heads K and the intervention strength α.
Problem Statement
Large language models sometimes 'know' correct facts internally but still generate false or misleading answers. Can we steer a pretrained model at inference time, without expensive RL or full finetuning, so that it outputs more truthful answers using only a small labeled dataset?
Main Contribution
Define Inference-Time Intervention (ITI): during autoregressive decoding, add a scaled vector shift along 'truthful' directions in a small set of attention heads.
Show through probing that intermediate representations encode truth more accurately than the model's generated answers reflect; ITI closes part of that gap and significantly boosts TruthfulQA scores.
Demonstrate ITI is cheap and data-efficient: truthful directions found with a few hundred examples and intervention adds near-zero runtime cost.
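The shift described above can be sketched for a single attention head as follows. This is a minimal illustration, not the authors' code: the function names and the toy activations are made up, and scaling the shift by the standard deviation of activations along the direction (sigma) reflects my understanding of the paper rather than anything stated in this summary.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mass_mean_direction(true_acts, false_acts):
    """Unit vector pointing from the mean 'false' activation to the mean 'true' one."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    diff = [t - f for t, f in zip(mu_true, mu_false)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection_std(acts, direction):
    """Standard deviation of activations projected onto the direction."""
    proj = [sum(a * d for a, d in zip(row, direction)) for row in acts]
    mean = sum(proj) / len(proj)
    return math.sqrt(sum((p - mean) ** 2 for p in proj) / len(proj))

def intervene(head_act, direction, sigma, alpha):
    """Shift one head's activation by alpha * sigma along the direction.

    The sigma factor is an assumption about how the shift is scaled.
    """
    return [x + alpha * sigma * d for x, d in zip(head_act, direction)]

# Toy 2-d head activations (made-up numbers, not from the paper).
true_acts = [[1.0, 2.0], [3.0, 2.0]]
false_acts = [[-1.0, 2.0], [-3.0, 2.0]]
theta = mass_mean_direction(true_acts, false_acts)
sigma = projection_std(true_acts + false_acts, theta)
shifted = intervene([0.0, 0.0], theta, sigma, alpha=2.0)
```

In a real model the same shift would be applied to each selected head at every decoding step, typically through an activation hook.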
Key Findings
ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.
For base LLaMA-7B, probing shows that internal activations contain truth signals that are not reflected in the model's outputs.
Selecting a small set of heads and using mass-mean shift works best.
ITI generalizes modestly out of distribution on other QA benchmarks.
There is a trade-off between truthfulness and helpfulness; True*Informative follows an inverted U as intervention strength increases.
ITI is computationally cheap and data-efficient.
At tuned settings, ITI changes the model's output distribution only modestly.
Perturbing all heads or doing dense interventions performs worse than head-wise sparse selection.
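Sparse head selection can be illustrated by scoring each head with a probe and keeping the K best. The sketch below swaps in a simple class-mean probe (project onto the difference of class means, threshold at the midpoint) to stay dependency-free; as far as I know the paper fits logistic-regression probes instead. The data layout, keyed by hypothetical (layer, head) tuples, is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_mean_probe(acts, labels):
    """Return (direction, threshold) from class means; labels are 1=true, 0=false."""
    mu_t = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    mu_f = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    direction = [t - f for t, f in zip(mu_t, mu_f)]
    threshold = (dot(mu_t, direction) + dot(mu_f, direction)) / 2
    return direction, threshold

def probe_accuracy(probe, acts, labels):
    """Fraction of held-out examples the probe classifies correctly."""
    direction, threshold = probe
    preds = [1 if dot(a, direction) > threshold else 0 for a in acts]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def top_k_heads(train, val, k):
    """train/val map (layer, head) -> (activations, labels); returns the best k head ids."""
    scores = {}
    for head_id, (acts, labels) in train.items():
        probe = fit_mean_probe(acts, labels)
        val_acts, val_labels = val[head_id]
        scores[head_id] = probe_accuracy(probe, val_acts, val_labels)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data: head (0, 0) separates the classes, head (0, 1) does not.
train = {
    (0, 0): ([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]], [1, 1, 0, 0]),
    (0, 1): ([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.1], [-1.0, 0.1]], [1, 1, 0, 0]),
}
val = {
    (0, 0): ([[1.5, 0.0], [-1.5, 0.0]], [1, 0]),
    (0, 1): ([[0.0, 0.0], [0.0, 0.0]], [1, 0]),
}
selected, scores = top_k_heads(train, val, k=1)
```

Only the heads in `selected` would then receive the intervention; all other heads are left untouched.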
Results
True*Informative (LLaMA-7B baseline)
True*Informative (LLaMA-7B + ITI)
True*Informative (Alpaca): 32.5%
True*Informative (Alpaca + ITI): 65.1%
Accuracy
Generalization (Natural Questions MC acc.)
CE (cross entropy) change after ITI
Who Should Care
What To Try In 7 Days
Run probes on a small labeled set (50–500 QA pairs) to rank truth-related heads.
Implement mass-mean shift on the top K heads; sweep K and α on a held-out dev set while tracking True*Informative, cross entropy (CE), and KL divergence.
Evaluate with a small human annotation pass on high-risk question categories to confirm reduced harmful errors.
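The K and α sweep from the steps above might look like the grid search below. It is a sketch under stated assumptions: `toy_evaluate` is a hypothetical stand-in that returns made-up numbers shaped like the trade-off described in this summary, and the KL budget is an illustrative guard, not a value from the paper. In practice `evaluate` would run the intervened model on the dev split.

```python
from itertools import product

def toy_evaluate(k, alpha):
    """Hypothetical stand-in for a dev-set run; replace with real measurements."""
    true_rate = min(1.0, 0.3 + 0.001 * k + 0.02 * alpha)
    info_rate = max(0.0, 1.0 - 0.001 * alpha ** 2)
    kl = 0.0005 * k * alpha
    return {"true": true_rate, "info": info_rate, "kl": kl}

def sweep(evaluate, ks, alphas, kl_budget):
    """Return the (K, alpha) setting with the best true*info score under a KL budget."""
    best = None
    for k, alpha in product(ks, alphas):
        metrics = evaluate(k, alpha)
        if metrics["kl"] > kl_budget:
            continue  # drifts too far from the original model
        score = metrics["true"] * metrics["info"]
        if best is None or score > best["score"]:
            best = {"k": k, "alpha": alpha, "score": score, **metrics}
    return best

best = sweep(toy_evaluate, ks=[16, 48], alphas=[5, 15, 25], kl_budget=0.5)
```

With the toy numbers, the largest α loses on the combined score because informativeness collapses, which mirrors the inverted-U behavior noted in the findings.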
Optimization Features
Inference Optimization
- Adds a constant vector per layer; near-zero runtime overhead
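One way to see why the overhead is near zero: each head's shift is constant, so it can be pushed through the attention output projection once, offline, and summed into a single bias vector per layer. The sketch below assumes the output projection can be split into one block per head; the names `head_shifts` and `out_proj_blocks` and the toy matrices are illustrative.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def layer_bias(head_shifts, out_proj_blocks):
    """Fold per-head constant shifts into one residual-stream bias for a layer.

    head_shifts: {head_index: shift vector in head space}
    out_proj_blocks: {head_index: matrix mapping head space -> residual space}
    """
    d_model = len(next(iter(out_proj_blocks.values())))
    bias = [0.0] * d_model
    for h, shift in head_shifts.items():
        contribution = matvec(out_proj_blocks[h], shift)
        bias = [b + c for b, c in zip(bias, contribution)]
    return bias

# Toy layer: 2 heads of width 2, residual width 3 (made-up numbers).
out_proj_blocks = {
    0: [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    1: [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]],
}
head_shifts = {0: [0.5, -1.0], 1: [2.0, 1.0]}
bias = layer_bias(head_shifts, out_proj_blocks)
```

At inference the precomputed `bias` is simply added to the attention output of that layer, costing one vector addition per token.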
Reproducibility
Data Urls
- TruthfulQA (public)
- NaturalQuestions (public)
- TriviaQA (public)
- MMLU (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires access to internal activations and ability to modify attention outputs at inference.
- Evaluated primarily on TruthfulQA; gains on other datasets are smaller and uneven.
- Strong interventions can reduce helpfulness or increase non-answers like 'I have no comment.'
- ITI shifts activations linearly; causal interpretation of directions is incomplete.
When Not To Use
- You cannot read or modify model activations at inference (closed API).
- You value conversational helpfulness and diversity over conservative truthfulness.
- If you lack any labeled examples to find truthful directions.
Failure Modes
- Overcorrection: model gives less informative or evasive replies ('I have no comment').
- Miscalibration: in some categories, previously correct answers flip to incorrect.
- Direction misidentification when training data is too small or biased.
- Intervention could amplify dataset-specific biases represented in the probe labels.
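Overcorrection is easy to monitor with a crude non-answer rate. The check below is a rough sketch: the only marker phrase taken from this summary is 'I have no comment', substring matching is an assumption, and the sample replies are invented.

```python
def non_answer_rate(replies, markers=("i have no comment",)):
    """Fraction of replies containing any evasive marker phrase (case-insensitive)."""
    hits = sum(any(m in reply.lower() for m in markers) for reply in replies)
    return hits / len(replies)

# Invented sample replies for illustration.
replies = [
    "I have no comment.",
    "Paris is the capital of France.",
    "i have no comment",
    "No.",
]
rate = non_answer_rate(replies)
```

Tracking this rate alongside truthfulness while sweeping α gives an early signal that the intervention is becoming too strong.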
Core Entities
Models
- LLaMA-7B
- Alpaca
- Vicuna
- LLaMA (family)
Metrics
- True*Informative
- True (%)
- Accuracy
- Cross Entropy (CE)
- KL divergence (next-token)
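Two of these metrics are simple enough to sketch. The code assumes True*Informative is the product of the fraction of answers judged true and the fraction judged informative, which matches my reading of the paper but is not stated in this summary. The KL direction, KL(p || q) in nats, is also an assumption, with p and q standing for the next-token distributions before and after intervention.

```python
import math

def true_times_informative(true_flags, informative_flags):
    """Product of the true rate and the informative rate (assumed definition)."""
    true_rate = sum(true_flags) / len(true_flags)
    info_rate = sum(informative_flags) / len(informative_flags)
    return true_rate * info_rate

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy judgments for four answers (1 = yes, 0 = no); made-up values.
score = true_times_informative([1, 1, 0, 0], [1, 1, 1, 0])
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
peaked = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

A KL near zero means the intervened model's next-token predictions stay close to those of the original model.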
Datasets
- TruthfulQA
- Natural Questions
- TriviaQA
- MMLU
- OpenWebText
Benchmarks
- TruthfulQA
- Natural Questions
- TriviaQA
- MMLU

