Shift a few attention-head activations at inference to make LLMs answer more truthfully

June 6, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

39

Authors

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

Links

Abstract / PDF

Why It Matters For Business

ITI is a low-cost way to reduce factual errors without heavy finetuning. It can be added quickly to any deployed model that exposes its internal activations, improving trustworthiness.

Summary TLDR

The paper introduces Inference-Time Intervention (ITI): find a small set of attention-head directions correlated with truth and add a scaled shift to those head activations during generation. With lightweight supervision (hundreds of QA examples) and near-zero compute overhead, ITI raises truthfulness on the adversarial TruthfulQA benchmark (e.g., Alpaca true*informative 32.5% -> 65.1%) and gives modest gains on other QA datasets. There is a clear trade-off: stronger intervention increases truth but can reduce helpfulness or push the model toward non-answers. ITI requires access to internal activations and careful tuning of two hyperparameters (K heads, strength α).

Problem Statement

Large language models sometimes "know" a correct fact internally yet still generate a false or misleading answer. Can a pretrained model be steered at inference time, without expensive RL or full finetuning, to output more truthful answers using only a small labeled dataset?

Main Contribution

Define Inference-Time Intervention (ITI): during autoregressive decoding, add a scaled vector shift along 'truthful' directions in a small set of attention heads.

Show with probes that intermediate activations separate true from false statements more accurately than the model's generated answers do; ITI closes part of that gap and significantly boosts TruthfulQA scores.

Demonstrate ITI is cheap and data-efficient: truthful directions found with a few hundred examples and intervention adds near-zero runtime cost.
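
In symbols, the intervention described above can be written roughly as follows. This is a sketch in the notation of a standard multi-head attention residual update, and the symbol names are labels chosen here for illustration: P projects the residual stream x into head h of layer l, Q projects the head output back, θ is the unit "truthful" direction for that head, σ is the standard deviation of activations along θ, and α is the intervention strength. For heads outside the selected top K, the added term is zero.

```latex
x_{l+1} = x_l + \sum_{h=1}^{H} Q_l^h \left( \operatorname{Att}_l^h\!\left(P_l^h x_l\right) + \alpha\, \sigma_l^h\, \theta_l^h \right)
```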

Key Findings

ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.

Numbers: Alpaca true*informative 32.5% -> 65.1%

For base LLaMA-7B, probing shows internal activations contain truth signals not shown in outputs.

Numbers: Probe vs generation gap ≈ 40% (probe higher)

Selecting a small set of heads and using mass-mean shift works best.

Numbers: Mass mean shift True*Info 42.3% vs probe-weight 34.8% (LLaMA-7B)
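
A minimal sketch of the mass-mean shift named above, assuming you have already collected one head's activations for true and false answers as plain lists of floats. The function names and the toy data are hypothetical, and scaling the shift by the standard deviation of the projected activations is an assumption about how the step size is normalized.

```python
import math

def mass_mean_direction(true_acts, false_acts):
    """Unit vector pointing from the mean false activation to the mean true one."""
    dim = len(true_acts[0])
    mean_true = [sum(a[i] for a in true_acts) / len(true_acts) for i in range(dim)]
    mean_false = [sum(a[i] for a in false_acts) / len(false_acts) for i in range(dim)]
    diff = [t - f for t, f in zip(mean_true, mean_false)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to shift along")
    return [d / norm for d in diff]

def projection_std(acts, direction):
    """Standard deviation of the activations projected onto the direction."""
    proj = [sum(a_i * d_i for a_i, d_i in zip(a, direction)) for a in acts]
    mean = sum(proj) / len(proj)
    return math.sqrt(sum((p - mean) ** 2 for p in proj) / len(proj))

def shift_head(activation, direction, sigma, alpha):
    """Add alpha * sigma * direction to a single head activation."""
    return [a + alpha * sigma * d for a, d in zip(activation, direction)]

# Toy, made-up activations for one 2-dimensional head (illustration only).
true_acts = [[1.0, 2.0], [3.0, 2.0]]
false_acts = [[-1.0, 2.0], [-3.0, 2.0]]

direction = mass_mean_direction(true_acts, false_acts)
sigma = projection_std(true_acts + false_acts, direction)
shifted = shift_head([0.0, 0.0], direction, sigma, alpha=2.0)
```

The probe-weight alternative in the comparison would replace `mass_mean_direction` with the weight vector of a trained classifier; the shifting step stays the same.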

ITI generalizes modestly out of distribution on other QA benchmarks.

Numbers: NaturalQ 46.6% -> 51.3%; MMLU 35.71% -> 40.16%

There is a trade-off between truthfulness and helpfulness; scores follow an upside-down U with intervention strength.

Numbers: True*informative rises then falls as α increases (Figure 4/6)

ITI is computationally cheap and data-efficient.

Numbers: Requires as few as 40 samples to locate truthful heads; intervention is a constant per-layer vector

ITI changes model output distribution modestly at tuned settings.

Numbers: LLaMA-7B Baseline CE 2.16 -> Baseline+ITI CE 2.48; KL 0.40
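
The two drift measures can be computed from next-token logits along these lines. This is a sketch, not the paper's evaluation code: the helper names and logits are made up, and the KL direction used here (original model relative to intervened model) is an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the token that actually came next."""
    return -math.log(probs[target_index])

# Made-up logits over a 3-token vocabulary for one position (illustration only).
p_base = softmax([2.0, 1.0, 0.0])   # original model
p_iti = softmax([1.5, 1.5, 0.0])    # same position with the intervention applied

drift = kl_divergence(p_base, p_iti)
ce_base = cross_entropy(p_base, target_index=0)
ce_iti = cross_entropy(p_iti, target_index=0)
```

In practice both quantities would be averaged over many positions of held-out text before being compared against the figures above.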

Perturbing all heads or doing dense interventions performs worse than head-wise sparse selection.

Numbers: Head-wise True*Info 42.3% vs without-selection 35.4%

Results

True*Informative (LLaMA-7B baseline)

Value: 30.5%

True*Informative (LLaMA-7B + ITI)

Value: 43.5%

Baseline: 30.5%

True*Informative (Alpaca)

Value: 32.5%

True*Informative (Alpaca + ITI)

Value: 65.1%

Baseline: 32.5%

Probe accuracy (best single head)

Value: 83.3%

Baseline: 50% (random)

Generalization (Natural Questions MC acc.)

Value: 46.6% -> 51.3%

Baseline: 46.6%

CE (cross entropy) change after ITI

Value: 2.16 -> 2.48

Baseline: 2.16

Who Should Care

What To Try In 7 Days

Run probes on a small labeled set (50–500 QA pairs) to rank truth-related heads.

Implement mass-mean shift on the top K heads; sweep K and α on a dev holdout while tracking true*informative, CE, and KL.

Evaluate with a small human annotation pass on high-risk question categories to confirm reduced harmful errors.
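
The head-ranking step above could start from something like the sketch below. It assumes per-head activations have already been extracted and labeled (1 = truthful answer, 0 = false answer). A difference-of-means linear classifier stands in here for whatever probe you actually train; that swap is for brevity. All names and the toy data are hypothetical.

```python
def class_means(acts, labels):
    """Mean activation for label 0 (false) and label 1 (true)."""
    dim = len(acts[0])
    means = {}
    for label in (0, 1):
        rows = [a for a, y in zip(acts, labels) if y == label]
        means[label] = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return means[0], means[1]

def probe_accuracy(train_acts, train_labels, val_acts, val_labels):
    """Fit a difference-of-means probe on train data, score it on validation data."""
    mean_false, mean_true = class_means(train_acts, train_labels)
    direction = [t - f for t, f in zip(mean_true, mean_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mean_true, mean_false)]
    correct = 0
    for act, label in zip(val_acts, val_labels):
        score = sum((a - m) * d for a, m, d in zip(act, midpoint, direction))
        correct += int((score > 0) == (label == 1))
    return correct / len(val_acts)

def rank_heads(head_data, k):
    """head_data maps (layer, head) -> (train_acts, train_labels, val_acts, val_labels)."""
    accuracy = {key: probe_accuracy(*data) for key, data in head_data.items()}
    return sorted(accuracy, key=accuracy.get, reverse=True)[:k]

# Toy, made-up data: head (0, 0) separates the classes, head (0, 1) does not.
head_data = {
    (0, 0): ([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]], [1, 1, 0, 0],
             [[1.5, 0.0], [-1.5, 0.0]], [1, 0]),
    (0, 1): ([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]], [1, 1, 0, 0],
             [[1.5, 0.0], [-1.5, 0.0]], [1, 0]),
}
top_heads = rank_heads(head_data, k=1)
```

The K and α sweep would wrap this selection in a loop that regenerates answers on the dev holdout and records true*informative, CE, and KL for each setting.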

Optimization Features

Inference Optimization

  • Adds a constant vector per layer; near-zero runtime overhead
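
Why the overhead is near zero: the shift is added before a linear output projection, so it can be precomputed once as a constant bias for the layer. The sketch below checks that equivalence on toy matrices. The helper names are hypothetical, and the per-head output projections are assumed to be plain matrices with no nonlinearity between the head output and the residual stream.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def add(u, v):
    """Elementwise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def project_heads(out_proj_per_head, head_outputs):
    """Sum of per-head output projections: sum over h of Q_h @ z_h."""
    total = [0.0] * len(out_proj_per_head[0])
    for q, z in zip(out_proj_per_head, head_outputs):
        total = add(total, matvec(q, z))
    return total

def iti_bias(out_proj_per_head, shifts):
    """Constant layer bias equivalent to shifting head h by shifts[h]."""
    return project_heads(out_proj_per_head, shifts)

# Toy layer with two 2-dimensional heads (made-up numbers).
q_mats = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
shifts = [[0.5, 0.0], [0.0, 0.0]]  # alpha * sigma * theta; zero for unselected heads

# Path A: shift each head activation, then project.
shifted = project_heads(q_mats, [add(z, s) for z, s in zip(head_outputs, shifts)])
# Path B: project as usual, then add one precomputed bias vector.
baked = add(project_heads(q_mats, head_outputs), iti_bias(q_mats, shifts))
```

Both paths give the same layer output, so at serving time only the bias addition remains.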

Reproducibility

Data Urls

  • TruthfulQA (public)
  • NaturalQuestions (public)
  • TriviaQA (public)
  • MMLU (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires access to internal activations and ability to modify attention outputs at inference.
  • Evaluated primarily on TruthfulQA; gains on other datasets are smaller and uneven.
  • Strong interventions can reduce helpfulness or increase non-answers like 'I have no comment.'
  • ITI shifts activations linearly; causal interpretation of directions is incomplete.

When Not To Use

  • You cannot read or modify model activations at inference (closed API).
  • Conversational helpfulness and diversity matter more to you than conservative truthfulness.
  • You lack any labeled examples to find truthful directions.

Failure Modes

  • Overcorrection: model gives less informative or evasive replies ('I have no comment').
  • Miscalibration: in some categories, previously correct answers flip to incorrect.
  • Direction misidentification when training data is too small or biased.
  • Intervention could amplify dataset-specific biases represented in the probe labels.

Core Entities

Models

  • LLaMA-7B
  • Alpaca
  • Vicuna
  • LLaMA (family)

Metrics

  • True*Informative
  • True (%)
  • Accuracy
  • Cross Entropy (CE)
  • KL divergence (next-token)

Datasets

  • TruthfulQA
  • Natural Questions
  • TriviaQA
  • MMLU
  • OpenWebText

Benchmarks

  • TruthfulQA
  • Natural Questions
  • TriviaQA
  • MMLU