Shift a few attention-head activations at inference to make LLMs answer more truthfully

June 6, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

ITI is a pragmatic, low-cost inference hack that boosts truthfulness on specific benchmarks, but it requires activation access and careful tuning to avoid harming helpfulness.

Citations: 39

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ITI is a low-cost way to reduce factual errors without heavy finetuning; it can be added to deployed models that expose activations to improve trustworthiness quickly.

Summary TLDR

The paper introduces Inference-Time Intervention (ITI): find a small set of attention-head directions correlated with truth and add a scaled shift to those head activations during generation. With lightweight supervision (hundreds of QA examples) and near-zero compute overhead, ITI raises truthfulness on the adversarial TruthfulQA benchmark (e.g., Alpaca true*informative 32.5% -> 65.1%) and gives modest gains on other QA datasets. There is a clear trade-off: stronger intervention increases truth but can reduce helpfulness or push the model toward non-answers. ITI requires access to internal activations and careful tuning of two hyperparameters (K heads, strength α).

Problem Statement

Large language models sometimes 'know' correct facts internally but produce false or misleading answers in generation. Can we steer a pretrained model at inference time—without expensive RL or full finetuning—to make it output more truthful answers using small labeled data?

Main Contribution

Define Inference-Time Intervention (ITI): during autoregressive decoding, add a scaled vector shift along 'truthful' directions in a small set of attention heads.

Show that linear probes reveal a gap between what intermediate representations encode and what the model generates; ITI closes part of that gap and boosts TruthfulQA scores significantly.
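The shift described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the array layout `(num_heads, head_dim)`, the function name `apply_iti_shift`, and the choice to scale each unit direction by the standard deviation of activations along it (`sigmas`) are assumptions made for the example.

```python
import numpy as np

def apply_iti_shift(head_out, directions, sigmas, selected, alpha):
    """Shift selected attention-head outputs along their 'truthful' directions.

    head_out:   (num_heads, head_dim) head outputs for one token at one layer
    directions: (num_heads, head_dim) one direction per head (normalized below)
    sigmas:     (num_heads,) std of activations along each direction (assumed scaling)
    selected:   indices of the K heads to intervene on
    alpha:      intervention strength
    """
    shifted = head_out.copy()
    for h in selected:
        unit = directions[h] / np.linalg.norm(directions[h])
        shifted[h] = head_out[h] + alpha * sigmas[h] * unit
    return shifted

# Toy data: 4 heads of width 8; intervene on heads 1 and 3 only.
rng = np.random.default_rng(0)
head_out = rng.normal(size=(4, 8))
directions = rng.normal(size=(4, 8))
sigmas = np.ones(4)
shifted = apply_iti_shift(head_out, directions, sigmas, selected=[1, 3], alpha=5.0)
```

Heads outside `selected` pass through unchanged, which is why K and α can be tuned independently: K controls how many heads move, α controls how far.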

Key Findings

ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.

Numbers: Alpaca true*informative 32.5% -> 65.1%

Practical Use: If you run ITI on Alpaca, expect true*informative on adversarial TruthfulQA to roughly double; tune α to balance truthfulness against helpfulness.

Evidence Ref: Table 2

For base LLaMA-7B, probing shows internal activations contain truth signals not shown in outputs.

Numbers: Probe vs generation gap ≈ 40% (probe higher)

Practical Use: You can locate truth-related directions by linear probes on head activations because the model encodes correctness internally.

Evidence Ref: Intro and subsection 3.2
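As a rough sketch of how linear probes can rank heads, the example below fits one logistic-regression probe per head and sorts heads by held-out accuracy. It runs on synthetic activations, and the helper names (`fit_logistic_probe`, `rank_heads`), the 80/20 split, and the plain gradient-descent optimizer are all assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())

def rank_heads(acts, labels, train_frac=0.8, seed=0):
    """acts: (num_examples, num_heads, head_dim); labels: 0/1 per example.

    Returns head indices sorted by held-out probe accuracy, and the accuracies.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    tr, va = idx[:cut], idx[cut:]
    accs = []
    for h in range(acts.shape[1]):
        w, b = fit_logistic_probe(acts[tr, h], labels[tr])
        accs.append(probe_accuracy(acts[va, h], labels[va], w, b))
    accs = np.array(accs)
    return np.argsort(-accs), accs

# Synthetic demo: only head 2 carries a label-dependent signal.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, 4, 8))
acts[:, 2, 0] += 3.0 * (2 * labels - 1)
order, accs = rank_heads(acts, labels)
```

In a real setup, `acts` would come from recorded head activations on labeled QA pairs; the top K entries of `order` are then the candidate heads for intervention.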

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
True*Informative (LLaMA-7B baseline) | 30.5% | - | - | TruthfulQA generation | Table 1 baseline | Table 1
True*Informative (LLaMA-7B + ITI) | 43.5% | 30.5% | +13.0 pp | TruthfulQA generation | ITI applied on LLaMA-7B using 5% training | Table 1

What To Try In 7 Days

Run probes on a small labeled set (50–500 QA pairs) to rank truth-related heads.

Implement mass-mean shift on the top K heads; sweep K and α on a dev holdout while tracking true*informative, CE, and KL.

Evaluate with a small human annotation pass on high-risk question categories to confirm reduced harmful errors.
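The steps above can be prototyped on toy data before touching a real model. The sketch below computes a mass-mean direction (mean activation of true examples minus mean of false ones) for a single head and tracks next-token KL divergence as α grows. The random readout matrix `W_out` is a hypothetical stand-in for the rest of the network, and the helper names are assumptions; a real sweep would also loop over K and report true*informative and CE on a dev holdout.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference between the mean activation of true and false examples."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, 8))
acts[:, 0] += 2.0 * labels  # toy signal: true examples are offset along dim 0

direction = mass_mean_direction(acts, labels)
unit = direction / np.linalg.norm(direction)
sigma = (acts @ unit).std()  # spread of activations along the direction

# Hypothetical readout standing in for everything after the head.
W_out = rng.normal(size=(8, 50))
x = acts[0]
base = softmax(x @ W_out)

# Sweep the intervention strength and watch how far the distribution moves.
for alpha in (0.0, 5.0, 15.0):
    shifted = softmax((x + alpha * sigma * unit) @ W_out)
    print(f"alpha={alpha:>4}: KL(base || shifted) = {kl_divergence(base, shifted):.3f}")
```

Logging KL next to the truthfulness metric makes the trade-off visible: a setting that raises true*informative but also drives KL sharply upward is a candidate for the evasive-answer failure mode.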

Optimization Features

Inference Optimization
Adds a constant vector per layer; near-zero runtime overhead
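One way to see why a constant shift costs almost nothing at runtime: because the projection after the heads is linear, the shift can be projected once and added as a bias. The sketch below checks this identity numerically. The output projection `W_o`, the dimensions, and the shift values are hypothetical; this illustrates the linear-algebra argument and is not a description of the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 4, 8, 16

# Hypothetical output projection from concatenated heads to the residual stream.
W_o = rng.normal(size=(num_heads * head_dim, d_model))
head_out = rng.normal(size=(num_heads, head_dim))

# Constant per-head shift: zero for untouched heads, nonzero for the chosen one.
shift = np.zeros((num_heads, head_dim))
shift[1] = 5.0 * rng.normal(size=head_dim)

# Option A: shift the head activations at every decoding step, then project.
out_explicit = (head_out + shift).reshape(-1) @ W_o

# Option B: project the shift once, then add it as a bias at every step.
bias = shift.reshape(-1) @ W_o
out_folded = head_out.reshape(-1) @ W_o + bias
```

Both options give the same output, so the per-token cost of Option B is a single vector addition per intervened layer.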

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

TruthfulQA (public), Natural Questions (public), TriviaQA (public), MMLU (public)

Risks & Boundaries

Limitations

Requires access to internal activations and ability to modify attention outputs at inference.

Evaluated primarily on TruthfulQA; gains on other datasets are smaller and uneven.

When Not To Use

You cannot read or modify model activations at inference (closed API).

When conversational helpfulness and diversity are higher priority than conservative truth.

Failure Modes

Overcorrection: model gives less informative or evasive replies ('I have no comment').

Miscalibration: in some categories, previously correct answers flip to incorrect.

Core Entities

Models

LLaMA-7B, Alpaca, Vicuna, LLaMA (family)

Metrics

True*Informative, True (%), Accuracy, Cross Entropy (CE), KL divergence (next-token)

Datasets

TruthfulQA, Natural Questions, TriviaQA, MMLU, OpenWebText

Benchmarks

TruthfulQA, Natural Questions, TriviaQA, MMLU