Shift a few attention-head activations at inference to make LLMs answer more truthfully

Overview

Decision Snapshot: Needs Validation

ITI is a pragmatic, low-cost inference hack that boosts truthfulness on specific benchmarks, but it requires activation access and careful tuning to avoid harming helpfulness.

Citations: 39

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 5/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

Why It Matters For Business

ITI is a low-cost way to reduce factual errors without heavy finetuning; it can be added to deployed models that expose activations to improve trustworthiness quickly.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

The paper introduces Inference-Time Intervention (ITI): find a small set of attention-head directions correlated with truth and add a scaled shift to those head activations during generation. With lightweight supervision (hundreds of QA examples) and near-zero compute overhead, ITI raises truthfulness on the adversarial TruthfulQA benchmark (e.g., Alpaca true*informative 32.5% -> 65.1%) and gives modest gains on other QA datasets. There is a clear trade-off: stronger intervention increases truth but can reduce helpfulness or push the model toward non-answers. ITI requires access to internal activations and careful tuning of two hyperparameters (K heads, strength α).

Problem Statement

Large language models sometimes 'know' correct facts internally but still generate false or misleading answers. Can a pretrained model be steered at inference time, without expensive RL or full finetuning, to output more truthful answers using only a small amount of labeled data?

Main Contribution

Define Inference-Time Intervention (ITI): during autoregressive decoding, add a scaled vector shift along 'truthful' directions in a small set of attention heads.

Show through probing that intermediate representations encode the truth more accurately than the generated text reflects; ITI closes part of that gap and significantly boosts TruthfulQA scores.
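A minimal sketch of the shift described above, assuming each selected head already has a unit-length 'truthful' direction and a per-head scale `sigma` (taken here to be the standard deviation of activations along that direction). The function name, the dictionary layout, and the use of `sigma` are illustrative assumptions, not details stated in this summary.

```python
from typing import Dict, List, Tuple

HeadKey = Tuple[int, int]  # (layer index, head index)


def apply_iti_shift(
    head_outputs: Dict[HeadKey, List[float]],
    directions: Dict[HeadKey, List[float]],
    sigmas: Dict[HeadKey, float],
    alpha: float,
) -> Dict[HeadKey, List[float]]:
    """Add alpha * sigma * direction to every selected head's output.

    Heads without an entry in `directions` are returned unchanged.
    """
    shifted = {}
    for key, activation in head_outputs.items():
        if key in directions:
            scale = alpha * sigmas[key]
            shifted[key] = [a + scale * d for a, d in zip(activation, directions[key])]
        else:
            shifted[key] = list(activation)
    return shifted


# Toy example: two heads in layer 0; only head (0, 1) is intervened on.
outputs = {(0, 0): [1.0, 2.0], (0, 1): [0.5, -0.5]}
directions = {(0, 1): [1.0, 0.0]}  # hypothetical unit direction
sigmas = {(0, 1): 2.0}             # hypothetical scale along that direction
shifted = apply_iti_shift(outputs, directions, sigmas, alpha=1.5)
print(shifted[(0, 1)])  # [3.5, -0.5]
```

In a real model the same addition would be applied at every decoding step, typically through a hook on the attention module; that wiring depends on the framework and is left out here.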

Key Findings

ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.

Numbers: Alpaca true*informative 32.5% -> 65.1%

Practical Use: If you run ITI on Alpaca, expect roughly a doubling of the measured truthfulness on adversarial TruthfulQA; tune α to balance helpfulness.

Evidence Ref: Table 2

For base LLaMA-7B, probing shows internal activations contain truth signals not shown in outputs.

Numbers: Probe vs generation gap ≈ 40% (probe higher)

Practical Use: You can locate truth-related directions by linear probes on head activations because the model encodes correctness internally.

Evidence Ref: Intro and subsection 3.2
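One way to rank heads by how well their activations separate truthful from false answers is sketched below. The summary refers to linear probes; to stay dependency-free, this sketch swaps in a class-mean-difference probe (a very simple linear classifier), and all names and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

HeadKey = Tuple[int, int]          # (layer index, head index)
Example = Tuple[List[float], int]  # (head activation, label: 1 = truthful, 0 = false)


def mean_vector(rows: Sequence[List[float]]) -> List[float]:
    """Column-wise mean of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def probe_accuracy(train: Sequence[Example], val: Sequence[Example]) -> float:
    """Fit a class-mean-difference probe on `train`; return its accuracy on `val`."""
    mu_true = mean_vector([x for x, y in train if y == 1])
    mu_false = mean_vector([x for x, y in train if y == 0])
    w = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    correct = 0
    for x, y in val:
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(val)


def rank_heads(
    train_by_head: Dict[HeadKey, Sequence[Example]],
    val_by_head: Dict[HeadKey, Sequence[Example]],
    top_k: int,
) -> List[HeadKey]:
    """Return the top_k heads ordered by validation probe accuracy."""
    scores = {h: probe_accuracy(train_by_head[h], val_by_head[h]) for h in train_by_head}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data: head (0, 0) separates the classes, head (0, 1) does not.
train = {
    (0, 0): [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([-1.0, 0.0], 0), ([-0.8, 0.2], 0)],
    (0, 1): [([0.1, 0.0], 1), ([-0.1, 0.0], 1), ([0.1, 0.0], 0), ([-0.1, 0.0], 0)],
}
val = {
    (0, 0): [([0.7, 0.0], 1), ([-0.6, 0.1], 0)],
    (0, 1): [([0.1, 0.0], 1), ([-0.1, 0.0], 0)],
}
top_heads = rank_heads(train, val, top_k=1)
print(top_heads)  # [(0, 0)]
```

With real activations, a logistic-regression probe per head would be the more usual choice; the ranking step stays the same.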

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
True*Informative (LLaMA-7B baseline)	30.5%	—	—	TruthfulQA generation	Table 1 baseline	Table 1
True*Informative (LLaMA-7B + ITI)	43.5%	30.5%	+13.0 pp	TruthfulQA generation	ITI applied on LLaMA-7B using 5% training	Table 1
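The delta column reports percentage points, which is easy to confuse with relative improvement. The short check below recomputes both from the figures quoted in this summary; the helper names are illustrative.

```python
def percentage_point_delta(baseline: float, value: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return value - baseline


def relative_gain(baseline: float, value: float) -> float:
    """Change expressed as a fraction of the baseline."""
    return (value - baseline) / baseline


# LLaMA-7B on TruthfulQA generation: 30.5% -> 43.5% (Table 1).
llama_pp = percentage_point_delta(30.5, 43.5)
llama_rel = relative_gain(30.5, 43.5)

# Alpaca true*informative: 32.5% -> 65.1% (Table 2).
alpaca_pp = percentage_point_delta(32.5, 65.1)
alpaca_ratio = 65.1 / 32.5

print(f"LLaMA-7B: +{llama_pp:.1f} pp, {llama_rel:.1%} relative")
print(f"Alpaca:   +{alpaca_pp:.1f} pp, {alpaca_ratio:.2f}x baseline")
```

So the +13.0 pp gain on LLaMA-7B is a relative improvement of roughly 43%, while the Alpaca result is almost exactly a doubling.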

What To Try In 7 Days

Run probes on a small labeled set (50–500 QA pairs) to rank truth-related heads.

Implement mass-mean shift on the top K heads; sweep K and α on a dev holdout while tracking true*informative, CE, and KL.

Evaluate with a small human annotation pass on high-risk question categories to confirm reduced harmful errors.
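The second step above can be sketched as follows: compute a mass-mean direction per head (the unit vector from the mean of false-answer activations to the mean of true-answer activations), then grid-search K and α under a KL budget. The `evaluate` callable, the KL budget, the grid values, and the toy metric formulas are all placeholders assumed for illustration; they are not results or settings from the paper.

```python
import math
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Metrics = Dict[str, float]


def mass_mean_direction(
    true_acts: Sequence[List[float]], false_acts: Sequence[List[float]]
) -> Tuple[List[float], float]:
    """Return (unit direction, std of all activations projected onto it).

    Assumes the two class means differ, so the norm is non-zero.
    """
    mu_true = [sum(col) / len(true_acts) for col in zip(*true_acts)]
    mu_false = [sum(col) / len(false_acts) for col in zip(*false_acts)]
    diff = [t - f for t, f in zip(mu_true, mu_false)]
    norm = math.sqrt(sum(d * d for d in diff))
    unit = [d / norm for d in diff]
    projections = [
        sum(u * x for u, x in zip(unit, row))
        for row in list(true_acts) + list(false_acts)
    ]
    mean_proj = sum(projections) / len(projections)
    sigma = math.sqrt(sum((p - mean_proj) ** 2 for p in projections) / len(projections))
    return unit, sigma


def sweep(
    evaluate: Callable[[int, float], Metrics],
    ks: Sequence[int],
    alphas: Sequence[float],
    max_kl: float,
) -> Optional[Tuple[int, float, Metrics]]:
    """Pick the (K, alpha) with the best true*informative whose KL stays within budget."""
    best = None
    for k, alpha in product(ks, alphas):
        metrics = evaluate(k, alpha)
        if metrics["kl"] > max_kl:
            continue  # intervention drifts too far from the original model
        if best is None or metrics["true_informative"] > best[2]["true_informative"]:
            best = (k, alpha, metrics)
    return best


def toy_evaluate(k: int, alpha: float) -> Metrics:
    """Placeholder with made-up formulas; replace with a real dev-set evaluation."""
    gain = k * alpha
    return {"true_informative": gain / (gain + 50.0), "ce": 2.0 + gain / 1000.0, "kl": gain / 100.0}


unit, sigma = mass_mean_direction([[2.0, 0.0], [4.0, 0.0]], [[-2.0, 0.0], [-4.0, 0.0]])
best = sweep(toy_evaluate, ks=[16, 48], alphas=[5.0, 10.0], max_kl=3.0)
print(unit, best[0], best[1])
```

Treating KL as a hard constraint rather than folding it into a single score keeps the trade-off explicit: the sweep never picks a setting that buys truthfulness by moving far from the original model's next-token distribution.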

Optimization Features

Inference Optimization

Adds a constant vector per layer; near-zero runtime overhead
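Why the overhead is close to zero can be seen with a small linear-algebra check. Assuming the standard multi-head attention layout, where head outputs are concatenated and passed through an output projection, shifting the heads before the projection equals adding one precomputed vector after it. The matrix values and names below are a toy illustration, not the model's actual weights.

```python
from typing import List

Matrix = List[List[float]]


def matvec(weights: Matrix, vec: List[float]) -> List[float]:
    """Multiply a (d_out x d_in) matrix by a length-d_in vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def layer_bias(w_out: Matrix, concat_shift: List[float]) -> List[float]:
    """Precompute the constant vector that the per-head shifts contribute after projection."""
    return matvec(w_out, concat_shift)


# Toy layer: 2 heads of width 2 (concatenated width 4), model width 2.
w_out = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, -1.0],
]
concat_heads = [0.5, 1.0, -1.0, 2.0]
concat_shift = [0.0, 0.0, 3.0, 0.0]  # only the second head is shifted

# Option A: shift head outputs, then project.
shift_then_project = matvec(w_out, [h + s for h, s in zip(concat_heads, concat_shift)])

# Option B: project unmodified heads, then add the precomputed bias.
bias = layer_bias(w_out, concat_shift)
project_then_add = [p + b for p, b in zip(matvec(w_out, concat_heads), bias)]

print(shift_then_project, project_then_add)  # [4.5, -1.0] [4.5, -1.0]
```

Because the bias depends only on the chosen directions, scales, and α, it can be computed once offline and added per layer at inference.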

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/likenneth/honest_llama

Data URLs

TruthfulQA (public), NaturalQuestions (public), TriviaQA (public), MMLU (public)

Risks & Boundaries

Limitations

Requires access to internal activations and ability to modify attention outputs at inference.

Evaluated primarily on TruthfulQA; gains on other datasets are smaller and uneven.

When Not To Use

You cannot read or modify model activations at inference (closed API).

When conversational helpfulness and diversity are higher priority than conservative truth.

Failure Modes

Overcorrection: model gives less informative or evasive replies ('I have no comment').

Mis-calibration: in some question categories, previously correct answers flip to incorrect.

Core Entities

Models

LLaMA-7B, Alpaca, Vicuna, LLaMA (family)

Metrics

True*Informative, True (%), Accuracy, Cross Entropy (CE), KL divergence (next-token)

Datasets

TruthfulQA, Natural Questions, TriviaQA, MMLU, OpenWebText

Benchmarks

TruthfulQA, Natural Questions, TriviaQA, MMLU
