Edit hidden activations with SVD to make LLMs more truthful and less biased at inference time

May 15, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

SEA is practical: compute the SVD on collected activations offline, then apply inexpensive projections at inference. Linear SEA is low-risk; Φ-SEA trades larger bias gains for some degradation in other skills and needs careful testing.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 75%

Novelty: 70%

Authors

Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SEA gives a low-cost way to reduce hallucinations and bias at inference time, letting teams improve trustworthiness without full model fine-tuning or heavy compute.

Summary TLDR

SEA (Spectral Editing of Activations) is a training-free, inference-time method that uses singular value decomposition (SVD) on cross-covariances of model activations to push activations toward ‘positive’ demonstrations and away from ‘negative’ ones. Linear SEA gives modest but consistent gains in truthfulness and fairness with very low compute overhead; a non-linear variant (Φ-SEA) gives larger bias fixes but can slightly degrade some other skills. SEA works with small demonstration sets (as few as 25 examples) and across multiple open-source LLM families.

Problem Statement

Large language models still produce hallucinations and biased outputs. Existing fixes need expensive fine-tuning or complex decoding tweaks. Can we change model behavior cheaply at inference time by editing internal activations so outputs become more truthful and fair without retraining?

Main Contribution

Introduce SEA: a training-free method that finds linear editing projections by SVD on cross-covariances between neutral, positive and negative activations.

Extend SEA to non-linear editing (Φ-SEA) via invertible feature maps and pseudo-inverses to capture non-linearly separable behaviors.
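The linear variant described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `keep_ratio` and `drop_ratio` parameters, the centering step, and the final norm rescaling are all hypothetical choices made here for clarity.

```python
import numpy as np

def rank_for_ratio(singular_values, ratio):
    """Smallest k whose leading singular values hold `ratio` of the total mass."""
    cum = np.cumsum(singular_values) / np.sum(singular_values)
    return int(min(np.searchsorted(cum, ratio) + 1, len(singular_values)))

def sea_projections(h_neutral, h_pos, h_neg, keep_ratio=0.99, drop_ratio=0.5):
    """Return (p_pos, p_neg), two (d, d) projection matrices.

    Rows are demonstrations, columns are hidden dimensions.
    keep_ratio / drop_ratio are illustrative knobs (assumptions).
    """
    n = h_neutral.shape[0]
    # Centre each activation matrix before forming cross-covariances.
    hc = h_neutral - h_neutral.mean(axis=0)
    pc = h_pos - h_pos.mean(axis=0)
    nc = h_neg - h_neg.mean(axis=0)
    cov_pos = hc.T @ pc / n
    cov_neg = hc.T @ nc / n
    u_pos, s_pos, _ = np.linalg.svd(cov_pos)
    u_neg, s_neg, _ = np.linalg.svd(cov_neg)
    # Keep the directions that co-vary most with positive demonstrations.
    top = u_pos[:, :rank_for_ratio(s_pos, keep_ratio)]
    p_pos = top @ top.T
    # Discard the directions that co-vary most with negative demonstrations.
    rest = u_neg[:, rank_for_ratio(s_neg, drop_ratio):]
    p_neg = rest @ rest.T
    return p_pos, p_neg

def sea_edit(h, p_pos, p_neg):
    """Apply both projections, then rescale each row to its original L2 norm."""
    edited = h @ p_pos + h @ p_neg  # projections are symmetric
    old = np.linalg.norm(h, axis=-1, keepdims=True)
    new = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * old / (new + 1e-8)
```

The projections are computed once offline, so the only inference-time cost in this sketch is two matrix multiplications per edited layer.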

Key Findings

Linear SEA raises MC1 truthfulness on TruthfulQA for LLaMA-2-chat-7B.

Numbers: MC1 36.96 → 39.41 (+2.45)

Practical Use: Use linear SEA on 7B chat models to get small but reliable truthfulness gains without retraining.

Evidence Ref: Table 1; Section 4.1

Φ-SEA (non-linear) greatly improves bias accuracy on BBQ for LLaMA-2-chat-7B.

Numbers: BBQ accuracy 43.02 → 56.17 (+13.15)

Practical Use: Apply Φ-SEA with squared-exponential features when fairness is the priority, but test other capabilities first.

Evidence Ref: Section 4.2; Table 3
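To make the non-linear idea concrete, the sketch below maps activations through a squared-exponential feature function, applies a projection in feature space, and maps back with a pseudo-inverse. Everything here is an assumption for illustration: the exact form of `phi`, the `sigma` parameter, the clipping, and in particular the trick of reusing the sign of the original activation are not taken from the paper. The point is only to show why such a round trip can lose information.

```python
import numpy as np

def phi(x, sigma=1.0):
    """Squared-exponential feature map (assumed form); even in x, so sign is lost."""
    return np.exp(-(x ** 2) / (2.0 * sigma ** 2))

def phi_pinv(y, sigma=1.0, sign=None):
    """Pseudo-inverse of phi: recovers only the magnitude unless a sign is supplied."""
    y = np.clip(y, 1e-12, 1.0)  # edited features may leave phi's range; clipping is lossy
    magnitude = sigma * np.sqrt(-2.0 * np.log(y))
    return magnitude if sign is None else sign * magnitude

def nonlinear_edit(h, projection, sigma=1.0):
    """Edit in feature space, then map back, reusing the original signs (assumption)."""
    z_edit = phi(h, sigma) @ projection
    return phi_pinv(z_edit, sigma, sign=np.sign(h))
```

In this sketch, `projection` would be computed from feature-mapped demonstration activations rather than raw ones. Any edited feature value that falls outside (0, 1] is clipped before inversion, which is one way a pseudo-inverse of this kind can degrade unrelated capabilities.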

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
TruthfulQA MC1 (LLaMA-2-Chat-7B) | 39.41 | 36.96 (ICL) | +2.45 | TruthfulQA multiple-choice | Table 1 (SEA N=2000, K=99.8%, L=21) | Table 1; Section 4.1
BBQ accuracy (LLaMA-2-Chat-7B) | 56.17 | 43.02 (ICL) | +13.15 | BBQ disambiguated evaluation | Section 4.2; Table 3 (Φ-SEA, squared-exponential) | Table 3; Section 4.2

What To Try In 7 Days

Collect 25–200 positive/negative demonstration pairs for your task.

Compute linear SEA projections (SVD on cross-covariances) for the top layers.

Apply edits on the last few MLP outputs and compare output accuracy and latency vs baseline ICL and LoRA-FT on a dev set.
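For the comparison step, a small helper like the one below can turn raw dev-set results into the two numbers that matter: accuracy delta and latency overhead. It is a hypothetical sketch; the function name and the input format (a list of per-example correctness flags and a list of per-example latencies in seconds) are assumptions, not part of any SEA codebase.

```python
from statistics import mean

def compare_runs(base_correct, base_latency_s, edit_correct, edit_latency_s):
    """Return (accuracy delta in points, latency overhead in percent).

    *_correct: list of bools, one per dev example.
    *_latency_s: list of per-example wall-clock latencies in seconds.
    """
    base_acc = 100.0 * mean(1.0 if c else 0.0 for c in base_correct)
    edit_acc = 100.0 * mean(1.0 if c else 0.0 for c in edit_correct)
    base_lat = mean(base_latency_s)
    edit_lat = mean(edit_latency_s)
    overhead_pct = 100.0 * (edit_lat - base_lat) / base_lat
    return edit_acc - base_acc, overhead_pct
```

Running the same helper against each baseline (ICL and LoRA-FT) on an identical dev set keeps the comparison like-for-like.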

Optimization Features

Token Efficiency
not constrained by model context length (offline SVD uses arbitrary number of demos)
Infra Optimization
works on standard GPU setups; minimal extra GPU needs compared to running the model alone
Model Optimization
edits activations via orthogonal projections (no weight change)
Training Optimization
training-free: compute SVD on collected activations
fast projection computation (e.g., 21-layer SVD ~2m32s on A100)
Inference Optimization
apply linear projections at inference with small latency (+≈3.7%)
editing limited to top L layers to trade off cost vs benefit
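Because the SVD runs offline on collected activations, the cross-covariance can be accumulated batch by batch, so the number of demonstrations is not bounded by the context window. The accumulator below is a minimal sketch of that idea; the class name and interface are hypothetical.

```python
import numpy as np

class CrossCovAccumulator:
    """Streams batches of paired activations and yields their cross-covariance."""

    def __init__(self, dim):
        self.n = 0
        self.sum_x = np.zeros(dim)
        self.sum_y = np.zeros(dim)
        self.sum_xy = np.zeros((dim, dim))

    def update(self, x, y):
        """x, y: arrays of shape (batch, dim) with row-aligned demonstrations."""
        self.n += x.shape[0]
        self.sum_x += x.sum(axis=0)
        self.sum_y += y.sum(axis=0)
        self.sum_xy += x.T @ y

    def covariance(self):
        """Centred cross-covariance: E[x y^T] - E[x] E[y]^T."""
        mean_x = self.sum_x / self.n
        mean_y = self.sum_y / self.n
        return self.sum_xy / self.n - np.outer(mean_x, mean_y)
```

Only three running sums are kept per layer, so memory stays at O(d²) regardless of how many demonstrations are processed.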

Risks & Boundaries

Limitations

Φ-SEA's pseudo-inverse feature transforms are not lossless and can reduce performance on some control tasks.

Requires paired positive and negative demonstrations; quality of these demos strongly affects results.

When Not To Use

You lack reliable positive/negative demonstrations for the target behavior.

Your application cannot tolerate any drop in downstream control tasks (e.g., math or commonsense) from non-linear edits.

Failure Modes

Overfitting to idiosyncratic patterns in the demonstration set, causing brittle edits.

Introducing new biases if demonstrations themselves are biased or unrepresentative.

Core Entities

Models

LLaMA-2-Chat-7B, LLaMA-2-13B, LLaMA-2-70B, Gemma-it-2B, Gemma-it-7B, Mistral-7B

Metrics

MC1, MC2, Accuracy, Unknown-answer rate, Bias score, Stereotypical response rate, Info, Truth, Info*Truth, Inference time

Datasets

TruthfulQA, HaluEval, BBQ, CrowS-Pairs, HellaSwag, Natural Questions, GSM8K, MathQA, MMLU, ToxiGen

Benchmarks

TruthfulQA, BBQ, CrowS-Pairs, HaluEval