Edit hidden activations with SVD to make LLMs more truthful and less biased at inference time

Overview

Decision Snapshot: Ready For Pilot

SEA is practical: compute the SVD on collected activations offline, then apply inexpensive projections at inference. Linear SEA is low-risk; Φ-SEA trades larger bias gains for some degradation in other skills and needs careful testing.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 75%

Novelty: 70%

Authors

Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SEA gives a low-cost way to reduce hallucinations and bias at inference time, letting teams improve trustworthiness without full model fine-tuning or heavy compute.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

SEA (Spectral Editing of Activations) is a training-free, inference-time method that uses singular value decomposition (SVD) on cross-covariances of model activations to push activations toward ‘positive’ demonstrations and away from ‘negative’ ones. Linear SEA gives modest but consistent gains in truthfulness and fairness with very low compute overhead; a non-linear variant (Φ-SEA) gives larger bias fixes but can slightly degrade some other skills. SEA works with small demonstration sets (as few as 25 examples) and across multiple open-source LLM families.
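To make the linear variant concrete, the sketch below computes cross-covariances between neutral and positive/negative activations, takes their SVD, and builds two projections. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the mean-centering, the `keep` threshold rule for choosing how many singular directions to retain, the summing of the two projected activations, and the norm rescaling are all illustrative choices, and the random arrays stand in for real model activations.

```python
import numpy as np

def cross_cov(a, b):
    """Empirical cross-covariance between two (n, d) activation matrices."""
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    return a_c.T @ b_c / (a.shape[0] - 1)

def rank_for_energy(singular_values, keep):
    """Smallest k whose top-k singular values hold a `keep` share of the total."""
    share = np.cumsum(singular_values) / singular_values.sum()
    k = int(np.searchsorted(share, keep)) + 1
    return min(k, len(singular_values))

def sea_projections(h_neutral, h_pos, h_neg, keep=0.998):
    """Return (p_pos, p_neg), two d x d projections (assumed construction)."""
    u_pos, s_pos, _ = np.linalg.svd(cross_cov(h_neutral, h_pos))
    u_neg, s_neg, _ = np.linalg.svd(cross_cov(h_neutral, h_neg))
    # Keep the directions that co-vary most with the positive demonstrations.
    top_pos = u_pos[:, :rank_for_energy(s_pos, keep)]
    # Drop the directions that co-vary most with the negative demonstrations.
    rest_neg = u_neg[:, rank_for_energy(s_neg, keep):]
    return top_pos @ top_pos.T, rest_neg @ rest_neg.T

def edit_activations(h, p_pos, p_neg, eps=1e-8):
    """Apply both projections, sum them, and restore each row's L2 norm."""
    edited = h @ p_pos + h @ p_neg
    scale = np.linalg.norm(h, axis=-1, keepdims=True) / (
        np.linalg.norm(edited, axis=-1, keepdims=True) + eps
    )
    return edited * scale

# Toy usage: random stand-ins for activations with hidden size 8.
rng = np.random.default_rng(0)
h_neutral = rng.normal(size=(50, 8))
h_pos = h_neutral + rng.normal(scale=0.5, size=(50, 8))
h_neg = h_neutral + rng.normal(scale=0.5, size=(50, 8))
p_pos, p_neg = sea_projections(h_neutral, h_pos, h_neg, keep=0.9)
edited = edit_activations(h_neutral[:3], p_pos, p_neg)
print(edited.shape)
```

The SVD runs once offline; at inference only the two matrix products in `edit_activations` are needed per edited layer, which is consistent with the low overhead described here.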

Problem Statement

Large language models still produce hallucinations and biased outputs. Existing fixes need expensive fine-tuning or complex decoding tweaks. Can we change model behavior cheaply at inference time by editing internal activations so outputs become more truthful and fair without retraining?

Main Contribution

Introduce SEA: a training-free method that finds linear editing projections by SVD on cross-covariances between neutral, positive and negative activations.

Extend SEA to non-linear editing (Φ-SEA) via invertible feature maps and pseudo-inverses to capture non-linearly separable behaviors.
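The non-linear variant can be pictured as a round trip through a feature map: transform the activation, edit it linearly in feature space, then map back. The sketch below uses tanh purely for illustration. The name `phi_edit`, the choice of tanh, and the clipping step are assumptions of this sketch rather than details confirmed by the source, and `proj` is any d x d projection supplied by the caller. It also shows one way such a round trip can lose information: values that saturate the feature map cannot be recovered exactly.

```python
import numpy as np

def phi_edit(h, proj, eps=1e-6):
    """Edit activations in tanh feature space, then map back (illustrative)."""
    z = np.tanh(h)                      # invertible feature map
    z_edited = z @ proj                 # linear edit in feature space
    # arctanh is only defined on (-1, 1), so clip before inverting.
    z_edited = np.clip(z_edited, -1.0 + eps, 1.0 - eps)
    return np.arctanh(z_edited)

identity = np.eye(3)                    # a no-op "edit" isolates the round trip
moderate = np.array([[0.5, -1.0, 2.0]])
saturated = np.array([[20.0, 0.0, -20.0]])

round_trip_ok = phi_edit(moderate, identity)       # recovers the input closely
round_trip_lossy = phi_edit(saturated, identity)   # large values are not recovered
print(round_trip_ok, round_trip_lossy)
```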

Key Findings

Linear SEA raises MC1 truthfulness on TruthfulQA for LLaMA-2-Chat-7B.

Numbers: MC1 36.96 → 39.41 (+2.45)

Practical Use: Use linear SEA on 7B chat models to get small but reliable truthfulness gains without retraining.

Evidence Ref: Table 1; Section 4.1

Φ-SEA (non-linear) greatly improves bias accuracy on BBQ for LLaMA-2-Chat-7B.

Numbers: BBQ accuracy 43.02 → 56.17 (+13.15)

Practical Use: Apply Φ-SEA with squared-exponential features when fairness is the priority, but test other capabilities first.

Evidence Ref: Section 4.2; Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TruthfulQA MC1 (LLaMA-2-Chat-7B)	39.41	36.96 (ICL)	+2.45	TruthfulQA multiple-choice	Table 1 (SEA N=2000, K=99.8%, L=21)	Table 1; Section 4.1
BBQ accuracy (LLaMA-2-Chat-7B)	56.17	43.02 (ICL)	+13.15	BBQ disambiguated evaluation	Section 4.2; Table 3 (Φ-SEA, squared-exponential)	Table 3; Section 4.2

What To Try In 7 Days

Collect 25–200 positive/negative demonstration pairs for your task.

Compute linear SEA projections (SVD on cross-covariances) for the top layers.

Apply edits on the last few MLP outputs and compare output accuracy and latency vs baseline ICL and LoRA-FT on a dev set.
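For the comparison step, a small harness that reports accuracy and mean latency per system is usually enough. The sketch below uses only the Python standard library; `evaluate`, `relative_overhead`, and the toy predictors are hypothetical placeholders for your own baseline and SEA-edited model calls, and exact-match scoring is an assumption that may not suit every task.

```python
import time
from statistics import mean

def evaluate(predict_fn, examples):
    """Return (accuracy, mean latency in seconds) over (prompt, gold) pairs."""
    correct, latencies = 0, []
    for prompt, gold in examples:
        start = time.perf_counter()
        prediction = predict_fn(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == gold)
    return correct / len(examples), mean(latencies)

def relative_overhead(edited_latency, baseline_latency):
    """Latency increase of the edited system as a fraction of the baseline."""
    return (edited_latency - baseline_latency) / baseline_latency

# Hypothetical stand-ins; replace with real model calls on your dev set.
dev_set = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d")]

def baseline_predict(prompt):
    return {"q1": "a", "q2": "b"}.get(prompt, "?")

def edited_predict(prompt):
    return {"q1": "a", "q2": "b", "q3": "c"}.get(prompt, "?")

base_acc, base_lat = evaluate(baseline_predict, dev_set)
edit_acc, edit_lat = evaluate(edited_predict, dev_set)
print(f"accuracy: {base_acc:.2f} -> {edit_acc:.2f}")
```

With real models, run each system several times and compare `relative_overhead(edit_lat, base_lat)` against your latency budget; single-run timings on toy functions like these are too noisy to be meaningful.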

Optimization Features

Token Efficiency

not constrained by model context length (offline SVD uses arbitrary number of demos)

Infra Optimization

works on standard GPU setups; minimal extra GPU needs compared to running model

Model Optimization

edits activations via orthogonal projections (no weight change)

Training Optimization

training-free: compute SVD on collected activations

fast projection computation (e.g., 21-layer SVD ~2m32s on A100)

Inference Optimization

apply linear projections at inference with small latency overhead (≈ +3.7%)

editing limited to top L layers to trade off cost vs benefit

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/yfqiu-nlp/sea-llm

Data URLs

https://github.com/sylinrl/TruthfulQA (TruthfulQA)

https://github.com/RUCAIBox/HaluEval (HaluEval)

https://github.com/nyu-mll/BBQ (BBQ)

Risks & Boundaries

Limitations

Φ-SEA's pseudo-inverse feature transforms are not lossless and can reduce performance on some control tasks.

Requires paired positive and negative demonstrations; quality of these demos strongly affects results.

When Not To Use

You lack reliable positive/negative demonstrations for the target behavior.

Your application cannot tolerate any drop in downstream control tasks (e.g., math or commonsense) from non-linear edits.

Failure Modes

Overfitting to idiosyncratic patterns in the demonstration set, causing brittle edits.

Introducing new biases if demonstrations themselves are biased or unrepresentative.

Core Entities

Models

LLaMA-2-Chat-7B, LLaMA-2-13B, LLaMA-2-70B, Gemma-it-2B, Gemma-it-7B, Mistral-7B

Metrics

MC1, MC2, Accuracy, Unknown-answer rate, Bias score, Stereotypical response rate, Info, Truth, Info*Truth, Inference time

Datasets

TruthfulQA, HaluEval, BBQ, CrowS-Pairs, HellaSwag, Natural Questions, GSM8K, MathQA, MMLU, ToxiGen

Benchmarks

TruthfulQA, BBQ, CrowS-Pairs, HaluEval

