BiPO: optimize single-layer activation vectors to steer LLM behavior both ways

May 28, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen

Why It Matters For Business

BiPO gives a cheap, flexible way to shift model behavior without weight updates: personalize or harden models quickly, reuse vectors across similar models, and combine vectors for new behaviors while keeping knowledge performance intact.

Summary TLDR

The paper introduces Bi-directional Preference Optimization (BiPO), a lightweight method that learns a single activation steering vector. Adding the vector at one transformer layer pushes a frozen LLM toward a behavior; subtracting it pushes the model away. BiPO optimizes the vector on contrastive response pairs so that it raises the model's probability of the preferred response and lowers the probability of the opposite one. Experiments on Llama-2-7b-chat-hf and Mistral-7B-Instruct-v0.2 show stronger, more controllable steering than prior activation-difference methods, transfer to related and LoRA-fine-tuned models, additive composition of vectors, and limited impact on knowledge (MMLU). BiPO can both enable and defend against jailbreaking.

Problem Statement

Existing steering vectors are often built from raw activation differences on paired prompts. These can fail because the responses appended to the prompts do not match what the model actually generates, so the extracted vectors represent real generation poorly, especially for alignment-critical behaviors. The paper asks: can we optimize a small steering vector directly for generation preference so that it better represents the target behavior and is controllable, transferable, and cheap to apply?

Main Contribution

BiPO: a method that optimizes a single-layer activation vector to increase generation probability of target responses and decrease opposite ones.

Comprehensive empirical tests showing stronger steering than contrastive activation addition (CAA) and a freeform baseline across personas, truthfulness, hallucination, and jailbreaking.

Demonstrations of transferability across models and LoRA-fine-tuned variants, and of vector composition (adding vectors yields combined behaviors).
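
The first contribution can be made concrete with a small numeric sketch. It assumes a DPO-style logistic loss with a direction variable d in {-1, +1}; the paper's exact formulation may differ. The function name, argument names, and default beta are illustrative and not taken from the authors' code.

```python
import math

def bipo_pair_loss(logp_target_steered, logp_target_base,
                   logp_opposite_steered, logp_opposite_base,
                   d=1, beta=0.1):
    """Loss for one contrastive pair under steering direction d (+1 or -1).

    Each logp_* is the summed log-probability of a response, computed with
    the vector d*v added at the chosen layer ("steered") or without it ("base").
    Assumed DPO-style form: -log(sigmoid(d * beta * (target gain - opposite gain))).
    """
    if d not in (1, -1):
        raise ValueError("d must be +1 or -1")
    target_gain = logp_target_steered - logp_target_base
    opposite_gain = logp_opposite_steered - logp_opposite_base
    margin = d * beta * (target_gain - opposite_gain)
    # Numerically stable -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy log-probabilities (illustrative values only)
loss_no_change = bipo_pair_loss(-5.0, -5.0, -7.0, -7.0)
loss_good_vector = bipo_pair_loss(-4.0, -5.0, -8.0, -7.0, d=1)
```

With a zero vector the steered and base log-probabilities coincide, so the loss sits at log 2; a vector that helps the target response and hurts the opposite one under d = +1 lowers it. Sampling d = -1 during training asks the negated vector to do the reverse, which is what makes the learned vector usable in both directions.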

Key Findings

Optimized steering vectors from BiPO produce a wider and more controllable range of persona steering than prior methods.

BiPO can enable and disable jailbreaking: adding the learned vector raised the attack success rate (ASR) on malicious prompts from 0% to 73%; subtracting the vector dropped ASR on adversarial-suffix attacks from 16% to 0%.

Numbers: ASR with +v*: 73% (initial model: 0%); adversarial-suffix ASR with -v*: 0% (initial model: 16%)
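
For readers unfamiliar with the metric, ASR is the share of attack prompts whose responses a judge labels as successful jailbreaks. The sketch below assumes one boolean judgment per prompt; the function name and the label counts are hypothetical, chosen only to reproduce the reported percentages, and do not reflect the paper's test-set size.

```python
def attack_success_rate(judged_harmful):
    """Return ASR as a percentage, given one boolean judgment per attack prompt."""
    labels = list(judged_harmful)
    if not labels:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(bool(x) for x in labels) / len(labels)

# Hypothetical judge labels for illustration only
labels_initial = [False] * 100
labels_plus_v = [True] * 73 + [False] * 27

asr_initial = attack_success_rate(labels_initial)
asr_plus_v = attack_success_rate(labels_plus_v)
```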

Applying persona steering vectors caused negligible change in academic-knowledge performance (MMLU).

Numbers: MMLU baseline ~0.459; variations ≤0.005 across tested multipliers

Steering vectors trained on Llama-2-7b-chat-hf transferred to Vicuna-7b and to a LoRA-fine-tuned Llama2-Chinese-7b-Chat (even with Chinese inputs).

Different steering vectors can be added together and often preserve both behaviors or produce fused behavior.

Numbers: Combined vectors retained or increased persona scores (see Table 6)
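
Composition as described here amounts to elementwise addition of vectors that share the same hidden size. The following is a minimal sketch with toy three-dimensional vectors; `combine_vectors` and its optional `weights` argument are hypothetical helpers, not part of the paper.

```python
def combine_vectors(*vectors, weights=None):
    """Elementwise weighted sum of same-length steering vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share the same hidden size")
    if weights is None:
        weights = [1.0] * len(vectors)
    if len(weights) != len(vectors):
        raise ValueError("need exactly one weight per vector")
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Toy stand-ins for two learned persona vectors (real ones match the model's hidden size)
v_a = [0.5, -1.0, 0.0]
v_b = [0.25, 0.5, 2.0]
combined = combine_vectors(v_a, v_b)
```

The combined vector is then applied at the same layer as a single steering vector would be.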

Results

Jailbreaking Attack Success Rate (ASR)

Value: 73% with +v* on malicious questions

Baseline: initial model 0%

Jailbreaking ASR with adversarial suffix

Value: 0% with -v*

Baseline: initial model 16%

Accuracy (MMLU)

Value: 0.454–0.460 with multipliers of -1 and +1

Baseline: ~0.459 (multiplier 0)

Persona steering (GPT-4 1-4 score)

Value: BiPO delivers a broader steering range than CAA and Freeform across tested personas

Baseline: CAA and Freeform give a narrower range

Who Should Care

What To Try In 7 Days

Run BiPO on a small preference-pair dataset to get a steering vector for one targeted behavior.

Apply the vector at a middle layer and sweep multipliers (-2 to +2) to inspect intensity and side effects.

Evaluate with a reliable judge (GPT-4 or human raters) on a held-out set for both success and safety risks (jailbreak tests).
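
The multiplier sweep in the second step can be organized as a small harness. This is a sketch under assumptions: `score_fn` is a hypothetical callable that generates with the vector scaled by the given multiplier and returns an average judge score; here a toy function clipped to a 1-4 range stands in for it.

```python
def sweep_multipliers(score_fn, low=-2.0, high=2.0, step=0.5):
    """Call score_fn(multiplier) on an evenly spaced grid; return (multiplier, score) pairs."""
    if step <= 0 or high < low:
        raise ValueError("need step > 0 and high >= low")
    n_steps = int(round((high - low) / step))
    return [(low + i * step, score_fn(low + i * step)) for i in range(n_steps + 1)]

def toy_score(multiplier):
    """Stand-in for a real judge: a linear trend clipped to the 1-4 scale."""
    return max(1.0, min(4.0, 2.5 + 0.75 * multiplier))

results = sweep_multipliers(toy_score)
for multiplier, score in results:
    print(f"multiplier={multiplier:+.1f}  score={score:.2f}")
```

Inspecting the full curve rather than a single multiplier makes it easier to spot saturation or degraded outputs at the extremes.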

Agent Features

Frameworks

  • DPO
  • CAA
  • LoRA

Architectures

  • transformer activation-space steering (single layer)

Optimization Features

Token Efficiency

  • no extra demonstration tokens needed

Infra Optimization

  • LoRA

System Optimization

  • single-layer intervention to reduce runtime impact

Training Optimization

  • optimize small steering vector (AdamW) instead of model weights
  • batch size 4, low compute (single A100)
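
To show what optimizing only the vector involves, here is one AdamW update written out on a plain list. This is a sketch: the hyperparameter defaults are common library defaults rather than the paper's settings, the gradient is a toy value, and in practice a framework optimizer would handle this step.

```python
import math

def adamw_step(vec, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; t is the 1-based step count. Returns (new_vec, new_m, new_v)."""
    if t < 1:
        raise ValueError("t must start at 1")
    new_vec, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(vec, grad, m, v):
        p = p * (1.0 - lr * weight_decay)            # decoupled weight decay
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_vec.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_vec, new_m, new_v

# Toy two-dimensional steering vector, zero-initialized, with a made-up gradient
vec, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
vec, m, v = adamw_step(vec, [2.0, -3.0], m, v, t=1, weight_decay=0.0)
```

Only the vector and its two moment buffers are updated, so optimizer state scales with the hidden size rather than with the model's parameter count.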

Inference Optimization

  • broadcast-add vector to activations at one layer
  • control intensity by scaling multiplier
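
The two inference steps above can be sketched on plain lists. The example assumes activations are stored as one hidden-state list per token position; `steer_activations` is a hypothetical helper, and a real implementation would apply the same arithmetic to the output tensor of the chosen layer.

```python
def steer_activations(activations, vector, multiplier=1.0):
    """Add multiplier * vector to the hidden state at every token position."""
    dim = len(vector)
    for hidden in activations:
        if len(hidden) != dim:
            raise ValueError("vector must match the model's hidden size")
    return [[x + multiplier * s for x, s in zip(hidden, vector)]
            for hidden in activations]

# Toy example: 2 token positions, hidden size 2
acts = [[1.0, 2.0], [3.0, 4.0]]
steering_vector = [0.5, -1.0]

toward = steer_activations(acts, steering_vector, multiplier=2.0)
away = steer_activations(acts, steering_vector, multiplier=-1.0)
```

A positive multiplier pushes toward the behavior, a negative one pushes away, and a multiplier of 0 leaves the activations unchanged.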

Reproducibility

Data Urls

  • Anthropic Model-Written datasets (public)
  • TruthfulQA (public)
  • AdvBench (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Steering is implemented on a single transformer layer; multi-layer designs may be stronger but are unexplored.
  • Evaluation relies heavily on GPT-4 judgments, which can introduce bias.
  • The method produces artifacts that can be abused to jailbreak models; safety handling is required.
  • Transferability was tested for models sharing architecture and activation size; cross-architecture transfer is untested.

When Not To Use

  • When you need provable, formally verified safety guarantees.
  • When you must change deep internal capabilities that require weight updates.
  • If you cannot access intermediate activations or inject activation perturbations at inference.

Failure Modes

  • Vector over-amplification can create extreme, undesirable behavior.
  • Poor training pairs yield ineffective or misleading steering vectors.
  • Attackers could repurpose vectors to increase harmful outputs.
  • Layer selection choices strongly affect effectiveness and can fail silently.

Core Entities

Models

  • Llama-2-7b-chat-hf
  • Mistral-7B-Instruct-v0.2
  • Vicuna-7b-v1.5
  • Llama2-Chinese-7b-Chat

Metrics

  • Attack Success Rate (ASR)
  • GPT-4 persona score (1-4)
  • Accuracy

Datasets

  • Anthropic Model-Written (Advanced AI Risk personas)
  • TruthfulQA
  • Unprompted Hallucination (Rimsky et al.)
  • AdvBench
  • MMLU

Benchmarks

  • TruthfulQA
  • AdvBench
  • MMLU