Moderate WANDA pruning (10–20%) increases jailbreak resistance of 7B LLMs without fine-tuning

January 19, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

3

Authors

Adib Hasan, Ileana Rugina, Alex Wang

Links

Abstract / PDF

Why It Matters For Business

Pruning attention weights at modest sparsity (10–20%) is a low-cost safety lever: it can raise refusal rates to harmful prompts and shrink model size without extra fine-tuning or big performance loss.

Summary TLDR

The authors show that moderate WANDA pruning of attention weights (roughly 10–20%) on 7B models can raise refusal rates to malicious prompts without any further fine-tuning and while keeping benchmark performance nearly intact. They curated 225 malicious tasks (2,250 prompt samples), tested LLaMA-2-7B-Chat, Vicuna-1.3-7B, and Mistral-7B Instruct, and analyzed attention entropy and perplexity shifts to argue pruning acts as a regularizer that helps detect unnatural jailbreak constructs.

Problem Statement

How does model compression affect an aligned LLM's susceptibility to jailbreak attacks? The paper asks whether pruning (WANDA) can increase jailbreak resistance without additional training and without degrading standard task performance.

Main Contribution

Curated a safety-focused dataset: 225 malicious tasks split into 5 categories and embedded in 10 jailbreak templates (2,250 samples).

Showed attention-layer WANDA pruning (10–20%) raises refusal rates on 7B models without fine-tuning, with peak benefits at ~20% sparsity.

Benchmarked pruned models on common tasks and WikiText perplexity and found no major capability drops at moderate sparsity.

Interpreted effects via attention entropy, an IgnoreJailbreak attention metric, and perplexity shifts on jailbreak templates.

Validated WANDA's regularizing effect in linear OLS regressions with correlated features (reduced test MSE).
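
The summary does not spell out how WANDA selects weights to remove. As a minimal sketch, assuming the scoring rule from the original Wanda method (importance = weight magnitude times the L2 norm of the matching input feature over a calibration set, compared within each output row), a toy version looks like the following. The function name `wanda_prune` and the tiny matrices are illustrative assumptions, not the authors' code.

```python
import math

def wanda_prune(weights, activations, sparsity):
    """Zero the lowest-scoring fraction of weights in each output row.

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (n_samples, in_features)
    sparsity:    fraction of weights to remove per row, e.g. 0.2
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # Importance score: |weight| * input-feature norm.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the n_prune lowest-scoring weights in this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy layer: 2 output rows, 5 input features (hypothetical values).
W = [[0.5, -0.1, 0.3, 0.9, 0.2],
     [0.2, 0.8, -0.5, 0.05, 0.6]]
# Toy calibration batch: feature 1 carries much larger activations.
X = [[1.0, 10.0, 1.0, 1.0, 1.0],
     [2.0, 10.0, 0.5, 1.0, 1.0]]

print(wanda_prune(W, X, sparsity=0.2))
```

In the first row the smallest-magnitude weight (-0.1) survives because its input feature is strongly activated, and 0.2 is removed instead. That difference from plain magnitude pruning is the point of the activation-aware score.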

Key Findings

Moderate attention-layer WANDA pruning increases refusal rates to jailbreak prompts.

Numbers: LLaMA-2: average +8.5% refusal rate across five categories

Safety benefits peak near 20% sparsity and reverse after heavier pruning.

Numbers: Resistance improves up to 20% sparsity, then degrades by 30% sparsity

Benchmark and language modeling performance stay largely intact at moderate sparsity.

Numbers: WikiText perplexity: 6.943 (base) → 7.158 (20% prune); leaderboard scores show small deltas

Pruning can defend against an automated adversarial attack (GCG) in a single-model setup.

Numbers: GCG on LLaMA-2: 0 successes/10 at 30% sparsity (p=0.03)

Pruned models show sharper attention and higher perplexity for artificial jailbreak constructs.

Numbers: Attention entropy reduction and IgnoreJailbreak metric both peak near 20% sparsity
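
To make "sharper attention" concrete, the sketch below computes the Shannon entropy of a single attention distribution; lower entropy means the weight is concentrated on fewer tokens. How the paper aggregates entropy across heads, layers, and prompts is not stated here, so the function and the two example distributions are assumptions for illustration only.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is assumed to be one softmax row that sums to 1.
    """
    return -sum(p * math.log(p) for p in weights if p > 0.0)

# Hypothetical attention rows over four tokens.
diffuse = [0.25, 0.25, 0.25, 0.25]   # spread evenly, maximum entropy
sharp = [0.85, 0.05, 0.05, 0.05]     # concentrated on one token

print(f"diffuse: {attention_entropy(diffuse):.3f} nats")
print(f"sharp:   {attention_entropy(sharp):.3f} nats")
```

An entropy reduction after pruning would show up as values moving from the first case toward the second.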

Results

Refusal rate (safety)

Value: LLaMA-2: avg +8.5% refusal (post-pruning)

Baseline: unpruned LLaMA-2 Chat

Perplexity (WikiText)

Value: 6.943 → 7.158 (20% sparsity)

Baseline: unpruned LLaMA-2 (6.943)

GCG adversarial success (single-model)

Value: 0 successes / 10 (LLaMA-2 at 30% sparsity)

Baseline: 4 successes / 10 (unpruned LLaMA-2)

Refusals on embedded AdvBench prompts

Value: 5706 refusals out of 5720 at 20% sparsity (base model: 5699)

Baseline: base model, 5699 refusals out of 5720

Attention entropy / IgnoreJailbreak

Value: Entropy reduced and IgnoreJailbreak peaks at ~20% sparsity

Baseline: unpruned attention entropy
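
The reported perplexity and AdvBench figures are easier to compare as relative changes. The short calculation below uses only the numbers listed in this section; the variable names are illustrative.

```python
# WikiText perplexity, unpruned vs. 20% sparsity (figures from this section).
base_ppl, pruned_ppl = 6.943, 7.158
rel_ppl_increase = (pruned_ppl - base_ppl) / base_ppl

# Refusals on embedded AdvBench prompts (figures from this section).
total = 5720
base_refusals, pruned_refusals = 5699, 5706
base_rate = base_refusals / total
pruned_rate = pruned_refusals / total

print(f"perplexity increase: {rel_ppl_increase:.1%}")
print(f"refusal rate: {base_rate:.2%} -> {pruned_rate:.2%}")
print(f"non-refusals: {total - base_refusals} -> {total - pruned_refusals}")
```

On these figures, perplexity rises by about 3.1%, while non-refused AdvBench prompts fall from 21 to 14 out of 5720.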

Who Should Care

What To Try In 7 Days

Run attention-layer WANDA pruning at 10% and 20% on a copy of your 7B model and keep the original.

Measure refusal rate on an internal set of risky prompts and compare to unpruned baseline.

Validate core business benchmarks (MMLU, task-specific tests) to confirm capability is preserved (look for small deltas against the unpruned baseline).
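
The comparison step above can be sketched as a small harness. Everything here is an assumption for illustration: the paper used an LLM judge, whereas this sketch substitutes a crude keyword check, and `baseline_model` and `pruned_model` are stand-in functions where your own generation calls would go.

```python
# Hypothetical refusal markers; a keyword check is a rough proxy, not the paper's judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "i won't")

def is_refusal(response):
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(generate, prompts):
    """Fraction of prompts for which `generate(prompt)` is judged a refusal."""
    responses = [generate(p) for p in prompts]
    return sum(is_refusal(r) for r in responses) / len(responses)

def compare(baseline_generate, pruned_generate, prompts):
    """Refusal rates for both models on the same prompts, plus the difference."""
    base = refusal_rate(baseline_generate, prompts)
    pruned = refusal_rate(pruned_generate, prompts)
    return {"baseline": base, "pruned": pruned, "delta": pruned - base}

# Stand-in generators; replace with calls to the unpruned and pruned models.
def baseline_model(prompt):
    if prompt.endswith("A"):
        return "I'm sorry, but I can't help with that."
    return "Sure, here is an outline."

def pruned_model(prompt):
    return "I'm sorry, but I can't help with that."

prompts = ["risky prompt A", "risky prompt B"]  # placeholder internal set
print(compare(baseline_model, pruned_model, prompts))
```

Running both models on the identical prompt list keeps the comparison paired. For production decisions, a stronger judge than keyword matching, plus manual spot checks, is advisable.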

Optimization Features

Infra Optimization

  • lower memory footprint from pruned weights

Model Optimization

  • WANDA pruning (attention-layer, 10–30% sparsity)

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments limited to 7B models; effects on larger models are unknown.
  • Only WANDA pruning evaluated; other compression methods may differ.
  • Curated malicious tasks are hypothetical; real-world prompts may differ.
  • LLM judge was a fine-tuned ChatGPT-3.5 model, which can introduce label bias.
  • GCG automated test used only 10 AdvBench examples due to compute limits.

When Not To Use

  • Do not rely on pruning as the sole safety measure for poorly aligned models.
  • Avoid pruning >30% sparsity in production as it can reduce safety and capabilities.
  • Not a substitute for human moderation for high-risk deployments.

Failure Modes

  • Over-pruning (>30%) reduces alignment and can increase harmful outputs.
  • Models with little or no prior safety training (e.g., no RLHF) may see little benefit.
  • Safety gains may not transfer to different model sizes or architectures.

Core Entities

Models

  • LLaMA-2-7B-Chat
  • Vicuna-1.3-7B
  • Mistral-Instruct-v0.2-7B
  • WANDA pruning (attention-layer)

Metrics

  • refusal rate
  • perplexity
  • attention entropy
  • IgnoreJailbreak metric
  • Accuracy
  • MSE (linear models)

Datasets

  • Custom malicious tasks dataset (225 tasks, 2,250 samples)
  • AdvBench harmful behavior dataset
  • WikiText (perplexity)
  • Open LLM Leaderboard tasks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA)

Benchmarks

  • Open LLM Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA)
  • AdvBench

Context Entities

Models

  • LLaMA-2 base models and RLHF-aligned versions
  • Vicuna and Mistral fine-tuned variants

Metrics

  • p-values for GCG attack experiments
  • per-sparsity attention/head statistics

Datasets

  • AdvBench (for automated attacks)
  • AltQA (effective context length test)

Benchmarks

  • Open LLM Leaderboard components