Moderate WANDA pruning (10–20%) increases jailbreak resistance of 7B LLMs without fine-tuning

January 19, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

Results are practical: modest attention pruning gives measurable safety gains on 7B models without fine-tuning and with small benchmark impact; evidence is limited to 7B models, one pruning method, and curated datasets.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Adib Hasan, Ileana Rugina, Alex Wang

Why It Matters For Business

Pruning attention weights at modest sparsity (10–20%) is a low-cost safety lever: it can raise refusal rates to harmful prompts and shrink model size without extra fine-tuning or big performance loss.

Summary TLDR

The authors show that moderate WANDA pruning of attention weights (roughly 10–20%) on 7B models can raise refusal rates to malicious prompts without any further fine-tuning and while keeping benchmark performance nearly intact. They curated 225 malicious tasks (2,250 prompt samples), tested LLaMA-2-7B-Chat, Vicuna-1.3-7B, and Mistral-7B Instruct, and analyzed attention entropy and perplexity shifts to argue pruning acts as a regularizer that helps detect unnatural jailbreak constructs.
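
The attention-entropy analysis mentioned above can be illustrated with a small sketch. It computes the Shannon entropy of each attention row (a probability distribution over keys). How the paper aggregates entropy across heads and layers is not stated in this summary, so the function name `attention_entropy` and the toy inputs are assumptions.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each row of an attention matrix.

    attn: array of shape (..., n_queries, n_keys), each row summing to 1.
    eps:  small constant to avoid log(0); an illustrative choice.
    """
    attn = np.asarray(attn, dtype=float)
    return -(attn * np.log(attn + eps)).sum(axis=-1)

uniform = np.full((1, 4), 0.25)            # attention spread evenly over 4 keys
peaked = np.array([[1.0, 0.0, 0.0, 0.0]])  # attention concentrated on one key

h_uniform = attention_entropy(uniform)[0]  # close to log(4)
h_peaked = attention_entropy(peaked)[0]    # close to 0
```

Lower entropy means attention is concentrated on fewer tokens; higher entropy means it is spread more evenly.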

Problem Statement

How does model compression affect an aligned LLM's susceptibility to jailbreak attacks? The paper asks whether pruning (WANDA) can increase jailbreak resistance without additional training and without degrading standard task performance.

Main Contribution

Curated a safety-focused dataset: 225 malicious tasks split into 5 categories and embedded in 10 jailbreak templates (2,250 samples).

Showed attention-layer WANDA pruning (10–20%) raises refusal rates on 7B models without fine-tuning, with peak benefits at ~20% sparsity.
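
For orientation, here is a minimal NumPy sketch of the WANDA criterion as it is commonly described: each weight is scored by its magnitude times the L2 norm of the corresponding input feature over a calibration batch, and the lowest-scoring weights in each output row are zeroed. The function name `wanda_prune`, the toy shapes, and the random calibration data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def wanda_prune(weight, calib_inputs, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight:       (out_features, in_features) matrix of a linear layer
    calib_inputs: (n_tokens, in_features) calibration activations
    sparsity:     fraction of weights to remove per row, e.g. 0.2
    """
    # Per-input-feature activation norm over the calibration tokens.
    feature_norms = np.linalg.norm(calib_inputs, axis=0)
    # WANDA-style score: |W_ij| * ||X_j||_2.
    scores = np.abs(weight) * feature_norms[None, :]
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune == 0:
        return pruned
    # Column indices of the n_prune smallest scores within each row.
    drop = np.argpartition(scores, n_prune - 1, axis=1)[:, :n_prune]
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 20))    # toy stand-in for an attention projection
X = rng.normal(size=(64, 20))   # toy calibration activations
W_pruned = wanda_prune(W, X, sparsity=0.2)
row_zeros = (W_pruned == 0).sum(axis=1)  # zeros per output row
```

No retraining step follows the masking, which matches the "without fine-tuning" setting described here. On a real model, the same scoring would be applied to the attention projection matrices only.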

Key Findings

Moderate attention-layer WANDA pruning increases refusal rates to jailbreak prompts.

Numbers: LLaMA-2: average +8.5% refusal rate across five categories

Practical Use: Try 10–20% attention pruning to improve refusal behavior for safety-trained 7B models before doing extra fine-tuning.

Evidence Ref: Introduction; Section 4.1; Figure 1

Safety benefits peak near 20% sparsity and reverse after heavier pruning.

Numbers: Resistance improves up to 20% sparsity, then degrades at 30% sparsity

Practical Use: Avoid aggressive pruning (>30%); tune sparsity around 10–20% and monitor refusals and benchmarks.

Evidence Ref: Figures 3 and 5; Section 4.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Refusal rate (safety) | LLaMA-2: avg +8.5% refusal (post-pruning) | unpruned LLaMA-2 Chat | +8.5% (average across five categories) | Custom malicious tasks (225 tasks, 2,250 samples) | Introduction; Section 4.1; Figure 1 | Figure 1
Perplexity (WikiText) | 6.943 to 7.158 (20% sparsity) | unpruned LLaMA-2 (6.943) | +0.215 | WikiText perplexity | Table 1; Section 4.3 | Table 1
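
As a quick check on the perplexity row, the absolute and relative changes can be recomputed from the two reported values. The relative figure is derived here and is not stated in the source.

```python
baseline_ppl = 6.943  # unpruned LLaMA-2, as reported
pruned_ppl = 7.158    # 20% sparsity, as reported

delta = pruned_ppl - baseline_ppl  # absolute increase in perplexity
relative = delta / baseline_ppl    # increase as a fraction of the baseline

print(f"delta = {delta:.3f}")        # delta = 0.215
print(f"relative = {relative:.1%}")  # relative = 3.1%
```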

What To Try In 7 Days

Run attention-layer WANDA pruning at 10% and 20% on a copy of your 7B model and keep the original.

Measure refusal rate on an internal set of risky prompts and compare to unpruned baseline.

Validate core business benchmarks (MMLU, task-specific tests) to ensure capability is preserved (look for small deltas).
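
The refusal-rate comparison in the second step can be sketched as follows. The keyword-based check is a crude stand-in, since this summary does not say how refusals were labeled; `REFUSAL_MARKERS`, `is_refusal`, `refusal_rate`, and the sample responses are all hypothetical. In practice the two response lists would come from the unpruned and pruned models run on the same prompts.

```python
# Hypothetical phrases treated as signs of a refusal (an assumption).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")

def is_refusal(response):
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses classified as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

# Made-up responses for illustration only; not results from the paper.
baseline_responses = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is an outline...",
    "I cannot assist with this request.",
    "Here are the steps...",
]
pruned_responses = [
    "I'm sorry, but I can't help with that.",
    "I am unable to provide that.",
    "I cannot assist with this request.",
    "Here are the steps...",
]

baseline_rate = refusal_rate(baseline_responses)
pruned_rate = refusal_rate(pruned_responses)
rate_delta = pruned_rate - baseline_rate  # positive means more refusals after pruning
```

Keyword matching will miss polite partial compliance and can flag harmless apologies, so a manual spot check or a stronger classifier is worth adding before acting on the delta.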

Optimization Features

Infra Optimization: lower memory footprint from pruned weights
Model Optimization: WANDA pruning (attention-layer, 10–30% sparsity)

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Experiments limited to 7B models; effects on larger models are unknown.

Only WANDA pruning evaluated; other compression methods may differ.

When Not To Use

Do not rely on pruning as the sole safety measure for poorly aligned models.

Avoid pruning >30% sparsity in production as it can reduce safety and capabilities.

Failure Modes

Over-pruning (>30%) reduces alignment and can increase harmful outputs.

Models with little or no prior safety training (e.g., no RLHF) may see little benefit.

Core Entities

Models

LLaMA-2-7B-Chat; Vicuna-1.3-7B; Mistral-Instruct-v0.2-7B; WANDA pruning (attention-layer)

Metrics

refusal rate; perplexity; attention entropy; IgnoreJailbreak metric; Accuracy; MSE (linear models)

Datasets

Custom malicious tasks dataset (225 tasks, 2,250 samples); AdvBench harmful behavior dataset; WikiText (perplexity); Open LLM Leaderboard tasks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA)

Benchmarks

Open LLM Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA); AdvBench

Context Entities

Models

LLaMA-2 base models and RLHF-aligned versions; Vicuna and Mistral fine-tuned variants

Metrics

p-values for GCG attack experiments; per-sparsity attention/head statistics

Datasets

AdvBench (for automated attacks); AltQA (effective context length test)

Benchmarks

Open LLM Leaderboard components