Moderate WANDA pruning (10–20%) increases jailbreak resistance of 7B LLMs without fine-tuning

Overview

Decision Snapshot: Needs Validation

Results are practical: modest attention pruning gives measurable safety gains on 7B models without fine-tuning and with small benchmark impact; evidence is limited to 7B models, one pruning method, and curated datasets.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Adib Hasan, Ileana Rugina, Alex Wang

Why It Matters For Business

Pruning attention weights at modest sparsity (10–20%) is a low-cost safety lever: it can raise refusal rates to harmful prompts and shrink model size without extra fine-tuning or big performance loss.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead

Summary TLDR

The authors show that moderate WANDA pruning of attention weights (roughly 10–20%) on 7B models can raise refusal rates to malicious prompts without any further fine-tuning and while keeping benchmark performance nearly intact. They curated 225 malicious tasks (2,250 prompt samples), tested LLaMA-2-7B-Chat, Vicuna-1.3-7B, and Mistral-7B Instruct, and analyzed attention entropy and perplexity shifts to argue pruning acts as a regularizer that helps detect unnatural jailbreak constructs.
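The sample count follows from crossing every task with every jailbreak template (10 templates, per the Main Contribution section). A minimal sketch of that arithmetic, assuming each template carries a single `{task}` slot; the task and template strings below are placeholders, not entries from the authors' dataset:

```python
from itertools import product

# Placeholder stand-ins: the real tasks and templates are not reproduced here.
tasks = [f"task_{i}" for i in range(225)]
templates = [f"template_{j}: {{task}}" for j in range(10)]

# Embed every task in every template.
samples = [tpl.format(task=t) for t, tpl in product(tasks, templates)]

print(len(samples))  # 225 * 10 = 2250
```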

Problem Statement

How does model compression affect an aligned LLM's susceptibility to jailbreak attacks? The paper asks whether pruning (WANDA) can increase jailbreak resistance without additional training and without degrading standard task performance.

Main Contribution

Curated a safety-focused dataset: 225 malicious tasks split into 5 categories and embedded in 10 jailbreak templates (2,250 samples).

Showed attention-layer WANDA pruning (10–20%) raises refusal rates on 7B models without fine-tuning, with peak benefits at ~20% sparsity.
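This summary does not spell out the WANDA criterion. As published, WANDA scores each weight by its magnitude times the L2 norm of the matching input activation, then drops the lowest-scoring weights within each output row. The sketch below illustrates that rule in plain Python on a toy matrix. The function name, the toy weights, and the calibration activations are all made up for illustration, and it is only an assumption that the authors' pipeline applies the same rule to the attention projection matrices.

```python
import math

def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring fraction of weights in each output row.

    weight:      list of rows, shape [out_features][in_features]
    activations: calibration inputs, shape [tokens][in_features]
    sparsity:    fraction of weights to remove per row, e.g. 0.2
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across all calibration tokens.
    feat_norm = [math.sqrt(sum(row[j] ** 2 for row in activations))
                 for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for w_row in weight:
        # Importance score: |weight| times the norm of the matching input feature.
        scores = [abs(w) * feat_norm[j] for j, w in enumerate(w_row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Toy values (hypothetical): 2 output rows, 5 input features, 2 calibration tokens.
weight = [[0.5, -0.1, 0.3, 0.05, -0.4],
          [0.2, 0.6, -0.05, 0.1, 0.3]]
activations = [[1.0, 0.5, 2.0, 4.0, 0.1],
               [0.5, 1.5, 1.0, 3.0, 0.2]]

pruned = wanda_prune(weight, activations, sparsity=0.2)
print(pruned)
```

In the first row the weight removed is -0.4, not the smallest-magnitude weight 0.05, because the input feature feeding -0.4 is almost inactive in the calibration data. That is the difference from plain magnitude pruning.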

Key Findings

Moderate attention-layer WANDA pruning increases refusal rates to jailbreak prompts.

Numbers: LLaMA-2, average +8.5% refusal rate across five categories

Practical Use: Try 10–20% attention pruning to improve refusal behavior for safety-trained 7B models before doing extra fine-tuning.

Evidence Ref: Introduction; Section 4.1; Figure 1

Safety benefits peak near 20% sparsity and reverse after heavier pruning.

Numbers: Resistance improves up to 20% then degrades by 30% sparsity

Practical Use: Avoid aggressive pruning (>30%); tune sparsity around 10–20% and monitor refusals and benchmarks.

Evidence Ref: Figures 3 and 5; Section 4.1
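Tuning sparsity while monitoring refusals and benchmarks can be framed as a small selection rule: keep the sparsity with the highest refusal rate among the runs whose benchmark drop stays within a tolerance. A sketch under stated assumptions: `pick_sparsity`, the dictionary layout, and the tolerance are hypothetical, and the numbers are placeholders, not results from the paper.

```python
def pick_sparsity(runs, max_benchmark_drop):
    """Return the sparsity with the highest refusal rate among runs whose
    benchmark drop versus the unpruned model (key 0.0) stays within tolerance."""
    baseline = runs[0.0]
    within_tolerance = {
        s: r for s, r in runs.items()
        if baseline["benchmark"] - r["benchmark"] <= max_benchmark_drop
    }
    return max(within_tolerance, key=lambda s: within_tolerance[s]["refusal"])

# Placeholder measurements, NOT results from the paper.
runs = {
    0.0: {"refusal": 0.50, "benchmark": 0.60},
    0.1: {"refusal": 0.55, "benchmark": 0.60},
    0.2: {"refusal": 0.58, "benchmark": 0.59},
    0.3: {"refusal": 0.45, "benchmark": 0.55},
}

best = pick_sparsity(runs, max_benchmark_drop=0.02)
print(best)
```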

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Refusal rate (safety)	LLaMA-2: avg +8.5% refusal (post-pruning)	unpruned LLaMA-2 Chat	+8.5% (average across five categories)	Custom malicious tasks (225 tasks, 2,250 samples)	Introduction; Section 4.1; Figure 1	Figure 1
Perplexity (WikiText)	6.943 → 7.158 (20% sparsity)	unpruned LLaMA-2 (6.943)	+0.215	WikiText perplexity	Table 1; Section 4.3	Table 1
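To put the perplexity row in proportion, the absolute delta can be expressed relative to the baseline. A quick check using only the two values from the table:

```python
baseline_ppl = 6.943  # unpruned LLaMA-2 (from the Results table)
pruned_ppl = 7.158    # 20% sparsity (from the Results table)

delta = pruned_ppl - baseline_ppl
relative = delta / baseline_ppl

print(f"absolute: +{delta:.3f}")     # absolute: +0.215
print(f"relative: +{relative:.1%}")  # relative: +3.1%
```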

What To Try In 7 Days

Run attention-layer WANDA pruning at 10% and 20% on a copy of your 7B model and keep the original.

Measure refusal rate on an internal set of risky prompts and compare to unpruned baseline.

Validate core business benchmarks (MMLU, task-specific tests) to confirm capability is preserved; look for small deltas against the unpruned baseline.
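The refusal-rate comparison in the steps above can be sketched as follows. This summary does not say how the authors judged refusals, so the keyword heuristic, the marker list, and the sample responses are all assumptions for illustration; a human review or a classifier would likely be more reliable in practice.

```python
# Hypothetical marker phrases; tune these for your own model's refusal style.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")

def is_refusal(response):
    """Crude heuristic: a response counts as a refusal if it contains a marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

# Placeholder outputs standing in for real generations on the same prompts.
baseline_responses = [
    "Sure, here is how...",
    "I cannot help with that.",
    "Sure thing.",
    "I'm sorry, but no.",
]
pruned_responses = [
    "I cannot help with that.",
    "I cannot help with that.",
    "Sure thing.",
    "I'm sorry, but no.",
]

delta = refusal_rate(pruned_responses) - refusal_rate(baseline_responses)
print(f"refusal-rate delta: {delta:+.2f}")
```

Running both models on the identical prompt set keeps the delta attributable to pruning and not to prompt differences.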

Optimization Features

Infra Optimization

lower memory footprint from pruned weights

Model Optimization

WANDA pruning (attention-layer, 10–30% sparsity)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Experiments limited to 7B models; effects on larger models are unknown.

Only WANDA pruning evaluated; other compression methods may differ.

When Not To Use

Do not rely on pruning as the sole safety measure for poorly aligned models.

Avoid pruning beyond 30% sparsity in production, as it can reduce both safety and capabilities.

Failure Modes

Over-pruning (>30%) reduces alignment and can increase harmful outputs.

Models with little or no prior safety training (e.g., no RLHF) may see little benefit.

Core Entities

Models

LLaMA-2-7B-Chat

Vicuna-1.3-7B

Mistral-Instruct-v0.2-7B

WANDA pruning (attention-layer)

Metrics

refusal rate

perplexity

attention entropy

IgnoreJailbreak metric

Accuracy

MSE (linear models)

Datasets

Custom malicious tasks dataset (225 tasks, 2,250 samples)

AdvBench harmful behavior dataset

WikiText (perplexity)

Open LLM Leaderboard tasks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA)

Benchmarks

Open LLM Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, AltQA)

AdvBench

Context Entities

Models

LLaMA-2 base models and RLHF-aligned versions

Vicuna and Mistral fine-tuned variants

Metrics

p-values for GCG attack experiments

per-sparsity attention/head statistics

Datasets

AdvBench (for automated attacks)

AltQA (effective context length test)

Benchmarks

Open LLM Leaderboard components
