Use each layer's outlier count to set non-uniform sparsity for much better LLM pruning

Overview

Decision Snapshot: Ready For Pilot

OWL is a low-cost change: compute per-layer sparsity from outlier counts and plug it into existing pruning pipelines. Experiments show consistent gains across LLaMA/OPT sizes and real CPU speedups, but the gains depend on how pronounced the outliers are and on the calibration data.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OWL lets you prune large language models to roughly 70% sparsity with far less quality loss than uniform layer sparsity, while delivering real CPU speedups (≈2.6× at 70%). That enables cheaper, faster inference and easier deployment on constrained hardware.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Data Scientist

Summary TLDR

OWL sets layerwise sparsity from each layer's count of 'outlier' weights (large-magnitude contributions): layers with more outliers are pruned less, with sparsity proportional to 1 - outlier_ratio. Applied as a drop-in replacement for uniform layer sparsity in existing one-shot pruning recipes (Wanda, SparseGPT), OWL sharply reduces perplexity at high sparsity (e.g., LLaMA-7B: Wanda 85.77 → OWL 24.55 at 70% sparsity) and yields real CPU speedups (≈2.6× at 70%). OWL is cheap to compute and works across the LLaMA/OPT families; it also guides structured pruning, N:M patterns, SVD, and mixed-precision choices.

Problem Statement

One-shot unstructured pruning for large language models usually uses the same sparsity per layer, but layers have very different counts of important "outlier" features (large-magnitude contributions). Uniform layer sparsity can remove critical outliers and harm quality at high sparsity. We need a cheap, data-driven way to set non-uniform layer sparsity that preserves outliers and improves pruned LLM performance.

Main Contribution

Identify strong layerwise non-uniformity of weight outliers in LLMs and link outlier retention to pruning quality.

Propose OWL: compute each layer's outlier ratio and set layer sparsity proportional to 1 - outlier_ratio, constrained by a small range λ around the global target sparsity.
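The allocation rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function name `owl_block_sparsity`, the min-max rescaling to a band of total width 2λ, and the assumption that all blocks hold the same number of weights (so the mean block sparsity equals the global target) are choices made here for illustration.

```python
def owl_block_sparsity(outlier_ratios, target_sparsity, lam):
    """Map per-block outlier ratios to per-block sparsities.

    Assumption (illustrative): ratios are min-max rescaled to a band of
    total width 2 * lam, then shifted so the mean sparsity equals the target.
    Blocks with more outliers end up with lower sparsity.
    """
    lo, hi = min(outlier_ratios), max(outlier_ratios)
    if hi == lo:
        # No non-uniformity to exploit: fall back to uniform sparsity.
        return [target_sparsity] * len(outlier_ratios)

    # Rescale outlier ratios into [0, 2 * lam].
    scaled = [(r - lo) / (hi - lo) * 2 * lam for r in outlier_ratios]
    mean_scaled = sum(scaled) / len(scaled)

    # Density (fraction of weights kept) grows with the outlier ratio and
    # is centered on the global target density (1 - target_sparsity).
    density = [s - mean_scaled + (1 - target_sparsity) for s in scaled]

    # No clipping to [0, 1] is done here; a large lam would need it.
    return [1 - d for d in density]


# Hypothetical outlier ratios for three transformer blocks.
outlier_ratios = [0.02, 0.05, 0.08]
sparsities = owl_block_sparsity(outlier_ratios, target_sparsity=0.7, lam=0.08)
for ratio, s in zip(outlier_ratios, sparsities):
    print(f"outlier ratio {ratio:.2f} -> sparsity {s:.2f}")
```

In this sketch the block with the highest outlier ratio keeps the most weights, and the gap between the most and least sparse block is 2λ. With strongly skewed ratios, a single block can sit further than λ from the target, so a stricter reading of the λ constraint would need an extra clamp.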

Key Findings

OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.

Numbers: Wanda 85.77 → OWL w. Wanda 24.55 (∆ −61.22) on WikiText

Practical Use: Replace uniform layer sparsity with OWL to cut language-model perplexity massively at high sparsity for small-to-medium LLMs.

Evidence Ref: Table 3 (LLaMA-7B, 70%)

OWL improves strong second-order pruning (SparseGPT) at 70% sparsity.

Numbers: SparseGPT 26.30 → OWL w. SparseGPT 19.49 (∆ −6.81) on LLaMA-7B

Practical Use: Even for advanced pruning methods that update weights, tuning per-layer sparsity via OWL yields measurable gains; combine OWL with existing methods.

Evidence Ref: Table 3 (LLaMA-7B, 70%)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (WikiText)	OWL w. Wanda 24.55 at 70% (LLaMA-7B)	Wanda 85.77 at 70% (LLaMA-7B)	−61.22	WikiText validation	Table 3 (LLaMA-7B)	Table 3
Perplexity (WikiText)	OWL w. SparseGPT 19.49 at 70% (LLaMA-7B)	SparseGPT 26.30 at 70% (LLaMA-7B)	−6.81	WikiText validation	Table 3 (LLaMA-7B)	Table 3
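As a quick sanity check on the table, the deltas can be recomputed from the reported perplexities, along with the relative change that the absolute numbers hide. The helper name `perplexity_change` is made up for this sketch; the inputs are the four values from the rows above.

```python
def perplexity_change(baseline, value):
    """Return (absolute delta, relative change in percent), rounded."""
    delta = value - baseline
    relative_pct = delta / baseline * 100
    return round(delta, 2), round(relative_pct, 1)


# (baseline perplexity, OWL perplexity) at 70% sparsity on LLaMA-7B.
rows = {
    "OWL w. Wanda vs Wanda": (85.77, 24.55),
    "OWL w. SparseGPT vs SparseGPT": (26.30, 19.49),
}

for name, (baseline, value) in rows.items():
    delta, pct = perplexity_change(baseline, value)
    print(f"{name}: delta {delta:+.2f}, relative {pct:+.1f}%")
```

The relative view makes the two rows easier to compare: the Wanda baseline loses about 71% of its perplexity with OWL, the SparseGPT baseline about 26%.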

What To Try In 7 Days

Measure the layerwise outlier distribution (LOD) of your model: score every weight with the paper's A_ij = ||X_j||_2 * |W_ij| rule and record each layer's outlier ratio.

Plug OWL's per-block sparsity into your Wanda or SparseGPT pipeline and prune at 50%–70%, then validate perplexity on WikiText or your in-domain set.

Run an end-to-end CPU inference test (DeepSparse or similar) to confirm latency and throughput gains at your target sparsity.
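For the first step, a minimal sketch of the outlier-ratio measurement might look like the following. The score A_ij = ||X_j||_2 * |W_ij| is taken from the text above; how a score is thresholded into an "outlier" is not stated in this summary, so the rule used here (score greater than `m` times the layer's mean score) and the default `m=5.0` are assumptions, as are the function name and the list-of-lists layout.

```python
import math


def outlier_ratio(weights, activations, m=5.0):
    """Fraction of weights in one layer whose score exceeds m * mean score.

    weights:     rows = output units i, columns = input features j
    activations: rows = calibration tokens, columns = input features j
    m:           assumed threshold multiple (not given in this summary)
    """
    n_in = len(weights[0])

    # ||X_j||_2: L2 norm of input feature j over all calibration tokens.
    feature_norm = [
        math.sqrt(sum(token[j] ** 2 for token in activations))
        for j in range(n_in)
    ]

    # A_ij = ||X_j||_2 * |W_ij| for every weight in the layer.
    scores = [
        feature_norm[j] * abs(w)
        for row in weights
        for j, w in enumerate(row)
    ]

    mean_score = sum(scores) / len(scores)
    n_outliers = sum(1 for s in scores if s > m * mean_score)
    return n_outliers / len(scores)


# Toy layer: 2 output units, 4 input features, one very large weight.
toy_weights = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 10.0],
]
toy_activations = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
print(outlier_ratio(toy_weights, toy_activations))
```

On a real model the same computation would run per layer on tensors from a calibration pass (the summary lists C4 as calibration data), and the per-layer ratios would then be aggregated per transformer block before setting sparsity.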

Optimization Features

Infra Optimization

Demonstrated end-to-end gains on Intel Xeon + DeepSparse

Better suitability for CPU/FPGA/other commodity hardware

Model Optimization

Unstructured weight pruning

Layerwise sparsity set by outlier ratios

Per-block (transformer block) granularity

System Optimization

Small extra compute for OWL (≈0–2s overhead vs Wanda)

Compatible with Wanda and SparseGPT pipelines

Training Optimization

LoRA

Inference Optimization

Enables CPU speedups via sparse kernels (DeepSparse measured)

Guides structured pruning and N:M patterns for hardware

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/luuyin/OWL.git

Data URLs

WikiText, C4

Risks & Boundaries

Limitations

OWL relies on detectable outliers; benefits are smaller on models or domains without clear outlier structure (vision models showed weaker gains).

Unstructured sparsity still has limited GPU support; best runtime gains seen on CPU/DeepSparse.

When Not To Use

If your deployment stack cannot exploit unstructured sparsity (for example, GPU kernels without sparse support), OWL's accuracy gains may not translate into speedups.

When model layers lack clear outlier dimensions (e.g., some vision models), OWL may add complexity without benefit.

Failure Modes

Inverting the outlier ratio (OWL-inverse) severely degrades quality at high sparsity.

Applying OWL per-layer instead of per-block can produce suboptimal, nearly uniform sparsity and poor perplexity.

Core Entities

Models

LLaMA-V1 (7B/13B/30B/65B), OPT-6.7B, LLaMA-V2-7B-chat-hf, Vicuna-7B, Mistral-7B

Metrics

Perplexity, Accuracy, End-to-end decode latency / throughput (tokens/sec), Layerwise Outlier Distribution (LOD)

Datasets

WikiText (validation), C4 (calibration data), BoolQ, RTE, HellaSwag, WinoGrande, ARC Easy, ARC Challenge, OpenbookQA

Benchmarks

WikiText perplexity, Accuracy, ImageNet-1K (vision appendix)

