Use each layer's outlier count to set non-uniform sparsity for much better LLM pruning

October 8, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

OWL is a low-cost change: it computes per-layer sparsity from outlier counts and plugs into existing pruning pipelines. Experiments show consistent gains across LLaMA/OPT sizes and real CPU speedups, but the size of the gain depends on how pronounced the model's outliers are and on the calibration data.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

OWL lets you prune large language models to roughly 70% sparsity while keeping useful quality and delivering real CPU speedups, enabling cheaper, faster inference and easier deployment on constrained hardware.

Who Should Care

Summary TLDR

OWL sets layerwise pruning ratios from each layer's ratio of 'outlier' weights (large-magnitude contributions): layers with more outliers are pruned less. Applied as a drop-in replacement for uniform layer sparsity in existing one-shot pruning recipes (Wanda, SparseGPT), OWL sharply reduces perplexity at high sparsity (e.g., LLaMA-7B: Wanda 85.77 → OWL 24.55 at 70% sparsity) and yields real CPU speedups (≈2.6× at 70%). OWL is cheap to compute and works across the LLaMA and OPT families; it also guides structured pruning, N:M patterns, SVD and mixed-precision choices.

Problem Statement

One-shot unstructured pruning for large language models usually uses the same sparsity per layer, but layers have very different counts of important "outlier" features (large-magnitude contributions). Uniform layer sparsity can remove critical outliers and harm quality at high sparsity. We need a cheap, data-driven way to set non-uniform layer sparsity that preserves outliers and improves pruned LLM performance.

Main Contribution

Identify strong layerwise non-uniformity of weight outliers in LLMs and link outlier retention to pruning quality.

Propose OWL: compute each layer's outlier ratio and set layer sparsity proportional to 1 - outlier_ratio, constrained by a small range λ around the global target sparsity.
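The allocation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' reference implementation: it assumes the outlier ratios are min-max rescaled to a band of width 2λ and then shifted so that the mean sparsity equals the global target. The function name `owl_layer_sparsity` and its parameters are hypothetical.

```python
import numpy as np

def owl_layer_sparsity(outlier_ratios, target_sparsity, lam):
    """Map per-layer outlier ratios to per-layer sparsity (illustrative sketch).

    Layers with more outliers receive lower sparsity. The spread between the
    most and least pruned layer is 2 * lam, and the mean equals the target.
    """
    d = np.asarray(outlier_ratios, dtype=float)
    span = d.max() - d.min()
    if span == 0:
        # No variation between layers: fall back to uniform sparsity.
        return np.full_like(d, target_sparsity)
    # Assumption: rescale outlier ratios into [0, 2 * lam].
    scaled = (d - d.min()) / span * (2 * lam)
    # Treat the rescaled value as extra density, centred on the target density.
    density = scaled - scaled.mean() + (1.0 - target_sparsity)
    return 1.0 - density

# Made-up outlier ratios for four layers, 70% global target, lam = 0.08.
ratios = [0.08, 0.05, 0.03, 0.02]
sparsity = owl_layer_sparsity(ratios, target_sparsity=0.7, lam=0.08)
```

Because the mapping is affine in 1 - outlier_ratio, the layer ordering is preserved: the layer with the most outliers always ends up with the lowest sparsity.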

Key Findings

OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.

Numbers: Wanda 85.77 → OWL w. Wanda 24.55 (∆ −61.22) on WikiText

Practical Use: Replace uniform layer sparsity with OWL to cut language-model perplexity massively at high sparsity for small-to-medium LLMs.

Evidence Ref: Table 3 (LLaMA-7B, 70%)

OWL improves strong second-order pruning (SparseGPT) at 70% sparsity.

Numbers: SparseGPT 26.30 → OWL w. SparseGPT 19.49 (∆ −6.81) on LLaMA-7B

Practical Use: Even for advanced pruning methods that update weights, tuning per-layer sparsity via OWL yields measurable gains; combine OWL with existing methods.

Evidence Ref: Table 3 (LLaMA-7B, 70%)
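To make "combine OWL with existing methods" concrete, here is a rough sketch of how a per-layer sparsity target might feed a Wanda-style pruner. It is a simplified illustration, not the Wanda or OWL reference code: it assumes the pruning score is |W_ij| times the L2 norm of input feature j, and that the lowest-scoring weights are dropped within each output row. The name `prune_rows_by_score` and its arguments are hypothetical.

```python
import numpy as np

def prune_rows_by_score(weight, feature_norms, sparsity):
    """Zero the lowest-scoring fraction of weights in each output row.

    weight: (out_features, in_features) matrix.
    feature_norms: (in_features,) L2 norm of each input feature.
    sparsity: fraction of weights to remove in this layer (e.g. from OWL).
    """
    scores = np.abs(weight) * feature_norms   # broadcast over rows
    k = int(weight.shape[1] * sparsity)       # weights to drop per row
    pruned = weight.copy()
    if k > 0:
        drop = np.argsort(scores, axis=1)[:, :k]  # indices of the k lowest scores
        np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned

# Toy example: with unit feature norms the score reduces to |W_ij|.
w = np.array([[1.0, -2.0, 3.0, -4.0],
              [4.0, 3.0, 2.0, 1.0]])
pruned = prune_rows_by_score(w, np.ones(4), sparsity=0.5)
```

Under this sketch, the only change OWL introduces is the `sparsity` argument, which would differ per layer instead of being one global constant.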

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (WikiText) | OWL w. Wanda 24.55 at 70% (LLaMA-7B) | Wanda 85.77 at 70% (LLaMA-7B) | −61.22 | WikiText validation | Table 3 (LLaMA-7B) | Table 3
Perplexity (WikiText) | OWL w. SparseGPT 19.49 at 70% (LLaMA-7B) | SparseGPT 26.30 at 70% (LLaMA-7B) | −6.81 | WikiText validation | Table 3 (LLaMA-7B) | Table 3

What To Try In 7 Days

Measure the Layerwise Outlier Distribution (LOD), i.e. each layer's outlier ratio, on your model using the paper's outlier score A_ij = ||X_j||_2 * |W_ij|.

Plug OWL's per-block sparsity into your Wanda or SparseGPT pipeline and prune at 50%–70%, then validate perplexity on WikiText or your in-domain set.

Run an end-to-end CPU inference test (DeepSparse or similar) to confirm latency and throughput gains at your target sparsity.
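The first step above, measuring an outlier ratio with the A_ij score, might look like the sketch below for a single weight matrix. Two things are assumptions not stated in this summary: an entry counts as an outlier when its score exceeds `m` times the mean score of the matrix, and the default `m=5.0` is a placeholder to tune. The function name `outlier_ratio` is hypothetical.

```python
import numpy as np

def outlier_ratio(weight, activations, m=5.0):
    """Fraction of weights whose score A_ij exceeds m times the mean score.

    weight: (out_features, in_features) matrix W.
    activations: (tokens, in_features) calibration inputs X.
    m: outlier threshold multiplier (assumed value, tune per model).
    """
    feature_norms = np.linalg.norm(activations, axis=0)  # ||X_j||_2 per input feature
    scores = np.abs(weight) * feature_norms              # A_ij = ||X_j||_2 * |W_ij|
    threshold = m * scores.mean()
    return float((scores > threshold).mean())

# Toy check: one large weight among 16 equal ones is the only outlier.
w = np.ones((4, 4))
w[2, 1] = 100.0
x = np.ones((2, 4))
ratio = outlier_ratio(w, x)  # 1 outlier out of 16 entries
```

In practice the activations would come from a small calibration set (the summary lists C4 for this purpose), collected at the input of each linear layer.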

Optimization Features

Infra Optimization: Demonstrated end-to-end gains on Intel Xeon + DeepSparse; better suitability for CPU/FPGA/other commodity hardware
Model Optimization: Unstructured weight pruning; layerwise sparsity set by outlier ratios; per-block (transformer block) granularity
System Optimization: Small extra compute for OWL (≈0–2s overhead vs Wanda); compatible with Wanda and SparseGPT pipelines
Training Optimization: LoRA
Inference Optimization: Enables CPU speedups via sparse kernels (DeepSparse measured); guides structured pruning and N:M patterns for hardware

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Data URLs

WikiText, C4

Risks & Boundaries

Limitations

OWL relies on detectable outliers; benefits are smaller on models or domains without clear outlier structure (vision models showed weaker gains).

Unstructured sparsity still has limited GPU support; best runtime gains seen on CPU/DeepSparse.

When Not To Use

If your deployment stack cannot exploit unstructured sparsity (GPU kernels lack support), OWL's accuracy gains may not translate to speed.

When model layers lack clear outlier dimensions (e.g., some vision models), OWL may add complexity without benefit.

Failure Modes

Inverting the outlier ratio (OWL-inverse) severely degrades quality at high sparsity.

Applying OWL per-layer instead of per-block can produce suboptimal, nearly uniform sparsity and poor perplexity.
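One way to keep the granularity at the transformer-block level is to pool outlier counts across all weight matrices in a block before computing the ratio, then give every matrix in that block the same sparsity. The sketch below assumes this pooling scheme; the input format and the name `block_outlier_ratios` are hypothetical.

```python
def block_outlier_ratios(layer_counts):
    """Pool outlier counts over all weight matrices in each transformer block.

    layer_counts maps a block index to a list of (num_outliers, num_weights)
    tuples, one per linear layer in that block (hypothetical input format).
    """
    ratios = {}
    for block, counts in layer_counts.items():
        outliers = sum(o for o, _ in counts)
        total = sum(n for _, n in counts)
        ratios[block] = outliers / total
    return ratios

# Made-up counts for two blocks with two linear layers each.
example = {0: [(120, 1000), (80, 1000)], 1: [(10, 1000), (30, 3000)]}
block_ratios = block_outlier_ratios(example)
```

Pooling counts weights larger matrices more heavily than averaging per-matrix ratios would; which of the two is preferable is a design choice to validate on your own model.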

Core Entities

Models

LLaMA-V1 (7B/13B/30B/65B), OPT-6.7B, LLaMA-V2-7B-chat-hf, Vicuna-7B, Mistral-7B

Metrics

Perplexity, Accuracy, End-to-end decode latency / throughput (tokens/sec), Layerwise Outlier Distribution (LOD)

Datasets

WikiText (validation), C4 (calibration data), BoolQ, RTE, HellaSwag, WinoGrande, ARC Easy, ARC Challenge, OpenbookQA

Benchmarks

WikiText perplexity, Accuracy, ImageNet-1K (vision appendix)