Use each layer's outlier count to set non-uniform sparsity for much better LLM pruning

October 8, 2023 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

2

Authors

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu

Why It Matters For Business

OWL lets you prune large language models to roughly 70% sparsity while retaining useful quality and delivering measured CPU speedups, which enables cheaper, faster inference and easier deployment on constrained hardware.

Summary TLDR

OWL (Outlier Weighted Layerwise sparsity) sets each layer's pruning ratio from its share of 'outlier' weights (large-magnitude contributions): layers with more outliers are pruned less. Applied as a drop-in replacement for uniform layer sparsity in existing one-shot pruning recipes (Wanda, SparseGPT), OWL sharply reduces perplexity at high sparsity (e.g., LLaMA-7B on WikiText at 70% sparsity: Wanda 85.77 → OWL 24.55) and yields measured CPU speedups (≈2.6× at 70%). OWL is cheap to compute and works across the LLaMA and OPT families; it also guides structured pruning, N:M patterns, SVD and mixed-precision choices.

Problem Statement

One-shot unstructured pruning for large language models usually uses the same sparsity per layer, but layers have very different counts of important "outlier" features (large-magnitude contributions). Uniform layer sparsity can remove critical outliers and harm quality at high sparsity. We need a cheap, data-driven way to set non-uniform layer sparsity that preserves outliers and improves pruned LLM performance.

Main Contribution

Identify strong layerwise non-uniformity of weight outliers in LLMs and link outlier retention to pruning quality.

Propose OWL: compute each layer's outlier ratio and set layer sparsity proportional to 1 − outlier_ratio, so that layers with more outliers are pruned less, while keeping every layer within a small range λ of the global target sparsity.

Show OWL as a drop-in layerwise-sparsity module for Wanda and SparseGPT, improving perplexity and zero-shot accuracy at high sparsity (50%–80%) with minimal extra compute.
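The allocation rule in the second contribution can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `owl_layer_sparsity`, the min-max rescaling of outlier ratios, and the example numbers are all assumptions. They are chosen so that sparsity falls as the outlier ratio rises, the spread across blocks is 2λ wide, and the mean equals the global target.

```python
def owl_layer_sparsity(outlier_ratios, target_sparsity, lam):
    """Map per-block outlier ratios to per-block sparsity (hypothetical helper)."""
    lo, hi = min(outlier_ratios), max(outlier_ratios)
    if hi == lo:
        # No variation between blocks: fall back to uniform sparsity.
        return [target_sparsity] * len(outlier_ratios)
    # Assumption: min-max rescale the outlier ratios into [0, 2 * lam].
    scaled = [(d - lo) / (hi - lo) * 2 * lam for d in outlier_ratios]
    mean_scaled = sum(scaled) / len(scaled)
    # Blocks with more outliers than average get less sparsity, and vice versa.
    # Subtracting the centered value keeps the mean at target_sparsity.
    return [target_sparsity - (s - mean_scaled) for s in scaled]


# Illustrative (made-up) outlier ratios for five transformer blocks.
ratios = [0.08, 0.03, 0.02, 0.03, 0.06]
sparsity = owl_layer_sparsity(ratios, target_sparsity=0.7, lam=0.08)
for block, (d, s) in enumerate(zip(ratios, sparsity)):
    print(f"block {block}: outlier ratio {d:.2f} -> sparsity {s:.3f}")
```

Flipping the sign of the centered term would give the inverted allocation listed under Failure Modes, which is a quick way to sanity-check that the direction of the mapping matters on your model.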

Key Findings

OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.

Numbers: Wanda 85.77 → OWL w. Wanda 24.55 (∆ −61.22) on WikiText

OWL improves strong second-order pruning (SparseGPT) at 70% sparsity.

Numbers: SparseGPT 26.30 → OWL w. SparseGPT 19.49 (∆ −6.81) on LLaMA-7B

Outlier retention correlates with pruning success; magnitude pruning removes outliers and fails.

Numbers: Magnitude pruning LOD change −0.110; massive perplexity (e.g., 84539 on 13B at 70%)

OWL yields real CPU inference speedups when used to produce sparse models.

Numbers: End-to-end decode speedup ≈2.6× at 70% sparsity; up to 3.9× at 90% (DeepSparse on CPU)

Short LoRA fine-tuning recovers much of the remaining quality loss after OWL pruning.

Numbers: OWL w. SparseGPT LLaMA-7B: perplexity 19.49 → 11.15 after LoRA on 30k tokens

Outlier distribution across layers is highly non-uniform (U-shaped), motivating per-layer treatment.

Numbers: Layerwise Outlier Distribution (LOD) shows peaks near early/late layers (empirical plots/tables)

Results

Perplexity (WikiText)

Value: OWL w. Wanda 24.55 at 70% (LLaMA-7B)

Baseline: Wanda 85.77 at 70% (LLaMA-7B)

Perplexity (WikiText)

Value: OWL w. SparseGPT 19.49 at 70% (LLaMA-7B)

Baseline: SparseGPT 26.30 at 70% (LLaMA-7B)
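For readers comparing these figures against their own runs: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below shows only that definition. The helper name `perplexity` and the toy log-probabilities are illustrative and not taken from the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy case: a model that assigns probability 1/4 to every observed token.
uniform_log_probs = [math.log(1 / 4)] * 10
ppl_uniform = perplexity(uniform_log_probs)
```

A model that is always choosing uniformly among four options lands at a perplexity of about 4, which gives some intuition for the gap between 85.77 and 24.55.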

Mean zero-shot accuracy

Value: OWL w. Wanda 46.47% at 70% (LLaMA-7B)

Baseline: Wanda 39.39% at 70% (LLaMA-7B)

End-to-end decode speedup (DeepSparse CPU)

Value: ≈2.6× at 70% sparsity

Baseline: dense model (1.0×)

Perplexity after LoRA fine-tuning

Value: 19.49 → 11.15 (LLaMA-7B, 70%)

Baseline: OWL w. SparseGPT without fine-tuning, 19.49

What To Try In 7 Days

Measure layerwise outlier ratio (LOD) on your model using the paper's A_ij = ||X_j||_2 * |W_ij| rule.

Plug OWL's per-block sparsity into your Wanda or SparseGPT pipeline and prune at 50%–70%, then validate perplexity on WikiText or your in-domain set.

Run an end-to-end CPU inference test (DeepSparse or similar) to confirm latency and throughput gains at your target sparsity.
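The first step in the list above can be prototyped without any framework. The sketch below scores each weight with A_ij = ||X_j||_2 * |W_ij| and reports the fraction of scores above a threshold. It assumes that M acts as a multiplier on the mean score of the matrix, which this summary does not spell out; the helper name `outlier_ratio` and the toy matrices are hypothetical, and a real run would use framework tensors and calibration activations.

```python
import math


def outlier_ratio(W, X, M):
    """Fraction of scores A_ij = ||X_j||_2 * |W_ij| above M * mean(A).

    W: weight rows, shape (out_features, in_features).
    X: calibration activations, shape (tokens, in_features).
    M: threshold multiplier (assumed to scale the mean score).
    """
    n_in = len(W[0])
    # L2 norm of each input feature across all calibration tokens.
    col_norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_in)]
    scores = [col_norms[j] * abs(w) for row in W for j, w in enumerate(row)]
    mean_score = sum(scores) / len(scores)
    return sum(s > M * mean_score for s in scores) / len(scores)


# Toy example: one weight is far larger than the rest.
W_demo = [[0.1, -0.2, 0.1], [0.1, 0.1, 5.0]]
X_demo = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
ratio_demo = outlier_ratio(W_demo, X_demo, M=3)

# A weight of ordinary size can still be an outlier if its input is large.
ratio_act = outlier_ratio([[1.0, 1.0, 1.0, 1.0]], [[10.0, 0.1, 0.1, 0.1]], M=3)
```

Computing this per transformer block and plotting the values by depth is a cheap way to check whether your model shows the non-uniform pattern the paper reports before committing to a pruning run.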

Optimization Features

Infra Optimization

  • Demonstrated end-to-end gains on Intel Xeon + DeepSparse
  • Better suitability for CPU/FPGA/other commodity hardware

Model Optimization

  • Unstructured weight pruning
  • Layerwise sparsity set by outlier ratios
  • Per-block (transformer block) granularity

System Optimization

  • Small extra compute for OWL (≈0–2s overhead vs Wanda)
  • Compatible with Wanda and SparseGPT pipelines
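To make the "drop-in" claim concrete, here is a simplified sketch of a Wanda-style pruning step that accepts a per-block sparsity instead of a single global value. It assumes the score |W_ij| * ||X_j||_2 is compared within each output row; the helper name `wanda_prune_rows`, the toy matrices, and the `block_sparsity` values are hypothetical, and the real pipelines operate on framework tensors rather than nested lists.

```python
import math


def wanda_prune_rows(W, X, sparsity):
    """Zero the lowest-scoring fraction of weights in each output row.

    Score is |W_ij| * ||X_j||_2 (hypothetical pure-Python helper).
    """
    n_in = len(W[0])
    col_norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)  # number of weights removed per row
    pruned = []
    for row in W:
        scores = [abs(w) * col_norms[j] for j, w in enumerate(row)]
        # Indices of the n_prune smallest scores in this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy calibration data: the third input feature has a much larger norm.
X = [[1.0, 1.0, 4.0, 1.0], [1.0, 1.0, 4.0, 1.0]]
W = [[0.9, -0.5, 0.2, 0.1], [0.3, 0.8, -0.1, 0.6]]

# Per-block sparsity as an OWL-style allocation might supply it (made-up values).
block_sparsity = [0.5, 0.75]
pruned_blocks = [wanda_prune_rows(W, X, s) for s in block_sparsity]
```

In the first row at 50% sparsity the weight 0.2 survives while the larger weight -0.5 is removed, because the activation norm of its input feature lifts its score. Pure magnitude pruning would make the opposite choice.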

Training Optimization

  • Short LoRA fine-tuning after pruning to recover quality

Inference Optimization

  • Enables CPU speedups via sparse kernels (DeepSparse measured)
  • Guides structured pruning and N:M patterns for hardware
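For readers unfamiliar with N:M patterns, the sketch below builds a mask that keeps the n highest-scoring entries in every group of m consecutive weights (2:4 keeps two of every four). How per-block sparsity from OWL maps onto a choice of n is not described in this summary; varying n per block with m fixed is one plausible reading and is an assumption here. The helper name `n_m_mask` and the scores are illustrative.

```python
def n_m_mask(scores, n, m):
    """Return a 0/1 mask keeping the n largest scores in each group of m."""
    if len(scores) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(scores), m):
        group = scores[start:start + m]
        # Positions (within the group) of the n highest scores.
        keep = set(sorted(range(m), key=lambda k: group[k], reverse=True)[:n])
        mask.extend(1 if k in keep else 0 for k in range(m))
    return mask


# Made-up importance scores for one row of eight weights.
row_scores = [0.9, 0.1, 0.4, 0.3, 0.2, 0.8, 0.7, 0.05]
mask_2_4 = n_m_mask(row_scores, n=2, m=4)
```

Under the assumed mapping, a block with many outliers would get a larger n (denser groups) and a block with few outliers a smaller n, while m stays fixed so the hardware pattern remains regular.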

Reproducibility

Data Urls

  • WikiText
  • C4

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • OWL relies on detectable outliers; benefits are smaller on models or domains without clear outlier structure (vision models showed weaker gains).
  • Unstructured sparsity still has limited GPU support; best runtime gains seen on CPU/DeepSparse.
  • OWL needs a small calibration dataset and two hyperparameters (M, λ) that require light tuning per model size.

When Not To Use

  • If your deployment stack cannot exploit unstructured sparsity (for example, GPU kernels without sparse support), OWL's accuracy gains may not translate to speed.
  • When model layers lack clear outlier dimensions (e.g., some vision models), OWL may add complexity without benefit.
  • If you cannot run any calibration data pass (required to compute LOD), OWL cannot be applied.

Failure Modes

  • Inverting the outlier ratio (OWL-inverse) severely degrades quality at high sparsity.
  • Applying OWL per-layer instead of per-block can produce suboptimal, nearly uniform sparsity and poor perplexity.
  • Combining OWL with naive magnitude pruning removes outliers and produces catastrophic perplexity increases.

Core Entities

Models

  • LLaMA-V1 (7B/13B/30B/65B)
  • OPT-6.7B
  • LLaMA-V2-7B-chat-hf
  • Vicuna-7B
  • Mistral-7B

Metrics

  • Perplexity
  • Zero-shot accuracy
  • End-to-end decode latency / throughput (tokens/sec)
  • Layerwise Outlier Distribution (LOD)

Datasets

  • WikiText (validation)
  • C4 (calibration data)
  • BoolQ
  • RTE
  • HellaSwag
  • WinoGrande
  • ARC Easy
  • ARC Challenge
  • OpenbookQA

Benchmarks

  • WikiText perplexity
  • Zero-shot accuracy on the downstream tasks listed under Datasets
  • ImageNet-1K (vision appendix)