Overview
OWL is a low-cost change that computes per-layer sparsity from outlier counts and plugs into existing pruning pipelines. Experiments show consistent gains across LLaMA and OPT model sizes and real CPU speedups, but the gains depend on how pronounced the model's outliers are and on the calibration data used to measure them.
Citations: 2
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
OWL lets you prune large language models to roughly 70% sparsity while keeping useful quality and delivering real CPU speedups, enabling cheaper, faster inference and easier deployment on constrained hardware.
Summary TLDR
OWL sets layerwise pruning ratios from each layer's outlier ratio, the share of its weights that make large-magnitude contributions: layers with more outliers are pruned less. Applied as a drop-in replacement for uniform layer sparsity in existing one-shot pruning recipes (Wanda, SparseGPT), OWL sharply reduces perplexity at high sparsity (e.g., LLaMA-7B: Wanda 85.77 → OWL 24.55 at 70% sparsity) and yields real CPU speedups (≈2.6× at 70%). OWL is cheap to compute and works across the LLaMA and OPT families; it also guides structured pruning, N:M patterns, SVD and mixed-precision choices.
Problem Statement
One-shot unstructured pruning for large language models usually uses the same sparsity per layer, but layers have very different counts of important "outlier" features (large-magnitude contributions). Uniform layer sparsity can remove critical outliers and harm quality at high sparsity. We need a cheap, data-driven way to set non-uniform layer sparsity that preserves outliers and improves pruned LLM performance.
Main Contribution
Identify strong layerwise non-uniformity of weight outliers in LLMs and link outlier retention to pruning quality.
Propose OWL: compute each layer's outlier ratio and set layer sparsity proportional to 1 - outlier_ratio, constrained by a small range λ around the global target sparsity.
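The allocation rule in the second contribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `owl_block_sparsity`, the centering-and-scaling normalization, and the example outlier ratios are all assumptions chosen so that the result decreases with the outlier ratio, stays within λ of the target, and averages to the target.

```python
def owl_block_sparsity(outlier_ratios, target_sparsity, lam):
    """Map per-block outlier ratios to per-block sparsities.

    Assumed normalization (not taken from the paper's code): center the
    ratios on their mean, then scale so the largest deviation equals lam.
    Blocks with more outliers end up with lower sparsity.
    """
    n = len(outlier_ratios)
    mean_ratio = sum(outlier_ratios) / n
    deviations = [d - mean_ratio for d in outlier_ratios]
    max_dev = max(abs(dev) for dev in deviations)
    if max_dev == 0:
        # All blocks look the same: fall back to uniform sparsity.
        return [target_sparsity] * n
    scale = lam / max_dev
    # Sparsity moves opposite to the outlier ratio, i.e. with (1 - ratio).
    return [target_sparsity - dev * scale for dev in deviations]


# Hypothetical outlier ratios for four blocks (illustrative values only).
ratios = [0.02, 0.05, 0.03, 0.10]
sparsities = owl_block_sparsity(ratios, target_sparsity=0.70, lam=0.08)
print([round(s, 3) for s in sparsities])
```

In this sketch the block with the highest outlier ratio receives the lowest sparsity (the target minus λ), and the mean over blocks equals the global target, so the overall parameter budget is unchanged.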
Key Findings
OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.
OWL improves strong second-order pruning (SparseGPT) at 70% sparsity.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (WikiText) | OWL w. Wanda 24.55 at 70% (LLaMA-7B) | Wanda 85.77 at 70% (LLaMA-7B) | −61.22 | WikiText validation | Table 3 (LLaMA-7B) | Table 3 |
| Perplexity (WikiText) | OWL w. SparseGPT 19.49 at 70% (LLaMA-7B) | SparseGPT 26.30 at 70% (LLaMA-7B) | −6.81 | WikiText validation | Table 3 (LLaMA-7B) | Table 3 |
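The Delta column can be re-derived from the Value and Baseline columns. The short check below also computes the relative reduction, which the table does not report; the helper name `perplexity_delta` is hypothetical.

```python
def perplexity_delta(baseline, with_owl):
    """Return (absolute delta, relative change) of perplexity vs. the baseline."""
    delta = with_owl - baseline
    return delta, delta / baseline


# Perplexities copied from the table above (LLaMA-7B, 70% sparsity).
rows = [("Wanda", 85.77, 24.55), ("SparseGPT", 26.30, 19.49)]
for name, baseline, with_owl in rows:
    delta, relative = perplexity_delta(baseline, with_owl)
    print(f"{name}: delta {delta:.2f}, relative change {relative:.1%}")
```

Lower perplexity is better, so a negative delta is an improvement. The gain over Wanda is much larger in relative terms than the gain over SparseGPT, whose baseline is already far stronger at this sparsity.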
What To Try In 7 Days
Measure the layerwise outlier distribution (LOD) on your model using the paper's outlier score A_ij = ||X_j||_2 * |W_ij|.
Plug OWL's per-block sparsity into your Wanda or SparseGPT pipeline and prune at 50%–70%, then validate perplexity on WikiText or your in-domain set.
Run an end-to-end CPU inference test (DeepSparse or similar) to confirm latency and throughput gains at your target sparsity.
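The first step above can be sketched in plain Python. The score A_ij = ||X_j||_2 * |W_ij| is the rule quoted in the list; the threshold rule used here (a weight counts as an outlier when its score exceeds M times the mean score) and the default M = 5 are assumptions not stated in this summary, and the function name `outlier_ratio` is hypothetical. A real pipeline would use tensor operations over calibration activations rather than nested lists.

```python
import math


def outlier_ratio(W, X, M=5.0):
    """Fraction of weights whose score A_ij exceeds M times the mean score.

    W: weight matrix as a list of rows, shape (out_features, in_features).
    X: calibration activations as a list of rows, shape (tokens, in_features).
    M: assumed threshold multiplier (a hyperparameter, not given in the text).
    """
    n_in = len(W[0])
    # ||X_j||_2: L2 norm of input feature j across all calibration tokens.
    feature_norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_in)]
    # A_ij = ||X_j||_2 * |W_ij| for every weight, flattened into one list.
    scores = [abs(w) * feature_norms[j] for w_row in W for j, w in enumerate(w_row)]
    mean_score = sum(scores) / len(scores)
    n_outliers = sum(1 for a in scores if a > M * mean_score)
    return n_outliers / len(scores)


# Toy example: one weight is far larger than the rest.
W = [[0.1, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 10.0]]
X = [[1.0, 1.0, 1.0, 1.0],
     [1.0, 1.0, 1.0, 1.0]]
print(outlier_ratio(W, X, M=5.0))
```

In the toy example only the weight of 10.0 clears the threshold, so one of eight weights is flagged. Repeating the measurement for each block gives the per-block ratios that feed the sparsity allocation.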
Risks & Boundaries
Limitations
OWL relies on detectable outliers; benefits are smaller on models or domains without clear outlier structure (vision models showed weaker gains).
Unstructured sparsity still has limited GPU support; the best runtime gains were observed on CPU with DeepSparse.
When Not To Use
If your deployment stack cannot exploit unstructured sparsity (GPU kernels lack support), OWL's accuracy gains may not translate to speed.
When model layers lack clear outlier dimensions (e.g., some vision models), OWL may add complexity without benefit.
Failure Modes
Inverting the outlier ratio (OWL-inverse) severely degrades quality at high sparsity.
Applying OWL per-layer instead of per-block can produce suboptimal, nearly uniform sparsity and poor perplexity.

