Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
OWL lets you prune large language models to as much as ~70% sparsity while keeping useful quality and delivering real CPU speedups, which enables cheaper, faster inference and easier deployment on constrained hardware.
Summary TLDR
OWL assigns each layer a sparsity ratio that decreases with its share of 'outlier' weights (large-magnitude contributions), so outlier-rich layers are pruned less. Applied as a drop-in replacement for uniform layer sparsity in existing one-shot pruning recipes (Wanda, SparseGPT), OWL sharply reduces perplexity at high sparsity (e.g., LLaMA-7B: Wanda 85.77 → OWL 24.55 at 70% sparsity) and yields real CPU speedups (≈2.6× at 70%). OWL is cheap to compute and works across the LLaMA and OPT families; it also guides structured pruning, N:M patterns, SVD, and mixed-precision choices.
Problem Statement
One-shot unstructured pruning for large language models usually uses the same sparsity per layer, but layers have very different counts of important "outlier" features (large-magnitude contributions). Uniform layer sparsity can remove critical outliers and harm quality at high sparsity. We need a cheap, data-driven way to set non-uniform layer sparsity that preserves outliers and improves pruned LLM performance.
Main Contribution
Identify strong layerwise non-uniformity of weight outliers in LLMs and link outlier retention to pruning quality.
Propose OWL: compute each layer's outlier ratio and set layer sparsity proportional to 1 - outlier_ratio, constrained by a small range λ around the global target sparsity.
Show OWL as a drop-in layerwise-sparsity module for Wanda and SparseGPT, improving perplexity and zero-shot accuracy at high sparsity (50%–80%) with minimal extra compute.
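The allocation rule in the second contribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `owl_layer_sparsity` and the specific rescaling (min-max scaling of outlier ratios into a band of width 2λ, then re-centering so the mean equals the global target) are assumptions chosen to satisfy "proportional to 1 - outlier_ratio" under a λ constraint.

```python
import numpy as np

def owl_layer_sparsity(outlier_ratios, target_sparsity, lam):
    """Hypothetical helper: map per-layer outlier ratios to per-layer sparsities.

    Layers with more outliers receive lower sparsity. The result is affine in
    (1 - outlier_ratio), spans a band of width at most 2 * lam, and averages
    to target_sparsity.
    """
    d = np.asarray(outlier_ratios, dtype=float)
    span = d.max() - d.min()
    if span == 0:
        # No variation between layers: fall back to uniform sparsity.
        return np.full_like(d, target_sparsity)
    # Rescale outlier ratios into [0, 2 * lam].
    scaled = (d - d.min()) / span * (2 * lam)
    # Subtract the centered offset so outlier-rich layers are pruned less
    # and the mean sparsity stays at the global target.
    return target_sparsity - (scaled - scaled.mean())
```

For example, `owl_layer_sparsity([0.05, 0.01, 0.01, 0.03], 0.7, 0.08)` gives the most outlier-rich layer the lowest sparsity while the four values still average to 0.7.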
Key Findings
OWL sharply lowers perplexity vs Wanda at 70% sparsity on LLaMA-7B.
OWL improves strong second-order pruning (SparseGPT) at 70% sparsity.
Outlier retention correlates with pruning success; magnitude pruning removes outliers and fails.
OWL yields real CPU inference speedups when used to produce sparse models.
Short LoRA fine-tuning recovers much of the remaining quality loss after OWL pruning.
Outlier distribution across layers is highly non-uniform (U-shaped), motivating per-layer treatment.
Results
Perplexity (WikiText)
Accuracy
End-to-end decode speedup (DeepSparse CPU)
LoRA
Who Should Care
What To Try In 7 Days
Measure each layer's outlier ratio on your model (together these form the Layerwise Outlier Distribution, LOD) using the paper's outlier score A_ij = ||X_j||_2 * |W_ij|.
Plug OWL's per-block sparsity into your Wanda or SparseGPT pipeline and prune at 50%–70%, then validate perplexity on WikiText or your in-domain set.
Run an end-to-end CPU inference test (DeepSparse or similar) to confirm latency and throughput gains at your target sparsity.
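The first step above can be sketched with NumPy. The score A_ij = ||X_j||_2 * |W_ij| comes from the text; the thresholding rule (an entry counts as an outlier when its score exceeds M times the layer's mean score) is an assumption about how the hyperparameter M is used, and the function name `outlier_ratio` and default `m=5.0` are hypothetical.

```python
import numpy as np

def outlier_ratio(weight, activations, m=5.0):
    """Fraction of entries whose score exceeds m times the mean score.

    weight:      array of shape (out_features, in_features)
    activations: calibration inputs of shape (n_tokens, in_features)
    m:           outlier threshold multiplier (assumed meaning of M)
    """
    # ||X_j||_2: L2 norm of each input feature over the calibration tokens.
    feat_norm = np.linalg.norm(activations, axis=0)
    # A_ij = ||X_j||_2 * |W_ij|
    score = np.abs(weight) * feat_norm[None, :]
    # Share of entries that count as outliers under the assumed rule.
    return float((score > m * score.mean()).mean())
```

Running this once per layer (or per transformer block) on a small calibration batch yields the per-layer ratios that feed the sparsity allocation.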
Optimization Features
Infra Optimization
- Demonstrated end-to-end gains on Intel Xeon + DeepSparse
- Better suitability for CPU/FPGA/other commodity hardware
Model Optimization
- Unstructured weight pruning
- Layerwise sparsity set by outlier ratios
- Per-block (transformer block) granularity
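Once a layer has its own sparsity, unstructured pruning zeroes the lowest-scoring weights in that layer. The sketch below assumes a Wanda-style per-output-row comparison; the function name `prune_rows_by_score` is hypothetical, and the score matrix is supplied by the caller.

```python
import numpy as np

def prune_rows_by_score(weight, score, sparsity):
    """Zero the lowest-score fraction of weights in every output row.

    weight, score: arrays of shape (out_features, in_features)
    sparsity:      fraction of each row to remove, e.g. a per-layer OWL value
    """
    w = np.array(weight, dtype=float)  # copy so the input stays untouched
    k = int(w.shape[1] * sparsity)     # number of weights dropped per row
    if k == 0:
        return w
    # Column indices of the k smallest scores in each row.
    drop = np.argsort(score, axis=1)[:, :k]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w
```

In an OWL pipeline the score would be the activation-aware A_ij rather than plain |W_ij|, since the text notes that magnitude pruning removes outliers.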
System Optimization
- Small extra compute for OWL (≈0–2s overhead vs Wanda)
- Compatible with Wanda and SparseGPT pipelines
Training Optimization
- LoRA
Inference Optimization
- Enables CPU speedups via sparse kernels (DeepSparse measured)
- Guides structured pruning and N:M patterns for hardware
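For readers unfamiliar with N:M patterns, the mask mechanics can be illustrated as below: keep the N highest-scoring weights in every group of M consecutive input weights. This summary does not say how OWL chooses N per layer, so the sketch covers only the mask itself; `nm_mask` is a hypothetical helper name.

```python
import numpy as np

def nm_mask(score, n, m):
    """Boolean keep-mask with n kept entries per group of m consecutive columns."""
    rows, cols = score.shape
    if cols % m != 0:
        raise ValueError("number of columns must be divisible by m")
    groups = score.reshape(rows, cols // m, m)
    # Indices of the n largest scores inside each group.
    top = np.argsort(groups, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask.reshape(rows, cols)
```

A 2:4 mask, for instance, keeps exactly half of the weights, two in each group of four.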
Reproducibility
Code Urls
Data Urls
- WikiText
- C4
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- OWL relies on detectable outliers; benefits are smaller on models or domains without clear outlier structure (vision models showed weaker gains).
- Unstructured sparsity still has limited GPU support; best runtime gains seen on CPU/DeepSparse.
- OWL needs a small calibration dataset and two hyperparameters (M, λ) that require light tuning per model size.
When Not To Use
- If your deployment stack cannot exploit unstructured sparsity (for example, GPU kernels without sparse support), OWL's accuracy gains may not translate into speed.
- When model layers lack clear outlier dimensions (e.g., some vision models), OWL may add complexity without benefit.
- If you cannot run any calibration data pass (required to compute LOD), OWL cannot be applied.
Failure Modes
- Inverting the outlier ratio (OWL-inverse) severely degrades quality at high sparsity.
- Applying OWL per-layer instead of per-block can produce suboptimal, nearly uniform sparsity and poor perplexity.
- Combining OWL with naive magnitude pruning removes outliers and produces catastrophic perplexity increases.
Core Entities
Models
- LLaMA-V1 (7B/13B/30B/65B)
- OPT-6.7B
- LLaMA-V2-7B-chat-hf
- Vicuna-7B
- Mistral-7B
Metrics
- Perplexity
- Accuracy
- End-to-end decode latency / throughput (tokens/sec)
- Layerwise Outlier Distribution (LOD)
Datasets
- WikiText (validation)
- C4 (calibration data)
- BoolQ
- RTE
- HellaSwag
- WinoGrande
- ARC Easy
- ARC Challenge
- OpenbookQA
Benchmarks
- WikiText perplexity
- Zero-shot accuracy
- ImageNet-1K (vision appendix)

