Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
STUN reduces the serving memory and GPU count needed for large MoE models while preserving generation quality, enabling cheaper deployment of very large sparse models without costly retraining.
Summary TLDR
STUN is a two‑phase pruning recipe for Mixture‑of‑Experts (MoE) models: (1) fast, scalable expert-level (structured) pruning that uses clustering and Taylor approximations to reduce the number of GPU forward calls to O(1), then (2) standard unstructured pruning inside the remaining experts. On Snowflake Arctic (480B, 128 experts) and Mixtral models, STUN retains much more accuracy at high sparsity than unstructured pruning alone (e.g., Arctic at 40% sparsity with almost no loss on GSM8K) while needing only one H100 for ~1–2 hours. Code is public.
Problem Statement
MoEs cut compute by activating only a few experts per input, but they still hold huge parameter counts and demand large serving memory. Existing structured expert pruning either scales poorly (exhaustive combinatorial search) or underperforms; unstructured pruning can be effective but fails at high sparsity or on generative tasks. What is needed is a pruning method that scales to hundreds of experts and keeps quality at high sparsity without costly GPU work.
Main Contribution
STUN: a two‑phase pruning pipeline that applies expert-level structured pruning first, then unstructured pruning inside the surviving experts.
Scalable expert pruning that approximates the combinatorial search with O(1) GPU forward calls, using router‑weight clustering and 1st‑order Taylor approximations.
Empirical evidence that expert pruning preserves or increases intra‑expert robustness (measured by weight kurtosis), so the subsequent unstructured pruning remains effective.
Practical recipe: one H100 and ~2 hours to prune Snowflake Arctic to ~40% sparsity with almost no performance loss; code released.
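The clustering step of the expert pruning phase can be illustrated with a minimal sketch. It assumes each expert is represented by its row of the router weight matrix and that experts with similar rows are redundant. The greedy cosine-similarity grouping, the `threshold` value, and the keep-the-first-member rule are illustrative assumptions, not the paper's exact procedure, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_experts(router_rows, threshold=0.9):
    """Greedy grouping (an assumption for illustration): an expert joins the
    first cluster whose representative router row is similar enough,
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of expert ids; first id is the representative
    for idx, row in enumerate(router_rows):
        for members in clusters:
            if cosine(row, router_rows[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

def experts_to_keep(clusters):
    """Keep one representative per cluster; the rest are pruning candidates."""
    return [members[0] for members in clusters]

# Toy router with 4 experts and a hidden size of 3 (made-up numbers).
router_rows = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
clusters = cluster_experts(router_rows)
print(clusters)                   # [[0, 1], [2, 3]]
print(experts_to_keep(clusters))  # [0, 2]
```

Clustering needs only the router weights, so this step runs without any GPU forward pass; deciding which member of a cluster to keep is where a cheap approximation such as the Taylor expansion would come in.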
Key Findings
STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.
STUN substantially outperforms unstructured pruning at high sparsity for generative tasks.
The expert pruning phase reduces pruning compute to O(1) GPU forward calls and is practical on one GPU.
Expert pruning preserves or increases weight kurtosis, which keeps models robust to later unstructured pruning.
STUN works across MoE and some non‑MoE models and scales better when models have many small experts.
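The kurtosis finding can be made concrete with a short sketch. It computes the (non-excess) kurtosis of a flattened weight list, i.e. the fourth central moment divided by the squared variance; whether the paper uses this exact variant is an assumption. A higher value means the distribution is dominated by a few large-magnitude weights, which is the property the summary associates with robustness to unstructured pruning.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance.
    A normal distribution gives about 3; heavier tails give more."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

# Toy weight vectors (made-up numbers).
flat = [-1.0, -1.0, 1.0, 1.0]          # all weights equally important
peaked = [0.0] * 8 + [-3.0, 3.0]       # mostly zeros plus two outliers

print(kurtosis(flat))    # 1.0
print(kurtosis(peaked))  # about 5.0
```

Comparing this statistic per expert before and after expert pruning is one way to check whether the surviving experts are still good candidates for a magnitude- or activation-based unstructured pruner.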
Results
Accuracy
Pruning runtime (Snowflake Arctic expert pruning)
Who Should Care
What To Try In 7 Days
Run STUN expert pruning on one checkpoint of your MoE (use router rows for clustering) to test 10–40% expert removal.
After expert pruning, run your current unstructured pruner (Wanda/OWL) on the survivors and compare end‑to‑end quality vs baseline.
Measure runtime and GPU count: verify single‑GPU pruning and evaluate cost savings for your serving fleet.
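The first two steps can be sketched end to end on toy data. The sketch assumes the expert ids to keep are already known, and it uses a Wanda-style score (weight magnitude times the L2 norm of the matching input feature over calibration tokens, pruned per output row) for the unstructured phase. Weights are plain nested lists for readability; `feature_norms`, `wanda_prune_row`, and `stun_sketch` are hypothetical names, not functions from the released code.

```python
import math

def feature_norms(activations):
    """L2 norm of each input feature across calibration tokens."""
    n_features = len(activations[0])
    return [math.sqrt(sum(tok[j] ** 2 for tok in activations))
            for j in range(n_features)]

def wanda_prune_row(weights, norms, sparsity):
    """Zero the lowest-scoring fraction of one output row,
    where score = |weight| * input-feature norm."""
    scores = [abs(w) * n for w, n in zip(weights, norms)]
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda j: scores[j])
    pruned = set(order[:n_prune])
    return [0.0 if j in pruned else w for j, w in enumerate(weights)]

def stun_sketch(experts, keep_ids, activations, sparsity):
    """Phase 1: drop experts not in keep_ids.
    Phase 2: unstructured pruning inside each surviving expert."""
    norms = feature_norms(activations)
    return {e: [wanda_prune_row(row, norms, sparsity) for row in experts[e]]
            for e in keep_ids}

# Three toy experts, each a 1x4 weight matrix (made-up numbers).
experts = {
    0: [[0.5, -0.15, 0.2, 0.1]],
    1: [[0.4, 0.4, 0.4, 0.4]],
    2: [[0.3, 0.3, -0.4, 0.01]],
}
# Two calibration tokens; per-feature norms come out as [1.0, 2.0, 0.5, 4.0].
activations = [[0.6, 0.0, 0.3, 0.0], [0.8, 2.0, 0.4, 4.0]]

pruned = stun_sketch(experts, keep_ids=[0, 2], activations=activations, sparsity=0.5)
print(pruned[0])  # [[0.5, 0.0, 0.0, 0.1]]
print(pruned[2])  # [[0.3, 0.3, 0.0, 0.0]]
```

In expert 0 the small weight 0.1 survives because its input feature has a large norm, which is how an activation-aware score differs from plain magnitude pruning. For a real model, swap in your existing pruner for the second phase and compare against running it alone at the same overall sparsity.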
Optimization Features
Infra Optimization
- single‑GPU (H100) expert pruning in ~0.6–1.1 hours for Arctic
- reduces number of GPUs required for pruning compared to combinatorial search
Model Optimization
- expert-level structured pruning
- structured-then-unstructured pruning (STUN)
- clustering of experts from router weights
- 1st-order Taylor approximation for selective reconstruction
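When combining the two phases, it helps to see how the per-phase sparsities compose. The relation below is a back-of-the-envelope sketch, assuming equally sized experts and counting only expert parameters; the symbols $s_e$ (fraction of experts removed), $s_u$ (unstructured sparsity inside survivors), and $s_{\text{total}}$ are introduced here for illustration and are not notation from the paper.

```latex
s_{\text{total}} = 1 - (1 - s_e)(1 - s_u)
\quad\Longrightarrow\quad
s_u = 1 - \frac{1 - s_{\text{total}}}{1 - s_e}
% Example: s_e = 0.20 and s_total = 0.40 give s_u = 1 - 0.60 / 0.80 = 0.25
```

Under these assumptions, removing 20% of the experts means the unstructured pruner only has to reach 25% sparsity inside the survivors to hit 40% overall, rather than 40% on its own.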
System Optimization
- no backpropagation or fine‑tuning required
- compatible with existing unstructured pruners (Wanda, OWL)
Inference Optimization
- reduces GPU forward calls to O(1) for expert search
- enables pruning on a single H100
- preserves model quality at high sparsity, so fewer GPUs are needed at serving
Reproducibility
Code Urls
Data Urls
- C4 dataset (used for calibration)
- LM-Evaluation-Harness (evaluation)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Final speedups depend on hardware support for unstructured sparsity; GPUs may not see full runtime gains today.
- Evaluations use a fixed set of tasks and C4 calibration; behavior under heavily skewed or domain-shifted data is not fully characterized.
- The second phase relies on existing unstructured pruners, so STUN inherits their limitations.
When Not To Use
- If your target inference hardware cannot accelerate unstructured sparsity, end‑to‑end latency may not improve.
- If you require strict formal guarantees on worst‑case behavior under distribution shift and cannot run further validation.
Failure Modes
- At extreme sparsity levels, unstructured pruning may still collapse generation quality if not tuned.
- If router weights do not reflect true expert behavior for your data, clustering may remove useful experts.
- Sparse acceleration is hardware and library dependent; storage reduction may not translate to latency reduction.
Core Entities
Models
- Snowflake Arctic (480B, 128 experts)
- Mixtral-8x7B
- Mixtral-8x22B
Metrics
- Accuracy
- LM-eval average
Datasets
- GSM8K
- ARC-challenge
- ARC-easy
- HellaSwag
- MMLU
- BoolQ
- OpenBookQA
- RTE
- WinoGrande
- C4
Benchmarks
- LM-Evaluation-Harness

