Overview
The method is practically useful: it runs on one GPU, is evaluated on large MoEs (including a 480B model), and shows consistent gains. Caveats: real speedups depend on hardware support for sparse inference, and testing under distribution shift is limited.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
STUN cuts the serving memory and GPU count needed for large MoE models while keeping generation quality, enabling cheaper deployment of very large sparse models without costly retraining.
Summary TLDR
STUN is a two-phase pruning recipe for Mixture-of-Experts (MoE) models: (1) fast, scalable expert-level (structured) pruning that uses clustering and Taylor approximations to bring the number of GPU forward calls down to O(1), then (2) standard unstructured pruning inside the remaining experts. On Snowflake Arctic (480B, 128 experts) and Mixtral models, STUN retains much more accuracy than unstructured pruning at high sparsity (e.g., Arctic at 40% sparsity loses almost nothing on GSM8K) while needing only one H100 for about 1–2 hours. Code is public.
Problem Statement
MoEs cut compute by activating a few experts, but they still hold huge parameter counts and demand large serving memory. Existing structured expert pruning either scales poorly (exhaustive combinatorics) or underperforms; unstructured pruning can be effective but fails at high sparsity or on generative tasks. We need a pruning method that scales to hundreds of experts and keeps quality at high sparsity without costly GPU work.
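To make the "exhaustive combinatorics" point concrete, here is a minimal sketch that counts how many expert subsets a brute-force search would have to score in a single MoE layer. The function name `candidate_subsets` and the choice of pruning 2 of 8 and 32 of 128 experts are illustrative assumptions, not settings taken from the paper.

```python
import math


def candidate_subsets(num_experts: int, num_pruned: int) -> int:
    """Number of distinct ways to choose which experts to drop in one layer."""
    return math.comb(num_experts, num_pruned)


# Hypothetical settings: an 8-expert layer vs. a 128-expert layer (Arctic-sized).
small = candidate_subsets(8, 2)
large = candidate_subsets(128, 32)

print(f"8 experts, drop 2:    {small} candidate subsets")
print(f"128 experts, drop 32: about {large:.2e} candidate subsets")
```

If each candidate needed even one GPU forward pass to score, the 128-expert case would be out of reach, which is why an approximation that avoids per-candidate forward passes matters.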
Main Contribution
STUN: a two-phase pruning pipeline that applies expert-level structured pruning first, then unstructured pruning inside the surviving experts.
Scalable expert pruning that approximates the combinatorial search with O(1) GPU forward calls, using router-weight clustering and first-order Taylor approximations.
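The router-weight clustering idea can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes each expert has one non-zero row of router weights, groups experts greedily by cosine similarity against a hypothetical `threshold`, and keeps the first member of each group as a stand-in for the paper's Taylor-based selection. All function names are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_experts(router_rows, threshold):
    """Greedy grouping: an expert joins the first cluster whose
    representative (first member) has a similar enough router row."""
    clusters = []
    for idx, row in enumerate(router_rows):
        for members in clusters:
            if cosine(row, router_rows[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found: start a new one
    return clusters


def experts_to_keep(clusters):
    """Keep one expert per cluster (here simply the first member; an assumption)."""
    return [members[0] for members in clusters]


# Toy router weights: experts 0/1 and 2/3 respond to nearly the same inputs.
router_rows = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
clusters = cluster_experts(router_rows, threshold=0.95)
kept = experts_to_keep(clusters)
print(clusters, kept)
```

The grouping only reads the router weights, so it needs no forward passes through the model; that is the property that keeps the GPU cost independent of the number of candidate subsets.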
Key Findings
STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.
STUN substantially outperforms unstructured pruning at high sparsity for generative tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 70.28 (STUN, 40% sparsity) | 70.74 (unpruned) | -0.46 | GSM8K | STUN keeps near-original GSM8K accuracy at 40% sparsity | Table 2 |
| Accuracy | 43.97 (STUN, 65% sparsity) | 13.42 (OWL, 65% sparsity) | +30.55 | GSM8K | STUN far outperforms unstructured pruning at high sparsity | Table 2 |
What To Try In 7 Days
Run STUN expert pruning on one checkpoint of your MoE (cluster experts by their router weight rows) and test removing 10–40% of experts.
After expert pruning, run your current unstructured pruner (Wanda/OWL) on the survivors and compare end‑to‑end quality vs baseline.
Measure runtime and GPU count: verify single‑GPU pruning and evaluate cost savings for your serving fleet.
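When planning the two phases above, it helps to budget how much sparsity each phase contributes. The sketch below assumes all counted parameters live in the experts (attention and other shared weights are ignored) and that the unstructured pruner removes the same fraction from every surviving expert; the function names are hypothetical.

```python
def combined_sparsity(expert_ratio: float, unstructured_ratio: float) -> float:
    """Overall fraction of expert weights removed after both phases."""
    return 1.0 - (1.0 - expert_ratio) * (1.0 - unstructured_ratio)


def required_unstructured(target: float, expert_ratio: float) -> float:
    """Unstructured sparsity needed in surviving experts to hit the target."""
    if not 0.0 <= expert_ratio < 1.0:
        raise ValueError("expert_ratio must be in [0, 1)")
    if target < expert_ratio:
        raise ValueError("expert pruning alone already exceeds the target")
    return 1.0 - (1.0 - target) / (1.0 - expert_ratio)


# Example: aim for 40% overall sparsity after removing 20% of experts.
needed = required_unstructured(target=0.40, expert_ratio=0.20)
print(f"unstructured sparsity needed in survivors: {needed:.2%}")
print(f"check, combined sparsity: {combined_sparsity(0.20, needed):.2%}")
```

Because the two ratios multiply rather than add, removing experts first lowers the unstructured sparsity the second phase must reach for a given overall target.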
Risks & Boundaries
Limitations
Final speedups depend on hardware support for unstructured sparsity; GPUs may not see full runtime gains today.
Evaluations use a fixed set of tasks and C4 calibration; behavior under heavily skewed or domain-shifted data is not fully characterized.
When Not To Use
If your target inference hardware cannot accelerate unstructured sparsity, end‑to‑end latency may not improve.
If you require strict formal guarantees on worst-case behavior under distribution shift, do not rely on STUN without further validation.
Failure Modes
At extreme sparsity settings, the unstructured phase can still collapse generation quality unless it is tuned.
If router weights do not reflect true expert behavior for your data, clustering may remove useful experts.

