Prune MoEs first at the expert level, then at fine granularity inside the surviving experts: fast, single‑GPU, and better than either kind of pruning alone.

September 10, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is practically useful: it runs on one GPU, is evaluated on large MoEs (including a 480B model), and shows consistent gains; caveats include hardware support for sparse inference and limited distributional tests.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jaeseong Lee, Seung-won Hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

Why It Matters For Business

STUN cuts serving memory and GPUs for large MoE models while keeping generation quality, enabling cheaper deployment of very large sparse models without costly retraining.

Summary TLDR

STUN is a two‑phase pruning recipe for Mixture‑of‑Experts (MoE) models. Phase 1 is fast, scalable expert-level (structured) pruning that uses clustering and Taylor approximations to reduce the number of GPU forward calls to O(1). Phase 2 applies standard unstructured pruning inside the remaining experts. On Snowflake Arctic (480B, 128 experts) and Mixtral models, STUN retains much more accuracy at high sparsity (e.g., on Arctic, 40% sparsity costs almost nothing on GSM8K) while needing only one H100 for ~1–2 hours. Code is public.

Problem Statement

MoEs cut compute by activating a few experts, but they still hold huge parameter counts and demand large serving memory. Existing structured expert pruning either scales poorly (exhaustive combinatorics) or underperforms; unstructured pruning can be effective but fails at high sparsity or on generative tasks. We need a pruning method that scales to hundreds of experts and keeps quality at high sparsity without costly GPU work.

Main Contribution

STUN: a two‑phase pruning pipeline that applies expert-level structured pruning first, then unstructured pruning inside the surviving experts.

Scalable expert pruning that approximates combinatorial search to achieve O(1) GPU forward calls using router‑weight clustering and 1st‑order Taylor approximations.
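
One way to picture the router-weight clustering step is the minimal sketch below. It assumes the router is a single matrix with one row per expert, groups experts by cosine similarity of their rows using greedy average-linkage merging, and keeps one representative per cluster. The function names, the linkage rule, and the representative rule are illustrative assumptions, not the paper's exact procedure, and the Taylor-based reconstruction step is left out.

```python
import numpy as np


def _unit_rows(router_w: np.ndarray) -> np.ndarray:
    """L2-normalize each router row so dot products become cosine similarities."""
    return router_w / np.linalg.norm(router_w, axis=1, keepdims=True)


def cluster_experts_by_router(router_w: np.ndarray, n_keep: int) -> list[list[int]]:
    """Greedily merge experts until n_keep clusters remain (hypothetical helper).

    router_w: (n_experts, hidden_dim) matrix, one row per expert (assumed layout).
    Returns a list of clusters, each a list of expert indices.
    """
    unit = _unit_rows(router_w)
    sim = unit @ unit.T  # pairwise cosine similarity between router rows
    clusters = [[i] for i in range(router_w.shape[0])]
    while len(clusters) > n_keep:
        best, pair = -np.inf, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean similarity across the two clusters.
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_representatives(router_w: np.ndarray, clusters: list[list[int]]) -> list[int]:
    """Keep, per cluster, the expert whose router row is closest to the cluster mean.

    This selection rule is an assumption made for illustration only.
    """
    unit = _unit_rows(router_w)
    reps = []
    for members in clusters:
        centroid = unit[members].mean(axis=0)
        reps.append(members[int(np.argmax(unit[members] @ centroid))])
    return sorted(reps)
```

The sketch reads only the router matrix and runs no model forward passes; the pairwise merge loop is cubic in the number of experts, which is acceptable for a few hundred experts but is not tuned for speed.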

Key Findings

STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.

Numbers: Arctic GSM8K: unpruned 70.74 → STUN (40%) 70.28 (Table 2)

Practical Use: You can compress a 480B MoE by ~40% and still run generation tasks without retraining; deploy on fewer GPUs.

Evidence Ref: Table 2

STUN substantially outperforms unstructured pruning at high sparsity for generative tasks.

Numbers: Arctic 65% sparsity: STUN GSM8K 43.97 vs OWL 13.42 (≈ +30.6 points) (Table 2)

Practical Use: For aggressive compression (>50%), prefer STUN over pure unstructured methods to avoid a catastrophic drop in generation quality.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 70.28 (STUN, 40% sparsity) | 70.74 (unpruned) | -0.46 | GSM8K | STUN keeps near-original GSM8K accuracy at 40% sparsity | Table 2
Accuracy | 43.97 (STUN, 65% sparsity) | 13.42 (OWL, 65% sparsity) | +30.55 | GSM8K | STUN far outperforms unstructured pruning at high sparsity | Table 2
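
The deltas in these rows can be re-derived from the reported values. The short check below uses only the four GSM8K numbers quoted above; the helper name `delta_and_ratio` is hypothetical, and the ratio column is an extra view added here for context, not a figure from the paper.

```python
# GSM8K accuracies as reported above (Table 2 of the paper).
rows = [
    ("STUN 40% vs unpruned", 70.28, 70.74),
    ("STUN 65% vs OWL 65%", 43.97, 13.42),
]


def delta_and_ratio(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute difference in points, value / baseline)."""
    return round(value - baseline, 2), round(value / baseline, 3)


for name, value, baseline in rows:
    delta, ratio = delta_and_ratio(value, baseline)
    print(f"{name}: delta={delta:+.2f} points, ratio={ratio:.3f}x")
```

The first row is a loss of under half a point against the unpruned model, while the second compares two pruned models at the same sparsity, so the two deltas answer different questions.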

What To Try In 7 Days

Run STUN expert pruning on one checkpoint of your MoE (use router rows for clustering) to test 10–40% expert removal.

After expert pruning, run your current unstructured pruner (Wanda/OWL) on the survivors and compare end‑to‑end quality vs baseline.

Measure runtime and GPU count: verify single‑GPU pruning and evaluate cost savings for your serving fleet.
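
The steps above can be sketched on toy matrices before touching a real checkpoint. The block below is a minimal illustration under several assumptions: experts are equal-sized weight matrices in a dict, the survivors (`keep_ids`) come from whatever expert-pruning step you ran, all experts share one calibration batch, and the unstructured phase is a simplified Wanda-style score (|W| times the per-feature activation norm, pruned per output row). All function names are hypothetical, and none of this is the authors' implementation or the Wanda/OWL reference code.

```python
import numpy as np


def combined_sparsity(expert_frac_removed: float, unstructured_sparsity: float) -> float:
    """Overall fraction of expert parameters removed after both phases.

    Assumes equal-sized experts and counts expert parameters only.
    """
    return 1.0 - (1.0 - expert_frac_removed) * (1.0 - unstructured_sparsity)


def wanda_style_mask(weight: np.ndarray, calib_x: np.ndarray, sparsity: float) -> np.ndarray:
    """Boolean keep-mask for one (out_features, in_features) weight matrix.

    Per output row, drops the `sparsity` fraction of weights with the lowest
    |W_ij| * ||x_j||_2, where x_j is input feature j over the calibration batch.
    """
    feat_norm = np.linalg.norm(calib_x, axis=0)   # shape: (in_features,)
    score = np.abs(weight) * feat_norm            # broadcasts across output rows
    n_prune = int(weight.shape[1] * sparsity)
    mask = np.ones_like(weight, dtype=bool)
    if n_prune > 0:
        prune_idx = np.argsort(score, axis=1)[:, :n_prune]
        np.put_along_axis(mask, prune_idx, False, axis=1)
    return mask


def two_phase_prune(experts: dict, keep_ids: list, calib_x: np.ndarray, sparsity: float) -> dict:
    """Phase 1: keep only `keep_ids`. Phase 2: mask weights inside each survivor."""
    pruned = {}
    for expert_id in keep_ids:
        w = experts[expert_id]
        pruned[expert_id] = w * wanda_style_mask(w, calib_x, sparsity)
    return pruned
```

Under the stated assumptions, removing 20% of the experts and then 25% of the weights in the survivors gives 1 - 0.8 × 0.75 = 0.40 overall, which is one illustrative way to reach a 40% budget; the split the paper actually uses may differ.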

Optimization Features

Infra Optimization: single‑GPU pruning (H100) in ~0.6–1.1 hours for Arctic; reduces number of GPUs required for pruning compared to combinatorial search

Model Optimization: expert-level structured pruning; structured-then-unstructured pruning (STUN); clustering of experts from router weights; 1st-order Taylor approximation for selective reconstruction

System Optimization: no backpropagation or fine‑tuning required; compatible with existing unstructured pruners (Wanda, OWL)

Inference Optimization: reduces GPU forward calls to O(1) for expert search; enables pruning on a single H100; preserves model quality at high sparsity so fewer GPUs needed at serving

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4 dataset (used for calibration)

LM-Evaluation-Harness (evaluation)

Risks & Boundaries

Limitations

Final speedups depend on hardware support for unstructured sparsity; GPUs may not see full runtime gains today.

Evaluations use a fixed set of tasks and C4 calibration; behavior under heavily skewed or domain-shifted data is not fully characterized.

When Not To Use

If your target inference hardware cannot accelerate unstructured sparsity, end‑to‑end latency may not improve.

If you require strict formal guarantees on worst‑case behavior under distribution shift, do not rely on STUN without further validation.

Failure Modes

At extreme sparsity levels, the unstructured phase may still collapse generation quality if it is not tuned.

If router weights do not reflect true expert behavior for your data, clustering may remove useful experts.
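
A rough way to probe the router-weight failure mode is to measure how often each expert is actually selected on your own data and compare that with the experts a clustering step wants to drop. The sketch below assumes a linear router with one row per expert and top-k routing on raw logits; `top_k=2` is only a default chosen for illustration, and both function names are hypothetical.

```python
import numpy as np


def expert_usage(hidden: np.ndarray, router_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Share of routing slots each expert receives on a batch of token states.

    hidden:   (n_tokens, hidden_dim) inputs to the MoE layer (assumed layout).
    router_w: (n_experts, hidden_dim) router weights, one row per expert.
    """
    logits = hidden @ router_w.T                                   # (n_tokens, n_experts)
    # Indices of the top_k largest logits per token (order within the k is arbitrary).
    top = np.argpartition(-logits, top_k - 1, axis=1)[:, :top_k]
    counts = np.bincount(top.ravel(), minlength=router_w.shape[0])
    return counts / counts.sum()


def usage_of_removed(usage: np.ndarray, removed_ids: list) -> float:
    """Total routing share carried by the experts slated for removal."""
    return float(usage[removed_ids].sum())
```

If the experts marked for removal carry a large routing share on your domain data, that is a signal to validate the pruned model on that data before deploying it.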

Core Entities

Models

Snowflake Arctic (480B, 128 experts), Mixtral-8x7B, Mixtral-8x22B

Metrics

Accuracy, LM-eval average

Datasets

GSM8K, ARC-challenge, ARC-easy, HellaSwag, MMLU, BoolQ, OpenBookQA, RTE, WinoGrande, C4

Benchmarks

LM-Evaluation-Harness