Prune MoEs first at expert level, then fine-grain inside experts — fast, single‑GPU, and better than either pruning alone.

September 10, 2024 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Jaeseong Lee, seung-won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

Why It Matters For Business

STUN cuts serving memory and GPUs for large MoE models while keeping generation quality, enabling cheaper deployment of very large sparse models without costly retraining.

Summary TLDR

STUN is a two‑phase pruning recipe for Mixture‑of‑Experts (MoE) models: (1) fast, scalable expert-level (structured) pruning using clustering and Taylor approximations that reduces GPU forward calls to O(1), then (2) standard unstructured pruning inside remaining experts. Empirically on Snowflake Arctic (480B, 128 experts) and Mixtral models, STUN achieves much higher retained accuracy at high sparsity (e.g., Arctic: 40% sparsity with near‑no loss on GSM8K) while needing only one H100 for ~1–2 hours. Code is public.
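The way the two phases compose into one overall sparsity figure can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 20% expert-removal split is an assumed example rather than a setting reported by the authors, and it counts expert parameters only (attention, router, and shared weights are ignored).

```python
def combined_sparsity(expert_sparsity: float, unstructured_sparsity: float) -> float:
    """Fraction of expert parameters removed after both phases.

    expert_sparsity: fraction of experts dropped in phase 1.
    unstructured_sparsity: fraction of weights zeroed inside surviving experts.
    """
    kept = (1.0 - expert_sparsity) * (1.0 - unstructured_sparsity)
    return 1.0 - kept


def unstructured_needed(target: float, expert_sparsity: float) -> float:
    """Unstructured sparsity needed inside survivors to hit an overall target."""
    if not 0.0 <= expert_sparsity <= target < 1.0:
        raise ValueError("need 0 <= expert_sparsity <= target < 1")
    return 1.0 - (1.0 - target) / (1.0 - expert_sparsity)


# Assumed example: drop 20% of experts, aim for 40% overall expert sparsity.
phase2 = unstructured_needed(target=0.40, expert_sparsity=0.20)
print(f"phase-2 sparsity inside survivors: {phase2:.2f}")
print(f"overall: {combined_sparsity(0.20, phase2):.2f}")
```

Because the survivors are already a smaller set, the second phase can run at a lower per-expert sparsity than a purely unstructured pruner would need for the same overall target.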

Problem Statement

MoEs cut compute by activating a few experts, but they still hold huge parameter counts and demand large serving memory. Existing structured expert pruning either scales poorly (exhaustive combinatorics) or underperforms; unstructured pruning can be effective but fails at high sparsity or on generative tasks. We need a pruning method that scales to hundreds of experts and keeps quality at high sparsity without costly GPU work.

Main Contribution

STUN: a two‑phase pruning pipeline — expert-level structured pruning first, then unstructured pruning inside survivors.

Scalable expert pruning that approximates combinatorial search to achieve O(1) GPU forward calls using router‑weight clustering and 1st‑order Taylor approximations.

Empirical evidence that expert pruning preserves or increases intra-expert robustness (measured by weight kurtosis), so the subsequent unstructured pruning remains effective.

Practical recipe: one H100 and ~2 hours to prune Snowflake Arctic to ~40% sparsity with near‑no performance loss; code released.
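The idea of clustering experts from router weights can be illustrated without touching the GPU, since it only reads the router matrix. The sketch below is a simplified stand-in, not the authors' algorithm: the greedy cosine-threshold clustering, the `threshold` value, and the rule of keeping the first member of each cluster are all assumptions made for illustration, and the Taylor-approximation step is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_experts(router_rows, threshold=0.9):
    """Greedy clustering of experts by router-row similarity.

    router_rows[i] is the router weight row for expert i. Each expert joins
    the first cluster whose representative (first member) is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []
    for i, row in enumerate(router_rows):
        for members in clusters:
            if cosine(row, router_rows[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def experts_to_keep(clusters):
    """Keep one representative per cluster (placeholder selection rule)."""
    return sorted(members[0] for members in clusters)


# Toy router with four experts: experts 0/1 and 2/3 route similar inputs.
rows = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
clusters = cluster_experts(rows)
print(clusters, experts_to_keep(clusters))
```

No forward pass is needed for this step, which is consistent with the O(1) GPU-call claim: the cost of grouping experts does not grow with the number of candidate expert subsets.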

Key Findings

STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.

Numbers: Arctic GSM8K: unpruned 70.74 → STUN (40%) 70.28 (Table 2)

STUN substantially outperforms unstructured pruning at high sparsity for generative tasks.

Numbers: Arctic 65% sparsity: STUN GSM8K 43.97 vs OWL 13.42 (Δ ≈ +30.6 points) (Table 2)

The expert pruning phase needs only O(1) GPU forward calls and is practical on one GPU.

Numbers: STUN expert pruning runs in ~0.58–1.12 hours on 1 H100, versus a combinatorial method reported as infeasible (Table 7).

Expert pruning preserves or increases weight kurtosis, which keeps models robust to later unstructured pruning.

Numbers: Reported kurtosis increased from 14248 to 15623 after expert pruning (Section 4.3 / Sec 5.1)

STUN works across MoE and some non‑MoE models and scales better when models have many small experts.

Numbers: Mixtral-8x7B and 8x22B show consistent gains; gap widens as expert count increases (Figure 3, Table 2).
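The kurtosis finding above is easier to read with the statistic spelled out. The sketch below uses the standard Pearson definition (fourth central moment divided by squared variance); whether the paper uses exactly this normalization is an assumption, and the toy weight lists are made up for illustration.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant input")
    return m4 / (m2 * m2)


# Toy weights: evenly spread magnitudes vs. mostly tiny with one outlier.
flat = [-1.0, 1.0, -1.0, 1.0]
heavy_tailed = [0.0] * 99 + [10.0]

print(kurtosis(flat))          # low: no weight stands out
print(kurtosis(heavy_tailed))  # high: a few weights carry most of the mass
```

High kurtosis means a small number of weights dominate, so many near-zero weights can be removed cheaply. That is the intuition behind reading a kurtosis increase as a sign that unstructured pruning will still work after experts are removed.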

Results

Accuracy (GSM8K, Snowflake Arctic)

Value: 70.28 (STUN, 40% sparsity)

Baseline: 70.74 (unpruned)

Accuracy (GSM8K, Snowflake Arctic)

Value: 43.97 (STUN, 65% sparsity)

Baseline: 13.42 (OWL, 65% sparsity)

Accuracy

Value: 25.09 (STUN, 65% sparsity)

Baseline: 1.29 (OWL or LLM-Pruner, 65% sparsity)

Pruning runtime (Snowflake Arctic expert pruning)

Value: ≈1.12h (STUN w/ OWL) or 0.58h (STUN w/ Wanda) on 1 H100

Baseline: Infeasible / >8 GPUs for Lu et al. (2024a) combinatorial method

Who Should Care

What To Try In 7 Days

Run STUN expert pruning on one checkpoint of your MoE (use router rows for clustering) to test 10–40% expert removal.

After expert pruning, run your current unstructured pruner (Wanda/OWL) on the survivors and compare end‑to‑end quality vs baseline.

Measure runtime and GPU count: verify single‑GPU pruning and evaluate cost savings for your serving fleet.
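For the second step above, it helps to see what a Wanda-style pruner does to one surviving expert's weight matrix. The sketch below follows the Wanda scoring rule as commonly described (absolute weight times the L2 norm of the matching input feature over calibration data, pruned per output row). It is a pure-Python illustration on plain lists, not the Wanda or STUN reference code, and the function name and toy numbers are hypothetical.

```python
import math


def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight: list of rows, one per output neuron, each of length n_in.
    activations: calibration inputs, each a list of length n_in.
    sparsity: fraction of weights to zero in every row.
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across all calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weight:
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy case: the largest weight (2.0) sees a tiny input, so it is pruned,
# while the small weight (-0.1) on a large input is kept.
weight = [[0.5, -0.1, 2.0, 0.3]]
calibration = [[1.0, 10.0, 0.1, 1.0], [1.0, 10.0, 0.1, 1.0]]
print(wanda_prune(weight, calibration, sparsity=0.5))
```

When comparing against your baseline, run the same pruner at the same overall sparsity with and without the expert-pruning step, so that any quality difference is attributable to the ordering rather than to the pruner itself.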

Optimization Features

Infra Optimization

  • single‑GPU pruning (H100) in ~0.6–1.1 hours for Arctic
  • reduces number of GPUs required for pruning compared to combinatorial search

Model Optimization

  • expert-level structured pruning
  • structured-then-unstructured pruning (STUN)
  • clustering of experts from router weights
  • 1st-order Taylor approximation for selective reconstruction

System Optimization

  • no backpropagation or fine‑tuning required
  • compatible with existing unstructured pruners (Wanda, OWL)

Inference Optimization

  • reduces GPU forward calls to O(1) for expert search
  • enables pruning on a single H100
  • preserves model quality at high sparsity, so fewer GPUs are needed at serving

Reproducibility

Data Urls

  • C4 dataset (used for calibration)
  • LM-Evaluation-Harness (evaluation)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Final speedups depend on hardware support for unstructured sparsity; GPUs may not see full runtime gains today.
  • Evaluations use a fixed set of tasks and C4 calibration; behavior under heavily skewed or domain-shifted data is not fully characterized.
  • The second phase relies on existing unstructured pruners, so it inherits their limitations.

When Not To Use

  • If your target inference hardware cannot accelerate unstructured sparsity, end‑to‑end latency may not improve.
  • If you require strict formal guarantees on worst‑case behavior under distribution shift without further validation.

Failure Modes

  • At extreme sparsity levels, unstructured pruning may still collapse generation quality if it is not tuned.
  • If router weights do not reflect true expert behavior for your data, clustering may remove useful experts.
  • Sparse acceleration is hardware and library dependent; storage reduction may not translate to latency reduction.

Core Entities

Models

  • Snowflake Arctic (480B, 128 experts)
  • Mixtral-8x7B
  • Mixtral-8x22B

Metrics

  • Accuracy
  • LM-eval average

Datasets

  • GSM8K
  • ARC-challenge
  • ARC-easy
  • HellaSwag
  • MMLU
  • BoolQ
  • OpenBookQA
  • RTE
  • WinoGrande
  • C4

Benchmarks

  • LM-Evaluation-Harness