Prune MoEs at the expert level first, then at fine grain inside the surviving experts: fast, single-GPU, and better than either pruning method alone.

Overview

Decision Snapshot: Ready For Pilot

The method is practically useful: it runs on one GPU, is evaluated on large MoEs (including a 480B model), and shows consistent gains; caveats include hardware support for sparse inference and limited distributional tests.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Jaeseong Lee, Seung-won Hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He

Why It Matters For Business

STUN cuts serving memory and GPUs for large MoE models while keeping generation quality, enabling cheaper deployment of very large sparse models without costly retraining.

Who Should Care

CTO, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

STUN is a two-phase pruning recipe for Mixture-of-Experts (MoE) models: (1) fast, scalable expert-level (structured) pruning that uses clustering and Taylor approximations to reduce the number of GPU forward calls to O(1), then (2) standard unstructured pruning inside the remaining experts. On Snowflake Arctic (480B, 128 experts) and Mixtral models, STUN retains much more accuracy at high sparsity than unstructured pruning alone (e.g., Arctic at 40% sparsity shows almost no loss on GSM8K) while needing only one H100 for ~1–2 hours. Code is public.

Problem Statement

MoEs cut compute by activating a few experts, but they still hold huge parameter counts and demand large serving memory. Existing structured expert pruning either scales poorly (exhaustive combinatorics) or underperforms; unstructured pruning can be effective but fails at high sparsity or on generative tasks. We need a pruning method that scales to hundreds of experts and keeps quality at high sparsity without costly GPU work.

Main Contribution

STUN: a two-phase pruning pipeline that applies expert-level structured pruning first, then unstructured pruning inside the surviving experts.

Scalable expert pruning that approximates combinatorial search to achieve O(1) GPU forward calls using router‑weight clustering and 1st‑order Taylor approximations.
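To make the router-weight clustering idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each expert is described by its row of the router weight matrix, groups experts by average-linkage cosine similarity, and keeps the lowest-indexed expert per cluster. The similarity measure, the linkage rule, the survivor rule, and the names `cluster_experts` and `pick_survivors` are all illustrative assumptions; the Taylor-approximation step is not covered.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_experts(router_rows, num_clusters):
    """Greedy agglomerative clustering of experts by router-row similarity.

    router_rows[i] is the router weight row for expert i (assumed input).
    Returns a list of clusters, each a list of expert indices.
    """
    clusters = [[i] for i in range(len(router_rows))]

    def average_linkage(a, b):
        sims = [cosine_similarity(router_rows[i], router_rows[j])
                for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > num_clusters:
        best = None  # (similarity, index_x, index_y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_linkage(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the most similar pair
        del clusters[y]
    return clusters


def pick_survivors(clusters):
    """Keep one expert per cluster (here: the lowest index, an arbitrary choice)."""
    return sorted(min(c) for c in clusters)


# Toy router with 4 experts and 2 hidden dimensions.
toy_router = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_clusters = cluster_experts(toy_router, num_clusters=2)
toy_survivors = pick_survivors(toy_clusters)
```

This step only reads router weights, so it needs no forward passes through the model. The nested pair search is quadratic per merge, which is fine for a toy but would need a smarter implementation for hundreds of experts across many layers.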

Key Findings

STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.

Numbers: Arctic GSM8K: unpruned 70.74 → STUN (40%) 70.28 (Table 2)

Practical Use: You can compress a 480B MoE by ~40% and still run generation tasks without retraining; deploy on fewer GPUs.

Evidence Ref: Table 2

STUN substantially outperforms unstructured pruning at high sparsity for generative tasks.

Numbers: Arctic 65% sparsity: STUN GSM8K 43.97 vs OWL 13.42 (Δ ≈ +30.6 points) (Table 2)

Practical Use: For aggressive compression (>50%), prefer STUN over pure unstructured methods to avoid a catastrophic drop in generation quality.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.28 (STUN, 40% sparsity)	70.74 (unpruned)	-0.46	GSM8K	STUN keeps near-original GSM8K accuracy at 40% sparsity	Table 2
Accuracy	43.97 (STUN, 65% sparsity)	13.42 (OWL, 65% sparsity)	+30.55	GSM8K	STUN far outperforms unstructured pruning at high sparsity	Table 2
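The deltas in the table can be rechecked with a few lines of arithmetic. The sketch below uses only the four numbers reported above; the field names and the `retention` helper are illustrative. Note that the two rows use different baselines (the unpruned model in the first, OWL at the same sparsity in the second), so the deltas are not directly comparable.

```python
# Values copied from the Results table (GSM8K accuracy, Table 2).
results = [
    {"setting": "STUN, 40% sparsity", "value": 70.28, "baseline": 70.74},  # vs unpruned
    {"setting": "STUN, 65% sparsity", "value": 43.97, "baseline": 13.42},  # vs OWL, 65%
]


def delta(row):
    """Absolute difference in accuracy points, rounded to two decimals."""
    return round(row["value"] - row["baseline"], 2)


def retention(row):
    """Fraction of the baseline accuracy that the pruned model keeps."""
    return row["value"] / row["baseline"]


for row in results:
    print(row["setting"], "delta:", delta(row), "ratio:", round(retention(row), 4))
```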

What To Try In 7 Days

Run STUN expert pruning on one checkpoint of your MoE (use router rows for clustering) to test 10–40% expert removal.

After expert pruning, run your current unstructured pruner (Wanda/OWL) on the survivors and compare end‑to‑end quality vs baseline.

Measure runtime and GPU count: verify single‑GPU pruning and evaluate cost savings for your serving fleet.
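When planning the pilot, it helps to see how the two phases combine into one overall sparsity figure. The sketch below assumes that all counted parameters live in equally sized experts, so removing a fraction of experts and then pruning a fraction of the surviving weights multiplies the kept fractions. That assumption, and both function names, are illustrative; attention and router parameters are ignored.

```python
def overall_sparsity(expert_fraction_removed, unstructured_sparsity):
    """Overall sparsity after both phases, under the equal-expert assumption."""
    for name, x in (("expert_fraction_removed", expert_fraction_removed),
                    ("unstructured_sparsity", unstructured_sparsity)):
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    kept = (1.0 - expert_fraction_removed) * (1.0 - unstructured_sparsity)
    return 1.0 - kept


def required_unstructured_sparsity(target_overall, expert_fraction_removed):
    """Unstructured sparsity needed inside survivors to hit an overall target."""
    if not 0.0 <= expert_fraction_removed < 1.0:
        raise ValueError("expert_fraction_removed must be in [0, 1)")
    if not expert_fraction_removed <= target_overall <= 1.0:
        raise ValueError("target must be between the expert fraction and 1")
    return 1.0 - (1.0 - target_overall) / (1.0 - expert_fraction_removed)


# Example: remove 20% of experts, then prune 25% of the surviving weights.
combined = overall_sparsity(0.20, 0.25)
# Example: how much unstructured pruning reaches 40% overall after removing 20%?
needed = required_unstructured_sparsity(0.40, 0.20)
```

Under these assumptions, removing 20% of experts and pruning 25% of what remains gives 40% overall sparsity, which shows why expert pruning leaves the second phase a less aggressive target.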

Optimization Features

Infra Optimization

single-GPU pruning (H100) in ~0.6–1.1 hours for Arctic

reduces number of GPUs required for pruning compared to combinatorial search

Model Optimization

expert-level structured pruning

structured-then-unstructured pruning (STUN)

clustering of experts from router weights

1st-order Taylor approximation for selective reconstruction

System Optimization

no backpropagation or fine-tuning required

compatible with existing unstructured pruners (Wanda, OWL)
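For readers unfamiliar with the unstructured pruners named here, the following is a simplified Wanda-style sketch in pure Python. It scores each weight by its magnitude times the L2 norm of the matching input feature over a calibration batch, then zeroes the lowest-scoring weights within each output row. The function name, the list-of-lists layout, and the toy data are assumptions for illustration; this is neither the Wanda reference code nor the STUN pipeline, and OWL's layer-wise sparsity allocation is not shown.

```python
import math


def wanda_style_prune(weight, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weight:      list of rows, one per output neuron, one column per input feature
    activations: list of calibration input vectors (same length as a weight row)
    sparsity:    fraction of weights to remove in every row
    """
    num_inputs = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    feature_norms = [
        math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(num_inputs)
    ]
    num_prune = int(num_inputs * sparsity)
    pruned = []
    for row in weight:
        scores = [abs(w) * feature_norms[j] for j, w in enumerate(row)]
        # Indices of the num_prune smallest scores in this row.
        drop = set(sorted(range(num_inputs), key=lambda j: scores[j])[:num_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy example: the second input feature has large activations, so its small
# weight survives even though plain magnitude pruning would remove it.
toy_weight = [[0.5, -0.1, 0.3, 0.05]]
toy_calibration = [[1.0, 10.0, 1.0, 1.0], [1.0, 10.0, 1.0, 1.0]]
toy_pruned = wanda_style_prune(toy_weight, toy_calibration, sparsity=0.5)
```

Like the expert-level phase, this scoring needs only forward activations from calibration data, with no gradients or weight updates.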

Inference Optimization

reduces GPU forward calls to O(1) for expert search

enables pruning on a single H100

preserves model quality at high sparsity, so fewer GPUs are needed at serving

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/thnkinbtfly/STUN

Data URLs

C4 dataset (used for calibration)

LM-Evaluation-Harness (evaluation)

Risks & Boundaries

Limitations

Final speedups depend on hardware support for unstructured sparsity; GPUs may not see full runtime gains today.

Evaluations use a fixed set of tasks and C4 calibration; behavior under heavily skewed or domain-shifted data is not fully characterized.

When Not To Use

If your target inference hardware cannot accelerate unstructured sparsity, end‑to‑end latency may not improve.

If you need strict guarantees on worst-case behavior under distribution shift and cannot run further validation.

Failure Modes

At extreme sparsity levels, the unstructured pruning phase can still collapse generation quality if it is not tuned.

If router weights do not reflect true expert behavior for your data, clustering may remove useful experts.
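One way to probe the router-mismatch failure mode is to measure how often each expert is actually selected on your own data before trusting a pruning plan. The sketch below assumes you can collect per-token router logits for one MoE layer; the function names, the top-k selection rule, and the threshold are illustrative assumptions, not part of STUN.

```python
from collections import Counter


def expert_usage(router_logits, top_k):
    """Share of routing slots each expert receives.

    router_logits: list of per-token logit vectors, one entry per expert
    top_k:         number of experts activated per token
    """
    num_experts = len(router_logits[0])
    counts = Counter()
    for logits in router_logits:
        ranked = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])
    total_slots = len(router_logits) * top_k
    return {e: counts[e] / total_slots for e in range(num_experts)}


def heavily_used_pruned(usage, pruned_experts, threshold):
    """Experts scheduled for removal whose usage share meets the threshold."""
    return [e for e in pruned_experts if usage[e] >= threshold]


# Toy data: 3 tokens, 3 experts, top-1 routing.
toy_logits = [[2.0, 1.0, 0.1], [0.2, 3.0, 0.1], [1.5, 0.3, 0.2]]
toy_usage = expert_usage(toy_logits, top_k=1)
flagged = heavily_used_pruned(toy_usage, pruned_experts=[1, 2], threshold=0.25)
```

If an expert marked for removal shows up in `flagged` on in-domain data, that is a signal to re-examine the clustering for that layer before deploying.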

Core Entities

Models

Snowflake Arctic (480B, 128 experts), Mixtral-8x7B, Mixtral-8x22B

Metrics

Accuracy, LM-eval average

Datasets

GSM8K, ARC-challenge, ARC-easy, HellaSwag, MMLU, BoolQ, OpenBookQA, RTE, WinoGrande, C4

Benchmarks

LM-Evaluation-Harness
