UNCURL: cluster-and-merge pruning for Mixture-of-Experts that cuts experts at inference while keeping task accuracy

September 2, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Soumajyoti Sarkar, Leonard Lausen, Volkan Cevher, Sheng Zha, Thomas Brox, George Karypis

Why It Matters For Business

If you plan to deploy SMoE models on multi-GPU setups, pretraining with more experts can improve accuracy but raises inference latency and memory cost. UNCURL offers a practical offline way to reduce the expert count for a given task while often preserving accuracy.

Summary TLDR

The paper studies when large sparse Mixture-of-Experts (SMoE) language models are worth pretraining if you must later reduce the number of experts for inference. It shows that naive pruning by expert activation frequency damages task accuracy. The authors introduce UNCURL, an offline method that clusters experts per layer and merges them after permutation alignment. UNCURL often lets you prune large SMoEs by a factor of 2 (and sometimes 4) while keeping or improving SuperGLUE task accuracy relative to an equivalent smaller SMoE trained from scratch. However, how far you can prune depends on the pretrained expert count, and large expert counts still raise inference latency because of inter-GPU (All2All) traffic.

Problem Statement

Large SMoE models raise training capacity cheaply but force many experts into memory at inference, increasing inter-GPU communication and latency. Practitioners must choose how many experts to pretrain if downstream inference will be memory-constrained and whether post-training task-specific pruning can recover the benefits of larger SMoEs without retraining from scratch.

Main Contribution

Show controlled tradeoffs between expert count, pretraining benefits, and inference latency for SMoEs built on a 354M backbone and scaled to 1B–13B parameters (8–128 experts).

Demonstrate naïve one-shot pruning by expert activation frequency loses performance across SuperGLUE tasks.

Propose UNCURL: per-layer spectral clustering on router logits, permutation alignment, and activation-weighted expert merging to produce fewer experts offline.

Empirically show UNCURL can often prune larger SMoEs by ×2 (and in some cases ×4) while matching or exceeding equivalent smaller SMoEs on SuperGLUE.

Clarify pruning limits: heavy reductions on very large expert counts (e.g., 128→8) degrade performance.
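The clustering step in the UNCURL contribution above works from router logits. As a minimal sketch of what the input to such a clustering could look like, the snippet below builds an expert-by-expert similarity matrix from per-token router logits. Cosine similarity, the function name `expert_similarity`, and the toy numbers are assumptions for illustration; the summary does not state which similarity measure the paper uses, and the spectral clustering itself is not shown.

```python
import math

def expert_similarity(router_logits):
    """Cosine similarity between experts' router-logit profiles.

    router_logits: list of per-token rows, one logit per expert.
    Returns a symmetric Z x Z matrix, where Z is the number of experts.
    Cosine similarity is an assumption, not the paper's stated measure.
    """
    num_experts = len(router_logits[0])
    # Column z holds expert z's logit for every token.
    columns = [[row[z] for row in router_logits] for z in range(num_experts)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    sim = [[0.0] * num_experts for _ in range(num_experts)]
    for i in range(num_experts):
        for j in range(i, num_experts):
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            denom = norms[i] * norms[j]
            value = dot / denom if denom > 0 else 0.0
            sim[i][j] = sim[j][i] = value
    return sim

# Toy logits for 4 tokens and 3 experts (made-up numbers).
toy_logits = [
    [2.0, 1.9, -1.0],
    [1.0, 1.1, -0.5],
    [-1.0, -0.9, 2.0],
    [0.5, 0.4, -0.2],
]
similarity = expert_similarity(toy_logits)
```

In this toy input, experts 0 and 1 receive nearly identical logits across tokens, so they would be natural candidates for the same cluster, while expert 2 would not.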

Key Findings

Naïve one-shot pruning by expert activation frequency hurts performance across tasks.

Numbers: Pruned 354M+(32e→8e) scores lower than 354M+32e on many tasks (Table 1/2)

UNCURL (cluster-merge) often preserves or improves accuracy versus smaller scratch models after pruning.

Numbers: 354M+(32e→8e) outperforms 354M+8e (e.g., BoolQ +1.78 pts; RTE +0.45 pts) on evaluated tasks

There is a practical pruning threshold that depends on pretrained expert count.

Numbers: Prune factor ≈2 works reliably for 64/128-expert models; factor ≈4 works for 32→8 in experiments

More experts increase inference latency due to All2All communication dominating compute.

Numbers: Inference latency increases monotonically with expert count; All2All dominates expert compute in profiling (Fig. 2a, Fig.

Very aggressive pruning of extremely large SMoEs can fail.

Numbers: 354M+(128e→8e) underperforms smaller models on nearly all tasks (Table 2)

Results

Accuracy (BoolQ, matching the +1.78 pts gap reported in Key Findings)

Value: 69.13 (354M+(32e→8e) after UNCURL)

Baseline: 67.35 (354M+8e trained from scratch)

Accuracy (RTE, matching the +0.45 pts gap reported in Key Findings)

Value: 66.48 (354M+(32e→8e) after UNCURL)

Baseline: 66.03 (354M+8e trained from scratch)

Max observed task improvement vs dense

Value: BoolQ +7.3 pts; RTE +6.8 pts

Baseline: 354M dense

Inference latency trend

Value: Monotonic increase with experts (profiling across 8 A10 GPUs)

Baseline: lower expert counts

Who Should Care

What To Try In 7 Days

Profile inference latency of your SMoE across expert counts to measure All2All cost.

Run UNCURL clustering on a pretrained SMoE checkpoint with a target pruning factor of 2 and finetune on task data.

Compare pruned model vs same-size model trained from scratch on a small validation set (accuracy and latency).
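For the profiling step in the list above, a small timing harness is usually enough to start. The sketch below measures median wall-clock latency of a callable with warmup runs. `measure_latency` and `fake_forward` are hypothetical names, and `fake_forward` is only a CPU stand-in whose work grows with expert count; in a real setup you would pass a function that runs your SMoE on a fixed batch.

```python
import statistics
import time

def measure_latency(run_inference, warmup=2, repeats=5):
    """Median wall-clock seconds for one call of run_inference()."""
    for _ in range(warmup):
        run_inference()  # discard warmup runs (caches, lazy init)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def fake_forward(num_experts):
    """Hypothetical stand-in for a real SMoE forward pass."""
    def run():
        total = 0
        for i in range(num_experts * 2000):
            total += i
        return total
    return run

latency_by_experts = {
    n: measure_latency(fake_forward(n)) for n in (8, 32, 64, 128)
}
```

When timing GPU work, the device typically has to be synchronized before each clock read, otherwise the timer only captures kernel launch overhead rather than the compute and communication you want to see.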

Agent Features

Memory

  • router logits used as pruning signal

Tool Use

  • DeepSpeed-MoE

Frameworks

  • UNCURL (cluster-merge)
  • Spectral clustering
  • k-means
  • Permutation alignment (Hungarian)

Architectures

  • MoE

Optimization Features

Token Efficiency

  • top-1 routing keeps per-token FLOPs equal to dense
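A minimal sketch of top-1 routing, assuming a softmax router whose highest-probability expert receives the token: because each token passes through exactly one expert feed-forward block, the expert compute per token stays close to that of a dense layer of the same width. The function names and example logits are illustrative and not taken from the paper.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_route(router_logits):
    """Return (expert_index, gate_probability) for each token."""
    decisions = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        decisions.append((expert, probs[expert]))
    return decisions

# Two tokens, three experts (made-up logits).
routes = top1_route([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
```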

Infra Optimization

  • distribute experts across multiple GPUs (expert parallelism)

Model Optimization

  • expert merging (activation-weighted average)
  • permutation alignment before merge
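The two steps above can be sketched on a toy example. Hidden units of a feed-forward expert can be reordered without changing its function, so two experts may hold matching units in different positions; averaging them unaligned mixes unrelated units. The sketch below aligns one expert to a reference and then takes an activation-weighted average. It uses brute force over permutations as a stand-in for the Hungarian algorithm, which only works for tiny hidden sizes, and the dict layout, function names, and activation counts are assumptions made for this illustration.

```python
from itertools import permutations

def align_to_reference(reference, expert):
    """Reorder expert's hidden units to best match the reference expert.

    Each expert is a dict with "w_in" and "w_out", both lists holding one
    vector per hidden unit. Brute force stands in for the Hungarian
    algorithm here and is only feasible for very small hidden sizes.
    """
    hidden = len(reference["w_in"])

    def match_score(perm):
        # Sum of dot products between matched input-weight vectors.
        return sum(
            sum(a * b for a, b in zip(reference["w_in"][i], expert["w_in"][p]))
            for i, p in enumerate(perm)
        )

    best = max(permutations(range(hidden)), key=match_score)
    # Apply the same permutation to input and output weights.
    return {
        "w_in": [expert["w_in"][p] for p in best],
        "w_out": [expert["w_out"][p] for p in best],
    }

def merge_experts(experts, activation_counts):
    """Average aligned experts, weighted by how often each was activated."""
    total = float(sum(activation_counts))
    weights = [c / total for c in activation_counts]
    merged = {}
    for key in ("w_in", "w_out"):
        hidden = len(experts[0][key])
        width = len(experts[0][key][0])
        merged[key] = [
            [sum(w * e[key][h][d] for w, e in zip(weights, experts))
             for d in range(width)]
            for h in range(hidden)
        ]
    return merged

expert_a = {"w_in": [[1.0, 0.0], [0.0, 1.0]], "w_out": [[1.0], [2.0]]}
# Same function as expert_a, with its two hidden units stored in swapped order.
expert_b = {"w_in": [[0.0, 1.0], [1.0, 0.0]], "w_out": [[2.0], [1.0]]}
aligned_b = align_to_reference(expert_a, expert_b)
merged = merge_experts([expert_a, aligned_b], activation_counts=[300, 100])
```

Here the aligned merge reproduces the shared function exactly, whereas merging `expert_a` with the unaligned `expert_b` would blend the two different hidden units together. For realistic hidden sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` is the usual replacement for the brute-force search.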

System Optimization

  • expert parallelism across GPUs
  • All2All communication dominates latency
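To make the communication pattern concrete, the sketch below counts how many tokens must leave their home GPU under expert parallelism. It assumes experts are split into equal contiguous blocks across GPUs and that each token is routed to one expert; the function name and the toy assignments are hypothetical. It counts tokens only and does not model interconnect bandwidth or actual latency.

```python
def cross_gpu_tokens(token_home_gpu, token_expert, num_experts, num_gpus):
    """Count tokens whose chosen expert lives on a different GPU.

    Assumes experts are split into equal contiguous blocks across GPUs.
    """
    experts_per_gpu = num_experts // num_gpus
    crossing = 0
    for home, expert in zip(token_home_gpu, token_expert):
        expert_gpu = expert // experts_per_gpu
        if expert_gpu != home:
            crossing += 1
    return crossing

# Six tokens on four GPUs, eight experts (two per GPU); made-up routing.
token_home_gpu = [0, 0, 1, 1, 2, 3]
token_expert = [0, 5, 2, 3, 4, 1]
crossing = cross_gpu_tokens(token_home_gpu, token_expert, num_experts=8, num_gpus=4)
```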

Training Optimization

  • top-1 routing
  • load-balancing auxiliary loss
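The summary does not say which load-balancing loss the paper uses. As one common example, the sketch below follows the Switch Transformer formulation: the loss is alpha times the number of experts times the sum over experts of (fraction of tokens dispatched to the expert) times (mean router probability for the expert). The value `alpha=0.01` and the toy probabilities are assumptions for illustration.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: list of per-token probability rows, one entry per expert.
    This formulation is an assumption; the paper's exact loss may differ.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatch_fraction = [0.0] * num_experts  # share of tokens sent to each expert
    mean_prob = [0.0] * num_experts          # average router probability per expert
    for probs in router_probs:
        chosen = max(range(num_experts), key=probs.__getitem__)
        dispatch_fraction[chosen] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    return alpha * num_experts * sum(
        f * p for f, p in zip(dispatch_fraction, mean_prob)
    )

balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
skewed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])
```

The loss is smallest when tokens and router probability are spread evenly over experts, so the skewed example, where both tokens pick expert 0, is penalized more than the balanced one.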

Inference Optimization

  • reduce per-layer expert count to lower All2All traffic
  • selective expert pruning per task

Reproducibility

Data Urls

  • CC100 (Common Crawl English)
  • mC4 (English)
  • FLAN
  • SuperGLUE

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Does not analyze expert specialization semantically or by domain.
  • UNCURL has O(Z^2·|T|) similarity cost and O(Z^3) clustering cost per layer, which becomes expensive for many experts.
  • Works offline and requires task data; not a drop-in live routing optimization.
  • Very aggressive pruning of very large expert counts (e.g., 128→8) degrades performance.
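To get a feel for how the per-layer cost terms in the limitations above grow, the following back-of-the-envelope sketch plugs numbers into them. It assumes Z is the number of experts in a layer and |T| is the number of task tokens used for clustering, which the summary does not define explicitly. The function name and the token count are hypothetical, and the results are unitless operation counts with constants ignored.

```python
def uncurl_cost_terms(num_experts, num_tokens):
    """Per-layer operation counts from the stated complexity terms.

    Assumes Z = experts per layer and |T| = task tokens; constants ignored.
    """
    similarity_ops = num_experts ** 2 * num_tokens  # O(Z^2 * |T|)
    clustering_ops = num_experts ** 3               # O(Z^3)
    return {"similarity": similarity_ops, "clustering": clustering_ops}

cost_32 = uncurl_cost_terms(num_experts=32, num_tokens=10_000)
cost_128 = uncurl_cost_terms(num_experts=128, num_tokens=10_000)
similarity_growth = cost_128["similarity"] / cost_32["similarity"]
clustering_growth = cost_128["clustering"] / cost_32["clustering"]
```

Under these assumptions, going from 32 to 128 experts multiplies the similarity term by 16 and the clustering term by 64 at a fixed token count.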

When Not To Use

  • When you lack labeled task data to cluster router logits.
  • When extreme memory constraints force pruning beyond the empirical safe ratios.
  • When merge-time compute cost is prohibitive for your workflow.

Failure Modes

  • Naïve frequency-based expert removal irrecoverably drops routed capacity.
  • Over-reduction (large pruning ratio) causes accuracy collapse.
  • Permutation alignment errors can cause poor merges if not solved correctly.
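The first failure mode above can be illustrated with a minimal sketch of the naive baseline: keep the most frequently activated experts and drop the rest. The function name and the toy routing assignments are hypothetical. The point is that tokens routed to the dropped experts lose their expert entirely, whereas merging folds those experts into the ones that remain.

```python
from collections import Counter

def frequency_prune(token_expert, num_keep):
    """Keep the num_keep most-activated experts.

    Returns the kept expert ids and the fraction of tokens whose
    original expert was removed.
    """
    counts = Counter(token_expert)
    kept = sorted(expert for expert, _ in counts.most_common(num_keep))
    orphaned = sum(c for expert, c in counts.items() if expert not in kept)
    return kept, orphaned / len(token_expert)

# Top-1 expert choice for ten tokens over four experts (made-up data).
assignments = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
kept, orphaned_fraction = frequency_prune(assignments, num_keep=2)
```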

Core Entities

Models

  • 354M (dense GPT-2 backbone)
  • 354M+8e
  • 354M+32e
  • 354M+64e
  • 354M+128e
  • Switch Transformer
  • GLaM

Metrics

  • Accuracy
  • Validation loss
  • Inference latency (wall-clock)

Datasets

  • CC100 (English)
  • mC4 (English)
  • FLAN (instruction data)
  • SuperGLUE (subset)

Benchmarks

  • SuperGLUE (validation subset used)

Context Entities

Models

  • MC-SMoE merging (Li et al. 2023)
  • ModuleFormer
  • Model soups / merging literature