Cut expert count in SMoE models up to 75% using gradient-free pruning plus weight merging

July 1, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Links

Abstract / PDF

Why It Matters For Business

EEP lowers GPU memory and inference cost for SMoE LLMs and can improve accuracy on specific downstream tasks, making large MoE models more affordable to deploy.

Summary TLDR

The paper introduces EEP, a gradient-free evolutionary search that prunes and then merges experts inside sparse Mixture-of-Experts (SMoE) language models. EEP searches for which experts to remove (pruning), then uses weight merging (a kind of parameter averaging) to transfer knowledge from the removed experts to the survivors. On Mixtral and Qwen MoE models, EEP cuts total expert counts (e.g., 8→2) and active experts per token (top-2→top-1) while keeping or improving downstream accuracy, reducing GPU memory by up to ~71% and speeding up inference by up to ~1.63× in some settings. The search needs only inference passes, and the code is publicly available.

Problem Statement

SMoE models activate a small subset of experts per token but still carry very large total parameter counts and GPU memory needs. Existing expert-pruning methods either lose accuracy, need heavy fine-tuning (large GPU cost), or cannot aggressively cut experts. The practical problem is how to reduce both model memory and runtime cost for SMoE models while keeping or improving downstream task accuracy, using only inference-capable hardware.

Main Contribution

EEP: a gradient-free evolutionary procedure that searches discrete pruning patterns and continuous merging weights for SMoE expert pruning and merging.

Expert merging via a learnable merging matrix to transfer knowledge from removed experts into retained ones without backpropagation.

Empirical results on Mixtral and Qwen MoE models showing strong accuracy retention or improvement after aggressive pruning, plus profiling of GPU memory and inference speed gains.

A use-case split: (1) reduce total experts to lower memory, (2) reduce active experts per token to speed inference, and (3) combine both.
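The merging matrix in the second contribution can be pictured as a small linear map over expert weights. The sketch below is a minimal illustration, assuming each expert is a flat weight vector and that merging is a plain linear combination of the original experts. The names `merge_experts` and `prune_only_matrix` are hypothetical and do not come from the EEP code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def merge_experts(experts: List[Vector], merge_matrix: Matrix) -> List[Vector]:
    """Build one retained expert per row of merge_matrix.

    experts: flattened weights of the original experts (equal lengths).
    merge_matrix: shape (num_kept, num_original); row i holds the mixing
    coefficients for retained expert i. (Illustrative assumption.)
    """
    dim = len(experts[0])
    merged = []
    for row in merge_matrix:
        if len(row) != len(experts):
            raise ValueError("each row needs one coefficient per original expert")
        merged.append(
            [sum(c * e[j] for c, e in zip(row, experts)) for j in range(dim)]
        )
    return merged


def prune_only_matrix(keep: List[int], num_original: int) -> Matrix:
    """Pruning as a special case: each row is one-hot on a kept expert."""
    return [[1.0 if j == k else 0.0 for j in range(num_original)] for k in keep]


# Four toy experts with two weights each.
experts = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# Prune only: keep experts 0 and 2 unchanged.
pruned = merge_experts(experts, prune_only_matrix(keep=[0, 2], num_original=4))

# Prune + merge: fold half of removed expert 3 into retained expert 0.
merged = merge_experts(experts, [[1.0, 0.0, 0.0, 0.5], [0.0, 0.0, 1.0, 0.0]])
```

In this picture, prune-only is just a merging matrix with one-hot rows, so the merge phase can start from the pruning result and adjust coefficients from there.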

Key Findings

EEP can cut the total experts from 8 to 2 (72% parameter drop) while keeping comparable task performance.

Numbers: 72% parameter reduction (8→2)
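A rough back-of-the-envelope check of the 72% figure is sketched below. The layer sizes and the total parameter count are assumptions taken from the commonly published Mixtral 8×7B configuration (hidden size 4096, FFN size 14336, 32 layers, three weight matrices per expert, about 46.7B parameters in total); they are not stated in this summary.

```python
# Assumed Mixtral 8x7B configuration values (not given in this summary).
HIDDEN = 4096
FFN = 14336
LAYERS = 32
EXPERTS = 8
KEPT = 2
TOTAL_PARAMS = 46.7e9  # approximate total parameter count (assumption)

# Each expert is assumed to hold three HIDDEN x FFN projection matrices.
per_expert = 3 * HIDDEN * FFN
removed = per_expert * (EXPERTS - KEPT) * LAYERS

reduction = removed / TOTAL_PARAMS
print(f"Removed parameters: {removed / 1e9:.1f}B ({reduction:.1%} of the model)")
```

Under these assumptions the removed experts account for roughly 72% of all parameters, which is consistent with the reported number. Attention and embedding weights are untouched, which is why removing 75% of the experts removes slightly less than 75% of the parameters.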

On SQuAD with Mixtral 8×7B-Instruct, pruning plus merging raised accuracy from 53.4% (full model) to 80.6%.

Numbers: SQuAD: 53.4% (Full) → 75.2% (Prune Only) → 80.6% (Prune+Merge)

Reducing experts cuts GPU memory and speeds up inference: keeping 4 experts saved 47% of GPU memory, and the combined setting (4 total experts, 1 active) achieved a 1.41× overall speedup.

Numbers: 47% GPU memory reduction; 1.41× speedup (4 total, 1 active)

EEP outperforms common baselines (random, frequency, soft-activation, NAEE) on average accuracy after pruning.

Numbers: Mixtral 8×7B, Num=4, Avg: EEP Prune+Merge 74.2 vs Full 62.4 (+11.8)

EEP is gradient-free but involves a multi-stage evolutionary search; the authors used 40 pruning iterations followed by 160 merging iterations in their experiments.

Numbers: 40 (prune) + 160 (merge) search iterations
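The two-stage search described above can be sketched as a simple mutate, evaluate, keep-if-better loop. This is a generic greedy evolutionary loop written for illustration, not the authors' algorithm: the population size, mutation scheme, `evolve`, `one_hot`, and `toy_fitness` are all hypothetical. In a real run, `fitness` would be task accuracy measured with inference-only passes on calibration data; here it is a toy function so the example runs anywhere.

```python
import random
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def one_hot(keep: List[int], n: int) -> Matrix:
    """Prune-only matrix: row i selects kept expert keep[i]."""
    return [[1.0 if j == k else 0.0 for j in range(n)] for k in keep]


def evolve(fitness: Callable[[Matrix], float], n_experts: int, n_keep: int,
           prune_iters: int = 40, merge_iters: int = 160,
           pop: int = 8, sigma: float = 0.1, seed: int = 0) -> Tuple[Matrix, float]:
    if not 0 < n_keep < n_experts:
        raise ValueError("n_keep must be between 1 and n_experts - 1")
    rng = random.Random(seed)

    # Phase 1 (discrete): search over which experts to keep.
    keep = sorted(rng.sample(range(n_experts), n_keep))
    best_score = fitness(one_hot(keep, n_experts))
    for _ in range(prune_iters):
        for _ in range(pop):
            cand = list(keep)
            unused = [e for e in range(n_experts) if e not in cand]
            # Mutation: swap one kept expert for an unused one.
            cand[rng.randrange(n_keep)] = rng.choice(unused)
            cand.sort()
            score = fitness(one_hot(cand, n_experts))
            if score > best_score:
                keep, best_score = cand, score

    # Phase 2 (continuous): perturb merging coefficients with Gaussian noise.
    best = one_hot(keep, n_experts)
    for _ in range(merge_iters):
        for _ in range(pop):
            cand = [[c + rng.gauss(0.0, sigma) for c in row] for row in best]
            score = fitness(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


# Toy stand-in for "accuracy": closeness to a made-up ideal merging matrix.
TARGET = [[1.0, 0.0, 0.0, 0.5], [0.0, 0.0, 1.0, 0.0]]


def toy_fitness(m: Matrix) -> float:
    return -sum((a - b) ** 2 for r, t in zip(m, TARGET) for a, b in zip(r, t))


matrix, score = evolve(toy_fitness, n_experts=4, n_keep=2)
prune_only_score = toy_fitness(one_hot([0, 2], 4))
print(f"prune-only: {prune_only_score:.3f}, prune+merge: {score:.3f}")
```

Every candidate costs one fitness evaluation, so with real models the budget is roughly `(prune_iters + merge_iters) * pop` evaluation runs. That is the cost behind the "many inference evaluations" limitation.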

Results

Accuracy (SQuAD, Mixtral 8×7B-Instruct)

Value: Full 53.4% → EEP Prune Only 75.2% → EEP Prune+Merge 80.6%

Baseline: Full model

Average accuracy (Mixtral 8×7B, Num=4)

Value: Full 62.4% → EEP (Num=4, Prune+Merge) 74.2%

Baseline: Full model

Parameter / expert reduction

Value: 8→2 experts (per block)

Baseline: 8 experts

GPU memory (Mixtral 8×7B, SQuAD, batch=256, on 2×A100)

Value: Full 88.6 GB → Num=4 46.6 GB → Num=2 25.6 GB

Baseline: Full model
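The percentage reductions quoted elsewhere in this summary follow directly from the measured GPU memory figures. The short calculation below uses only the three values reported in this row.

```python
full_gb = 88.6                   # full Mixtral 8x7B
budgets = {4: 46.6, 2: 25.6}     # experts kept -> measured GPU memory (GB)

# Fraction of GPU memory saved relative to the full model.
reductions = {n: 1 - gb / full_gb for n, gb in budgets.items()}

for n, frac in reductions.items():
    print(f"Num={n}: {frac:.1%} less GPU memory")
```

This gives about 47% savings when keeping 4 experts and about 71% when keeping 2, matching the figures reported in the summary and key findings.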

Inference speed (prefill/overall)

Value: Up to 1.63× prefill speed (8→1 active, BS=256); combined 1.41× when 4 total + 1 active

Baseline: Full model

Who Should Care

What To Try In 7 Days

Run EEP prune-only (40 iters) on one downstream task to find which experts can be removed.

Add the merge phase (160 iters) and compare accuracy against the full model on a validation set.

Profile memory and latency with 4→2→1 expert budgets to find a practical cost/accuracy point.

Agent Features

Memory

  • Reduces expert parameters and peak GPU memory

Tool Use

  • Evolutionary search
  • Weight merging (expert merging)

Frameworks

  • OpenCompass (eval)
  • EEP code (GitHub)

Architectures

  • SMoE

Optimization Features

Token Efficiency

  • Fewer active experts lowers per-token compute and can speed prefill

Infra Optimization

  • Better parallelism after pruning increases throughput

Model Optimization

  • Structured expert pruning
  • Expert weight merging

System Optimization

  • Reduce GPU memory to fit larger models on fewer devices

Training Optimization

  • Gradient-free tuning via evolutionary strategy
  • Inference-only evaluation during search

Inference Optimization

  • Reduce active experts per token (top-2 → top-1)
  • Lower memory pressure → larger batch sizes
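To make the top-2 → top-1 change concrete, the sketch below shows a toy router that picks the k highest-scoring experts for one token and normalizes their weights. It assumes softmax normalization over the selected logits, which is one common SMoE routing convention; the function name `route` and the example logits are hypothetical.

```python
import math
from typing import List, Tuple


def route(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert index, gate weight) pairs for the top-k experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (assumed convention).
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


router_logits = [0.1, 2.0, -1.0, 1.0]  # one token, four experts (made-up values)

top2 = route(router_logits, k=2)  # two expert forward passes for this token
top1 = route(router_logits, k=1)  # one expert forward pass for this token
print(top2, top1)
```

With k=1 each token runs through a single expert feed-forward block instead of two, which is where the per-token compute saving comes from. Whether accuracy holds up at k=1 is what the search has to verify on the target task.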

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Search requires many inference evaluations (authors used 40+160 iterations), which can be costly.
  • Performance gains are task- and model-dependent; not every task improves after pruning.
  • EEP optimizes with calibration/training subsets; results depend on representativeness of that data.

When Not To Use

  • On non-MoE (dense) models.
  • When you cannot afford the evolutionary search budget for calibration.
  • Where worst-case end-to-end robustness must be preserved without any tuning.

Failure Modes

  • Poor calibration data can lead to suboptimal pruning patterns and accuracy collapse.
  • Random pruning is unstable under high sparsity (high variance as shown in random baselines).
  • Negative merging coefficients or bad merges can harm specific tasks.

Core Entities

Models

  • Mixtral 8×7B-Instruct
  • Mixtral 8×22B-Instruct
  • Qwen1.5-MoE-A2.7B-Chat
  • Qwen2-MoE-A14B-Chat

Metrics

  • Accuracy
  • GPU memory (GB)
  • Inference speedup (×)
  • Avg. active experts per token

Datasets

  • SQuAD
  • DROP
  • SuperGLUE subsets
  • MMLU

Benchmarks

  • SuperGLUE
  • MMLU
  • SQuAD

Context Entities

Models

  • Grok-1
  • DBRX