Drop MoE layers or blocks + quantize experts to cut memory and run time with small accuracy loss

June 4, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

3

Authors

Shwai He, Daize Dong, Liang Ding, Ang Li

Links

Abstract / PDF

Why It Matters For Business

Coarse structural compression plus 4-bit quantization can cut inference cost and memory enough to run large MoE models on cheaper GPUs while losing only a small fraction of task accuracy.

Summary TLDR

The paper studies ways to compress Mixture-of-Experts (MoE) language models. It finds that coarse-grained removal of entire MoE layers (Layer Drop) or whole transformer blocks (Block Drop) gives much larger speed and memory gains than removing individual experts (Expert Drop). Combining Block Drop with 4-bit quantization (AWQ) yields a 6.05× decoding speedup and reduces memory to ~20 GB while keeping over 92% of performance on Mixtral-8×7B. Expert Slimming (quantization, pruning) helps further; short post-finetuning can recover most lost accuracy.

Problem Statement

MoE models store many expert networks in every MoE layer and require cross-expert communication. This raises GPU memory use and inference latency and makes deployment costly. The goal is compression methods that cut memory and compute while keeping most of the model's quality.

Main Contribution

Extend expert-level pruning to coarse-grained surgical removal: Layer Drop (remove MoE layers) and Block Drop (remove full transformer blocks).

Show that Layer/Block Drop often preserves quality better than Expert Drop while cutting compute and communication.

Integrate Expert Slimming (quantization/pruning) with Expert Trimming to achieve large speedups and memory reductions.

Provide empirical results on Mixtral-8×7B and DeepSeek-MoE-16B and release code.
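The difference between Layer Drop and Block Drop can be illustrated with a toy model. This is a minimal sketch, not the paper's released code: the `Block` class, the constant-valued sub-layers, and the residual layout are assumptions chosen only to show which parts each method removes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


@dataclass
class Block:
    """Toy transformer block: an attention sub-layer plus an optional MoE sub-layer."""
    attention: SubLayer
    moe: Optional[SubLayer]  # None means the MoE layer has been dropped


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def forward(blocks: List[Block], x: Vector) -> Vector:
    for block in blocks:
        x = add(x, block.attention(x))   # residual connection around attention
        if block.moe is not None:
            x = add(x, block.moe(x))     # residual connection around the MoE layer
    return x


def layer_drop(blocks: List[Block], indices: Set[int]) -> List[Block]:
    """Layer Drop: remove only the MoE sub-layer of the selected blocks."""
    return [Block(b.attention, None) if i in indices else b
            for i, b in enumerate(blocks)]


def block_drop(blocks: List[Block], indices: Set[int]) -> List[Block]:
    """Block Drop: remove the selected blocks entirely (attention and MoE)."""
    return [b for i, b in enumerate(blocks) if i not in indices]


# Hypothetical sub-layers that add a constant, so the effect of each drop is easy to read.
def toy_attention(x: Vector) -> Vector:
    return [1.0 for _ in x]


def toy_moe(x: Vector) -> Vector:
    return [2.0 for _ in x]


blocks = [Block(toy_attention, toy_moe) for _ in range(3)]
full = forward(blocks, [0.0])                                # [9.0]
after_layer_drop = forward(layer_drop(blocks, {1}), [0.0])   # [7.0]
after_block_drop = forward(block_drop(blocks, {1}), [0.0])   # [6.0]
print(full, after_layer_drop, after_block_drop)
```

In this toy, Layer Drop keeps the attention sub-layer of block 1 (and its KV cache in a real model), while Block Drop removes both sub-layers.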

Key Findings

Block Drop + 4-bit quantization produces major runtime and memory reductions while keeping most accuracy.

Numbers: 6.05× speedup; memory 20.0 GB; >92% performance (Mixtral-8×7B)

Dropping a fraction of experts (Expert Drop) often causes a large accuracy drop while yielding little speedup.

Numbers: 25% of experts dropped → 23% MMLU drop; 12.5% of experts dropped → <1% speedup

4-bit post-training quantization preserves accuracy well and cuts memory significantly.

Numbers: Quantized models keep >98% of performance and reduce memory to <30% of the original (Mixtral: 87.7 GB → 24.4 GB); AWQ speedups: Mixtral ×5.08, DeepSeek ×3.16

Short post-finetuning largely recovers accuracy lost to coarse compression.

Numbers: Block Drop gap reduced from 5.5% → 0.6% after 3-epoch finetune (DeepSeek-MoE-16B)
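One plausible reading of the Expert Drop finding is that top-k routing still activates the same number of experts per token after some experts are removed, so stored parameters shrink but per-token expert compute does not. The sketch below shows that arithmetic under stated assumptions: 8 experts with top-2 routing (as in Mixtral-8×7B), a unit-sized expert, and a hypothetical `expert_cost` helper. It ignores attention, the router, and kernel effects, so it is not a prediction of measured speedup.

```python
from typing import Tuple


def expert_cost(num_experts: int, top_k: int, params_per_expert: float) -> Tuple[float, float]:
    """Return (stored, active) expert parameters for one MoE layer.

    stored: parameters that must sit in memory (all experts).
    active: parameters actually used per token (only the top-k experts).
    """
    stored = num_experts * params_per_expert
    active = min(top_k, num_experts) * params_per_expert
    return stored, active


# Assumed configuration: 8 experts, top-2 routing, expert size normalized to 1.0.
stored_before, active_before = expert_cost(num_experts=8, top_k=2, params_per_expert=1.0)

# Expert Drop at 25%: two of the eight experts are removed, routing stays top-2.
stored_after, active_after = expert_cost(num_experts=6, top_k=2, params_per_expert=1.0)

print(f"stored expert params: {stored_after / stored_before:.0%} of original")   # 75%
print(f"active expert params: {active_after / active_before:.0%} of original")   # 100%
```

Layer Drop and Block Drop differ here because they remove whole sub-layers from the forward pass, so the per-token work itself goes down.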

Results

Decoding speedup (combined recipe)

Value: 6.05×

Baseline: uncompressed Mixtral-8×7B

Memory footprint (combined recipe)

Value: 20.0 GB

Baseline: 87.7 GB

Relative performance retained

Value: >92%

Baseline: 100% (original Mixtral-8×7B)

Quantization speedup (AWQ 4-bit)

Value: ×5.08 (Mixtral), ×3.16 (DeepSeek)

Baseline: 16-bit model

Expert Drop impact

Value: 25% of experts removed → −23% MMLU

Baseline: no experts removed

Post-finetuning recovery

Value: gap reduced to 0.6%

Baseline: compressed DeepSeek before finetune
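The memory figures above can be cross-checked with a few lines of arithmetic. The snippet uses only numbers reported in this summary for Mixtral-8×7B; the variable names are illustrative.

```python
# Memory figures reported for Mixtral-8×7B (GB).
baseline_gb = 87.7    # uncompressed 16-bit model
quantized_gb = 24.4   # 4-bit quantization only
combined_gb = 20.0    # Block Drop + 4-bit quantization

quantized_fraction = quantized_gb / baseline_gb
combined_fraction = combined_gb / baseline_gb
reduction_factor = baseline_gb / combined_gb

print(f"quantized only: {quantized_fraction:.1%} of baseline")   # 27.8%
print(f"combined recipe: {combined_fraction:.1%} of baseline")   # 22.8%
print(f"combined recipe is {reduction_factor:.1f}x smaller")     # 4.4x
```

The 27.8% figure is consistent with the "<30%" claim for quantization alone, and the combined recipe brings the footprint to roughly 23% of the baseline.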

Who Should Care

What To Try In 7 Days

Measure layer/block hidden-state similarity with ~128 C4 samples to find redundant modules.

Apply 4-bit AWQ quantization to your MoE model and check memory and accuracy on a small benchmark.

Experiment with removing a few blocks (Block Drop) and validate on critical tasks; keep a rollback plan.
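The first step above can be sketched as follows. This is a minimal illustration, assuming the similarity measure is the cosine similarity between the hidden state entering a block and the hidden state leaving it, averaged over calibration samples; the helper names and the toy vectors are hypothetical, and in practice the hidden states would come from running the model on the ~128 C4 samples.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_blocks(hidden_states: List[List[List[float]]]) -> Tuple[List[int], Dict[int, float]]:
    """Rank blocks from most to least redundant.

    hidden_states[i][s] is the hidden vector of sample s entering block i;
    the last entry holds the outputs of the final block.
    A block whose output is nearly identical to its input changes little,
    so a high mean similarity marks it as a candidate for dropping.
    """
    scores = {}
    for i in range(len(hidden_states) - 1):
        sims = [cosine(h_in, h_out)
                for h_in, h_out in zip(hidden_states[i], hidden_states[i + 1])]
        scores[i] = sum(sims) / len(sims)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data: 2 samples, 3 blocks, 2-dimensional hidden states (made up for illustration).
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],     # input to block 0
    [[1.0, 1.0], [-1.0, 1.0]],    # block 0 rotates each vector by 45 degrees
    [[2.0, 2.0], [-2.0, 2.0]],    # block 1 only rescales, so direction is unchanged
    [[0.0, 2.0], [-2.0, -2.0]],   # block 2 changes direction the most
]

ranking, scores = rank_blocks(toy_states)
print(ranking)  # [1, 0, 2]: block 1 is the most redundant in this toy example
```

A ranking like this only suggests candidates; the validation on critical tasks in the third step is still what decides whether a block can go.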

Optimization Features

Infra Optimization

  • fit model on single 20–24GB GPU after compression

Model Optimization

  • Layer Drop
  • Block Drop
  • Expert Drop
  • Expert Slimming

System Optimization

  • reduce cross-expert communication by removing MoE layers/blocks

Training Optimization

  • post-finetuning (few epochs)

Inference Optimization

  • 4-bit quantization (AWQ/GPTQ)
  • drop entire transformer blocks to reduce attention and KV cache
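To make the 4-bit idea concrete, here is a minimal sketch of plain round-to-nearest quantization of one weight group. It is not AWQ or GPTQ, which add activation-aware scaling and error compensation on top; the function names and the example weights are assumptions for illustration only.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float, float]:
    """Asymmetric 4-bit quantization of one group: integer codes in [0, 15]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0   # fall back to 1.0 when all weights are equal
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


# Made-up weights for one group.
group = [-0.8, -0.1, 0.0, 0.3, 0.7]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(group, restored))

print(codes)
print(f"scale={scale:.4f}, max reconstruction error={max_error:.4f}")
```

Each weight is stored as a 4-bit code instead of a 16-bit float, which is 25% of the original size before the small overhead of per-group scales and offsets. The rounding error per weight is bounded by half the group's scale.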

Reproducibility

Data Urls

  • C4 (calibration/evaluation)
  • Alpaca (finetuning samples)
  • Pile (quantization calibration)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations limited to Mixtral-8×7B and DeepSeek-MoE-16B; results may vary for different MoE designs.
  • Similarity scores are computed from 128 samples and depend on feature choices; drop decisions could shift for other data distributions.
  • Speedups reported for decoding with seq len 2048 and specific quantizers/hardware; numbers may differ on other setups.

When Not To Use

  • When you cannot afford any accuracy loss and cannot perform post-finetuning.
  • When model architecture or serving pipeline cannot remove attention/KV-cache without breaking runtime logic.
  • When hardware cannot run low-bit quantized kernels efficiently.

Failure Modes

  • Routing collapse or degraded expert selection after partial Expert Drop.
  • Semi-structured pruning causing large accuracy drops and no practical speedups.
  • Over-aggressive Block Drop removing layers required for specific tasks or domains.

Core Entities

Models

  • Mixtral-8×7B
  • DeepSeek-MoE-16B
  • Mistral-7B

Metrics

  • decoding speedup (×)
  • memory (GB)
  • FLOPs (T)
  • Accuracy

Datasets

  • C4
  • Alpaca
  • Pile
  • LM-harness benchmark (ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande)

Benchmarks

  • ARC-C
  • BoolQ
  • HellaSwag
  • MMLU
  • OBQA
  • PIQA
  • RTE
  • WinoGrande