Drop MoE layers or blocks + quantize experts to cut memory and run time with small accuracy loss

June 4, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The experiments use two public MoE models and standard benchmarks; numbers are reproducible with the provided code but are measured on specific models, sequence lengths, and calibration data.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 40%

Authors

Shwai He, Daize Dong, Liang Ding, Ang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Coarse structural compression plus 4-bit quantization can cut inference cost and memory enough to run large MoE models on cheaper GPUs while losing only a small fraction of task accuracy.

Summary TLDR

The paper studies ways to compress Mixture-of-Experts (MoE) language models. It finds that coarse-grained removal of entire MoE layers (Layer Drop) or whole transformer blocks (Block Drop) gives much larger speed and memory gains than removing individual experts (Expert Drop). Combining Block Drop with 4-bit quantization (AWQ) yields a 6.05× decoding speedup and reduces memory to ~20 GB while keeping over 92% of performance on Mixtral-8×7B. Expert Slimming (quantization, pruning) helps further; short post-finetuning can recover most lost accuracy.

Problem Statement

MoE models keep many expert copies and require cross-expert communication. This raises GPU memory and inference latency and makes deployment costly. We need compression methods that cut memory and compute while keeping most model quality.

Main Contribution

Extend expert-level pruning to coarse-grained surgical removal: Layer Drop (remove MoE layers) and Block Drop (remove full transformer blocks).

Show that Layer/Block Drop often preserves quality better than Expert Drop while cutting compute and communication.
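The difference between the two coarse-grained operations can be shown with a minimal sketch. It assumes a toy block of the form "residual around attention, then residual around the MoE layer". The names `ToyBlock`, `layer_drop`, and `block_drop` are hypothetical, and the sub-layers are placeholder functions rather than the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

Vector = List[float]
SubLayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise residual addition."""
    return [x + y for x, y in zip(a, b)]


@dataclass
class ToyBlock:
    """Hypothetical transformer block: attention sub-layer, then MoE sub-layer."""
    attn: SubLayer
    moe: Optional[SubLayer]  # None means the MoE layer has been dropped

    def forward(self, x: Vector) -> Vector:
        x = add(x, self.attn(x))        # residual around attention
        if self.moe is not None:
            x = add(x, self.moe(x))     # residual around the MoE layer
        return x


def layer_drop(blocks: List[ToyBlock], indices: Set[int]) -> List[ToyBlock]:
    """Remove only the MoE sub-layer in the chosen blocks; attention stays."""
    return [ToyBlock(b.attn, None) if i in indices else b
            for i, b in enumerate(blocks)]


def block_drop(blocks: List[ToyBlock], indices: Set[int]) -> List[ToyBlock]:
    """Remove the chosen blocks entirely (attention and MoE together)."""
    return [b for i, b in enumerate(blocks) if i not in indices]


def run(blocks: List[ToyBlock], x: Vector) -> Vector:
    """Apply the remaining blocks in order."""
    for block in blocks:
        x = block.forward(x)
    return x


# Placeholder sub-layers with fixed scaling, standing in for real attention/MoE.
model = [ToyBlock(attn=lambda v: [0.1 * t for t in v],
                  moe=lambda v: [0.5 * t for t in v]) for _ in range(4)]

after_layer_drop = layer_drop(model, {1, 2})   # 4 blocks, 2 without MoE
after_block_drop = block_drop(model, {1, 2})   # 2 blocks remain
```

Because both operations rely on the residual path, the hidden state keeps its shape and the remaining blocks run unchanged. Block Drop also removes the attention computation of the dropped blocks, which Layer Drop keeps.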

Key Findings

Block Drop + 4-bit quantization produces major runtime and memory reductions while keeping most accuracy.

Numbers: 6.05× speedup; memory 20.0 GB; >92% performance (Mixtral-8×7B)

Practical Use: If you need to deploy a Mixtral-scale MoE on a single GPU, apply Block Drop and then 4-bit AWQ quantization to get a multi× speedup and fit on ~20 GB GPUs at a small quality cost.

Evidence Ref: Table 3, Section 7
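The quantization half of this recipe can be illustrated with a minimal sketch of group-wise 4-bit weight quantization. This is plain symmetric round-to-nearest, not AWQ itself: AWQ additionally rescales weight channels using activation statistics before quantizing, and that step is omitted here. The function names and the group size of 4 are assumptions chosen for readability.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale) for one weight group


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    qmin = -(2 ** (bits - 1))           # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # all-zero group: any scale works
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int = 4, bits: int = 4) -> List[Group]:
    """Quantize a weight row group by group, one scale per group."""
    return [quantize_group(row[start:start + group_size], bits)
            for start in range(0, len(row), group_size)]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights from codes and per-group scales."""
    restored: List[float] = []
    for codes, scale in groups:
        restored.extend(c * scale for c in codes)
    return restored


# Illustrative weights only; not taken from any real model.
row = [0.7, -0.35, 0.1, 0.0, 2.8, -1.4, 0.4, 0.05]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each weight is stored as a 4-bit code plus one shared scale per group, so the rounding error of a weight is bounded by half of its group's scale. Smaller groups lower the error but store more scales.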

Dropping a fraction of experts (Expert Drop) often hurts accuracy a lot but gives little speedup.

Numbers: 25% of experts dropped → 23% MMLU drop; 12.5% of experts dropped → <1% speedup

Practical Use: Avoid relying solely on expert-level pruning for latency; it saves memory but not inference time and can sharply reduce task accuracy.

Evidence Ref: Section 5, Figure 3
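One way to see why Expert Drop saves memory but little time is a toy router, sketched below under the assumption of top-2 routing over 8 experts (the Mixtral-style setup). The function name and logits are hypothetical. Dropping experts shrinks the pool the router picks from, but each token still runs through `k` experts.

```python
from typing import List, Set


def top_k_experts(router_logits: List[float], k: int, dropped: Set[int]) -> List[int]:
    """Return indices of the k highest-scoring experts that were not dropped."""
    kept = [i for i in range(len(router_logits)) if i not in dropped]
    return sorted(kept, key=lambda i: router_logits[i], reverse=True)[:k]


# Illustrative router scores for a single token over 8 experts.
logits = [2.0, 0.3, 1.5, -0.7, 0.9, 1.1, -1.2, 0.4]

before = top_k_experts(logits, k=2, dropped=set())    # full expert pool
after = top_k_experts(logits, k=2, dropped={0, 2})    # 25% of experts removed

experts_stored_before = len(logits)
experts_stored_after = len(logits) - 2
```

The token still activates two experts after the drop, so per-token compute is unchanged while stored expert weights fall by 25%. In this example the token's two preferred experts were removed, so it is rerouted to lower-scoring ones, which is one plausible source of the accuracy loss.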

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Decoding speedup (combined recipe) | 6.05× | uncompressed Mixtral-8×7B | ×6.05 | LM-harness avg; inference length 2048 | Block Drop + AWQ reduced runtime by 6.05× | Table 3, Section 7 |
| Memory footprint (combined recipe) | 20.0 GB | 87.7 GB | −77.1% | Mixtral-8×7B forward pass (seq len 2048) | Memory reduced to 20.0 GB with Block Drop + AWQ | Table 3 |
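The deltas in these rows follow from simple arithmetic, sketched here with a hypothetical helper. Recomputing from the rounded values gives about −77.2% for memory, so the reported −77.1% presumably comes from unrounded measurements.

```python
def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Memory: 87.7 GB uncompressed vs 20.0 GB after Block Drop + AWQ.
memory_delta = relative_change(20.0, 87.7)

# A 6.05x speedup means each decode takes 1/6.05 of the baseline time.
runtime_fraction = 1.0 / 6.05
runtime_delta = relative_change(runtime_fraction, 1.0)

print(f"memory: {memory_delta:.1f}%  runtime: {runtime_delta:.1f}%")
```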

What To Try In 7 Days

Measure layer/block hidden-state similarity with ~128 C4 samples to find redundant modules.

Apply 4-bit AWQ quantization to your MoE model and check memory and accuracy on a small benchmark.

Experiment with removing a few blocks (Block Drop) and validate on critical tasks; keep a rollback plan.
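The first step above can be sketched as follows. The sketch assumes that redundancy is scored by the mean cosine similarity between a block's input and output hidden states over the calibration samples. That scoring choice, the function names, and the tiny vectors are illustrative assumptions, not the paper's code.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def block_similarity(inputs: List[Vector], outputs: List[Vector]) -> float:
    """Mean input/output similarity of one block over calibration samples."""
    return sum(cosine(i, o) for i, o in zip(inputs, outputs)) / len(inputs)


def most_redundant(similarities: List[float], n: int) -> List[int]:
    """Indices of the n blocks whose output is most similar to their input."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    return sorted(order[:n])


# Hypothetical hidden states: 3 blocks, 2 calibration samples each.
block_inputs = [[[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 1.0], [2.0, 1.0]],
                [[1.0, 0.0], [0.0, 2.0]]]
block_outputs = [[[0.0, 1.0], [1.0, 0.0]],    # block 0 rotates its input
                 [[1.0, 1.1], [2.1, 1.0]],    # block 1 barely changes it
                 [[1.0, 1.0], [2.0, 2.0]]]    # block 2 changes it moderately

similarities = [block_similarity(i, o)
                for i, o in zip(block_inputs, block_outputs)]
drop_candidates = most_redundant(similarities, n=1)
```

A block whose output is nearly identical to its input contributes little, so it is the first candidate to remove. In practice the hidden states would come from forward passes over the ~128 C4 samples rather than hand-written vectors.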

Optimization Features

Infra Optimization
fit model on single 20–24GB GPU after compression
Model Optimization
Layer Drop, Block Drop, Expert Drop, Expert Slimming
System Optimization
reduce cross-expert communication by removing MoE layers/blocks
Training Optimization
post-finetuning (few epochs)
Inference Optimization
4-bit quantization (AWQ/GPTQ); drop entire transformer blocks to reduce attention and KV cache
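The KV-cache effect of dropping blocks can be estimated with a short calculation. The sketch assumes a 16-bit cache and Mixtral-8×7B's public configuration of 32 blocks, 8 key/value heads, and head dimension 128. The count of 5 dropped blocks is a hypothetical choice for illustration, not a number from the paper.

```python
def kv_cache_bytes(num_blocks: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Size of the key/value cache; the factor 2 covers keys and values."""
    return (2 * num_blocks * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


MIB = 2 ** 20

# Assumed configuration (see lead-in); sequence length matches the 2048 used above.
full = kv_cache_bytes(num_blocks=32, num_kv_heads=8, head_dim=128, seq_len=2048)

# Hypothetical: 5 of 32 blocks removed by Block Drop.
pruned = kv_cache_bytes(num_blocks=27, num_kv_heads=8, head_dim=128, seq_len=2048)

print(f"full: {full // MIB} MiB, after Block Drop: {pruned // MIB} MiB")
```

The cache scales linearly with the number of blocks, so every dropped block removes the same share of cache per sequence. Layer Drop leaves attention in place and therefore does not shrink the cache.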

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4 (calibration/evaluation), Alpaca (finetuning samples), Pile (quantization calibration)

Risks & Boundaries

Limitations

Evaluations limited to Mixtral-8×7B and DeepSeek-MoE-16B; results may vary for different MoE designs.

Similarity scores are computed from 128 calibration samples and depend on which features are compared; drop decisions could shift under other data distributions.

When Not To Use

When you cannot afford any accuracy loss and cannot perform post-finetuning.

When the model architecture or serving pipeline cannot skip attention layers or their KV-cache entries without breaking runtime logic.

Failure Modes

Routing collapse or degraded expert selection after partial Expert Drop.

Semi-structured pruning causing large accuracy drops and no practical speedups.

Core Entities

Models

Mixtral-8×7B, DeepSeek-MoE-16B, Mistral-7B

Metrics

decoding speedup (×), memory (GB), FLOPs (T), accuracy

Datasets

C4, Alpaca, Pile, LM-harness benchmark (ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande)

Benchmarks

ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande