Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
3
Why It Matters For Business
Coarse structural compression plus 4-bit quantization can cut inference cost and memory enough to run large MoE models on cheaper GPUs while losing only a small fraction of task accuracy.
Summary TLDR
The paper studies ways to compress Mixture-of-Experts (MoE) language models. It finds that coarse-grained removal of entire MoE layers (Layer Drop) or whole transformer blocks (Block Drop) gives much larger speed and memory gains than removing individual experts (Expert Drop). Combining Block Drop with 4-bit quantization (AWQ) yields a 6.05× decoding speedup and reduces memory to ~20 GB while keeping over 92% of performance on Mixtral-8×7B. Expert Slimming (quantization, pruning) helps further; short post-finetuning can recover most lost accuracy.
Problem Statement
MoE models store many experts in every MoE layer and route tokens among them, which requires cross-expert communication. This raises GPU memory use and inference latency and makes deployment costly. Compression methods are needed that cut memory and compute while keeping most of the model's quality.
Main Contribution
Extend expert-level pruning to coarse-grained surgical removal: Layer Drop (remove MoE layers) and Block Drop (remove full transformer blocks).
Show that Layer/Block Drop often preserves quality better than Expert Drop while cutting compute and communication.
Integrate Expert Slimming (quantization/pruning) with Expert Trimming to achieve large speedups and memory reductions.
Provide empirical results on Mixtral-8×7B and DeepSeek-MoE-16B and release code.
Key Findings
Block Drop + 4-bit quantization produces major runtime and memory reductions while keeping most accuracy.
Dropping a fraction of experts (Expert Drop) often hurts accuracy substantially while yielding little speedup.
4-bit post-training quantization preserves accuracy well and cuts memory significantly.
Short post-finetuning largely recovers accuracy lost to coarse compression.
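To make the memory claim concrete, the following back-of-the-envelope sketch estimates weight storage at different bit widths. The parameter count used for Mixtral-8×7B (about 46.7 billion) is an assumption that does not come from this summary, and the estimate ignores activations, the KV cache, and quantization metadata such as scales and zero points.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores activations, KV cache, and quantization metadata.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumed total parameter count for Mixtral-8x7B (not stated in this summary).
MIXTRAL_PARAMS = 46.7e9

fp16_gb = weight_memory_gb(MIXTRAL_PARAMS, 16)
int4_gb = weight_memory_gb(MIXTRAL_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
print(f"reduction factor: {fp16_gb / int4_gb:.1f}x")
```

Under that assumption, 4-bit weights alone land in the low twenties of GB, so removing a few blocks on top plausibly brings the total into the range of the ~20 GB figure reported in the summary.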
Results
Decoding speedup (combined recipe): 6.05× on Mixtral-8×7B
Memory footprint (combined recipe): ~20 GB
Relative performance retained: over 92%
Quantization speedup (AWQ 4-bit)
Expert Drop impact
Post-finetuning recovery
Who Should Care
What To Try In 7 Days
Measure layer/block hidden-state similarity with ~128 C4 samples to find redundant modules.
Apply 4-bit AWQ quantization to your MoE model and check memory and accuracy on a small benchmark.
Experiment with removing a few blocks (Block Drop) and validate on critical tasks; keep a rollback plan.
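The first step above, measuring hidden-state similarity, can be sketched as follows. This is a minimal illustration that assumes cosine similarity between each block's input and output hidden states as the redundancy score; the summary does not specify the exact metric. The toy hidden states are synthetic stand-ins for activations you would collect from about 128 calibration samples, and `rank_blocks` is a hypothetical helper name.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_blocks(hidden_states):
    """Rank blocks from most to least redundant.

    hidden_states[i] holds the per-sample vectors entering block i;
    hidden_states[-1] holds the output of the last block.
    A block whose output is nearly identical to its input scores close to 1.
    """
    scores = []
    for i in range(len(hidden_states) - 1):
        pairs = zip(hidden_states[i], hidden_states[i + 1])
        sims = [cosine(x, y) for x, y in pairs]
        scores.append(sum(sims) / len(sims))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Synthetic stand-in for real activations: 4 blocks, 128 samples, 16 dims.
rng = random.Random(0)
n_samples, dim = 128, 16
noise_scale = [1.0, 1.0, 0.01, 1.0]  # block 2 barely changes its input

states = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_samples)]]
for scale in noise_scale:
    prev = states[-1]
    states.append([[x + scale * rng.gauss(0, 1) for x in vec] for vec in prev])

order, scores = rank_blocks(states)
print("blocks, most redundant first:", order)
print("similarity scores:", [round(s, 3) for s in scores])
```

With a real model, the vectors would come from forward hooks on each block, and the highest-scoring blocks would be the first candidates for removal, subject to validation on the tasks that matter.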
Optimization Features
Infra Optimization
- fit the compressed model on a single 20–24 GB GPU
Model Optimization
- Layer Drop
- Block Drop
- Expert Drop
- Expert Slimming
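As an illustration of the quantization side of Expert Slimming, here is a minimal sketch of group-wise 4-bit round-to-nearest quantization. It is not AWQ or GPTQ: AWQ additionally rescales weights based on activation statistics before quantizing, which this sketch omits. The group size of 128 and the function names are assumptions chosen for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group.

    Returns integer codes in [0, 2**bits - 1] plus the scale and offset
    needed to map them back to floats.
    """
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Quantize a flat weight list group by group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Synthetic weights standing in for one row of an expert's weight matrix.
rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(256)]
groups = quantize(weights)

restored = [w for codes, scale, lo in groups
            for w in dequantize_group(codes, scale, lo)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("groups:", len(groups), "max abs error:", max_err)
```

Each group stores 4-bit codes plus one scale and one offset, so the per-weight error is bounded by half the group's scale. Smaller groups reduce the error at the cost of more metadata.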
System Optimization
- reduce cross-expert communication by removing MoE layers/blocks
Training Optimization
- post-finetuning (few epochs)
Inference Optimization
- 4-bit quantization (AWQ/GPTQ)
- drop entire transformer blocks to reduce attention compute and KV-cache size
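The effect of Block Drop on the KV cache can be estimated with simple arithmetic, since each remaining block stores one key tensor and one value tensor per token. The sketch below assumes a Mixtral-like configuration (32 blocks, 8 KV heads, head dimension 128, 16-bit cache); these values and the dropped indices are illustrative assumptions rather than figures from this summary. The sequence length of 2048 matches the decoding setup mentioned under Limitations.

```python
def drop_blocks(blocks, drop_indices):
    """Return the block list with the given indices removed."""
    drop = set(drop_indices)
    return [b for i, b in enumerate(blocks) if i not in drop]


def kv_cache_bytes(n_blocks, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """KV-cache size: one K and one V tensor per block, per token."""
    return 2 * n_blocks * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


# Assumed Mixtral-like configuration (not taken from this summary).
N_BLOCKS, N_KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 8, 128, 2048

blocks = [f"block_{i}" for i in range(N_BLOCKS)]
# Hypothetical choice of blocks to drop, e.g. the most redundant ones.
kept = drop_blocks(blocks, drop_indices=[20, 21, 22, 23, 24])

full = kv_cache_bytes(N_BLOCKS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
reduced = kv_cache_bytes(len(kept), N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"blocks kept: {len(kept)} of {N_BLOCKS}")
print(f"KV cache: {full / 2**20:.0f} MiB -> {reduced / 2**20:.0f} MiB per sequence")
```

The saving scales linearly with the number of dropped blocks and with batch size, which is why removing whole blocks helps serving throughput in a way that removing individual experts does not.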
Reproducibility
Data Urls
- C4 (calibration/evaluation)
- Alpaca (finetuning samples)
- Pile (quantization calibration)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations limited to Mixtral-8×7B and DeepSeek-MoE-16B; results may vary for different MoE designs.
- Similarity scores depend on a 128-sample calibration set and on which features are compared; drop decisions could shift for other data distributions.
- Speedups reported for decoding with seq len 2048 and specific quantizers/hardware; numbers may differ on other setups.
When Not To Use
- When you cannot afford any accuracy loss and cannot perform post-finetuning.
- When the model architecture or serving pipeline cannot skip attention layers or their KV-cache entries without breaking runtime logic.
- When hardware cannot run low-bit quantized kernels efficiently.
Failure Modes
- Routing collapse or degraded expert selection after partial Expert Drop.
- Semi-structured pruning causing large accuracy drops and no practical speedups.
- Over-aggressive Block Drop removing blocks that specific tasks or domains depend on.
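The first failure mode can be illustrated with a small routing sketch. It assumes a top-2 router that applies a softmax over the selected logits, a common MoE design; the logits and the dropped expert are made up for illustration, and `route_top_k` is a hypothetical helper rather than any library's API.

```python
import math


def route_top_k(logits, k=2, dropped=frozenset()):
    """Pick the top-k surviving experts and softmax over their logits.

    Returns a list of (expert_index, weight) pairs, highest logit first.
    """
    alive = [(logit, i) for i, logit in enumerate(logits) if i not in dropped]
    if len(alive) < k:
        raise ValueError("fewer surviving experts than k")
    top = sorted(alive, reverse=True)[:k]
    peak = max(logit for logit, _ in top)
    exps = [math.exp(logit - peak) for logit, _ in top]
    total = sum(exps)
    return [(i, e / total) for (_, i), e in zip(top, exps)]


# Made-up router logits for one token over 8 experts.
logits = [2.0, 0.5, 1.0, -1.0, 0.0, 0.3, -0.5, 0.1]

full = route_top_k(logits)
pruned = route_top_k(logits, dropped=frozenset({0}))

print("all experts:    ", [(i, round(w, 3)) for i, w in full])
print("expert 0 removed:", [(i, round(w, 3)) for i, w in pruned])
```

Once the token's preferred expert is gone, the router falls back to an expert it scored lower, and the mixing weights shift accordingly. Repeated over many tokens, this is one plausible mechanism behind the degraded expert selection listed above.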
Core Entities
Models
- Mixtral-8×7B
- DeepSeek-MoE-16B
- Mistral-7B
Metrics
- decoding speedup (×)
- memory (GB)
- FLOPs (T)
- Accuracy
Datasets
- C4
- Alpaca
- Pile
- lm-evaluation-harness tasks (ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande)
Benchmarks
- ARC-C
- BoolQ
- HellaSwag
- MMLU
- OBQA
- PIQA
- RTE
- WinoGrande

