Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
SERE lowers batched decoding latency and inference cost for MoE-based LLMs with small accuracy loss and a one-line vLLM integration, making it a practical way to cut serving bills and improve responsiveness.
Summary TLDR
SERE is a runtime method for Mixture-of-Experts (MoE) models that reduces the number of active experts during batched decoding by re-routing tokens from low-impact experts to their most similar retained experts. The similarity matrix is precomputed on a small calibration set, so no retraining is needed. With a CUDA kernel and a one-line integration into vLLM, SERE speeds up decoding by up to 2.0× in the reported experiments while retaining most model quality. It is best suited to latency- and cost-sensitive MoE serving; it does not reduce FLOPs and offers little benefit in the compute-bound prefill stage.
Problem Statement
MoE models are efficient per token but batched serving activates many different experts across requests, inflating memory access and communication during decoding and eroding real-world latency gains. The challenge: reduce active experts at decode time without retraining or damaging model capability.
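The effect described above can be illustrated with a toy count of distinct experts touched by a batch. This is a minimal sketch, not a measurement: the layer shape (64 experts, top-4 routing) and the random router logits are assumptions chosen for illustration, and real routers are usually more skewed than random logits.

```python
import numpy as np

def active_experts(router_logits: np.ndarray, top_k: int) -> int:
    """Count distinct experts selected by at least one token in the batch."""
    # Indices of the top_k largest logits per token (order inside the top_k is irrelevant).
    topk_ids = np.argpartition(router_logits, -top_k, axis=1)[:, -top_k:]
    return int(np.unique(topk_ids).size)

rng = np.random.default_rng(0)
num_experts, top_k = 64, 4                      # hypothetical MoE layer shape
logits = rng.normal(size=(256, num_experts))    # stand-in for router logits

# Each token activates only top_k experts, but the union over the batch grows.
counts = {b: active_experts(logits[:b], top_k) for b in (1, 8, 64, 256)}
print(counts)
```

Each batch is a prefix of the next, so the count can only grow with batch size. Every distinct expert has its weights read during a decode step, which is why a larger union means more memory traffic.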
Main Contribution
- SERE: a similarity-based, dynamic re-routing method that redirects tokens from secondary experts to the most similar primary experts at decode time, preserving critical experts via a similarity threshold.
- A precompute-and-reuse pipeline: compute per-layer expert similarity matrices on a small calibration set and use them at runtime, with no model retraining required.
- A high-performance CUDA kernel and plug-and-play integration with vLLM (single-line change) to realize practical speedups in production.
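The re-routing idea can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, not the authors' kernel. The function name `sere_reroute`, the rule "substitute when similarity is at least ρ", and the choice to leave router weights untouched are all assumptions.

```python
import numpy as np

def sere_reroute(topk_ids, similarity, num_primary, rho):
    """Re-route secondary experts to their most similar primary expert.

    topk_ids:    [batch, k] expert ids per token, sorted by descending router score.
    similarity:  [E, E] precomputed expert-similarity matrix for this layer.
    num_primary: S, how many top experts per token count as primary.
    rho:         minimum similarity required before an expert is substituted.
    """
    topk_ids = np.asarray(topk_ids)
    num_experts = similarity.shape[0]
    # Primary set: union over the batch of each token's top-S experts.
    primary = np.unique(topk_ids[:, :num_primary])
    is_primary = np.zeros(num_experts, dtype=bool)
    is_primary[primary] = True

    # For every expert, find the most similar primary expert and that similarity.
    sim_to_primary = similarity[:, primary]              # [E, |primary|]
    best = primary[np.argmax(sim_to_primary, axis=1)]    # [E]
    best_sim = sim_to_primary.max(axis=1)                # [E]

    # Substitute only secondary experts whose best match clears the threshold;
    # the others are treated as critical and kept as routed.
    replace = (~is_primary) & (best_sim >= rho)
    mapping = np.where(replace, best, np.arange(num_experts))
    # A token may end up with the same expert twice; weight handling is not modeled here.
    return mapping[topk_ids]

# Toy example: 6 experts, 3 tokens, top-2 routing, S = 1.
sim = np.eye(6)
sim[3, 0], sim[3, 1] = 0.9, 0.2
sim[4, 0], sim[4, 1] = 0.1, 0.8
sim[5, 0], sim[5, 1] = 0.3, 0.4
routed = np.array([[0, 3], [1, 4], [0, 5]])
rerouted = sere_reroute(routed, sim, num_primary=1, rho=0.5)
print(np.unique(routed).size, "->", np.unique(rerouted).size)  # active experts before and after
```

In this toy batch the primary set is {0, 1}. Experts 3 and 4 are replaced by their nearest primary expert. Expert 5 stays because its best similarity (0.4) falls below ρ = 0.5.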
Key Findings
- SERE reduces decoding latency substantially in batched MoE serving (up to 2.0× in the reported experiments).
- Quality loss is small under typical settings (Top-2 primary experts).
- SERE is robust to calibration choices and similarity metrics, and its similarity matrices are cheap to compute.
- A CUDA kernel makes SERE practical and faster than a PyTorch version.
Results
Max reported speedup
TPOT (Qwen3-30B-A3B)
Accuracy
Who Should Care
What To Try In 7 Days
- Run a small calibration pass (e.g., 400×128 FineWeb-Edu) and compute Frobenius similarity matrices.
- Integrate the SERE CUDA kernel into your vLLM deployment (a single-line change, per the authors' repo).
- Start with Top-2 primary experts and tune the similarity threshold ρ to balance TPOT against accuracy on your key tasks.
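The calibration step can be prototyped offline before touching the serving stack. The sketch below assumes you have already collected every expert's output on the same calibration tokens for one layer. The normalization used here, 1 − ‖Yi − Yj‖_F / (‖Yi‖_F + ‖Yj‖_F), is one plausible Frobenius-based similarity and an assumption; the authors' exact formula may differ. The toy "experts" are random linear maps, not real model weights.

```python
import numpy as np

def frobenius_similarity(expert_outputs: np.ndarray) -> np.ndarray:
    """Pairwise expert similarity from calibration activations.

    expert_outputs: [E, N, D], the output of each of E experts on the same N tokens.
    Returns an [E, E] symmetric matrix in [0, 1]; 1.0 means identical outputs.
    """
    num_experts = expert_outputs.shape[0]
    flat = expert_outputs.reshape(num_experts, -1)
    norms = np.linalg.norm(flat, axis=1)                   # Frobenius norm per expert
    # Pairwise Frobenius distance; the broadcast is fine for toy sizes only.
    diff = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)
    denom = norms[:, None] + norms[None, :]
    return 1.0 - diff / np.maximum(denom, 1e-12)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 8))                          # stand-in for calibration hidden states
weights = rng.normal(size=(4, 8, 8))                       # 4 toy experts as linear maps
weights[1] = weights[0] + 0.01 * rng.normal(size=(8, 8))   # expert 1 is a near-copy of expert 0
outputs = np.einsum("nd,edh->enh", tokens, weights)        # [E, N, D]
sim = frobenius_similarity(outputs)
print(np.round(sim, 3))
```

One matrix per MoE layer would be stored and reused at decode time. Sweeping ρ against such a matrix shows how many expert pairs qualify for substitution before any accuracy evaluation is run.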
Optimization Features
Token Efficiency
- batch-aware reduction of active experts
Infra Optimization
- reduces memory-access and communication during decoding
Model Optimization
- preserve model weights (no retraining)
- preserve critical experts at runtime
System Optimization
- custom CUDA kernel for re-routing
- plug-and-play vLLM integration
Inference Optimization
- dynamic expert skipping
- similarity-based re-routing
- primary-expert selection (Top-S union)
- precomputed activation-based similarity matrices
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- SERE speeds up memory-bound decoding but does not reduce FLOPs, so the compute-bound prefill stage sees little benefit.
- Effectiveness varies with MoE architecture: models with few specialized experts are more sensitive to aggressive skipping.
- Requires a calibration dataset to compute activation-based similarity; calibration choices and data volume affect trade-offs.
- High skipping rates or low similarity thresholds can degrade accuracy on reasoning-heavy tasks (math/code).
When Not To Use
- When your bottleneck is compute-bound prefill rather than memory-bound decoding.
- If your model has very few or highly specialized experts and cannot tolerate any expert substitution.
- If you cannot run the required calibration pass or install a custom CUDA kernel in production.
Failure Modes
- Over-aggressive skipping causes large accuracy drops on reasoning and code tasks.
- Incorrect similarity estimates (poor calibration) can re-route tokens to unsuitable experts.
- Mismatch between prefill and decode expert sets (inconsistent selection) can introduce distribution shift and reduce quality.
Core Entities
Models
- Qwen1.5-MoE-A2.7B-Chat
- DeepSeekV2-Lite
- Qwen3-30B-A3B
Metrics
- Time per Output Token (TPOT)
- Accuracy
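For comparing runs, TPOT is commonly computed as the decode time divided by the number of tokens generated after the first one. The helper below uses that common definition as an assumption; the source does not state the exact formula. The numbers in the example are made up for illustration and are not results from the paper.

```python
def tpot(total_latency_s: float, ttft_s: float, num_output_tokens: int) -> float:
    """Time per output token: decode time averaged over tokens after the first."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (num_output_tokens - 1)

def speedup(baseline_tpot_s: float, optimized_tpot_s: float) -> float:
    """Ratio of baseline TPOT to optimized TPOT; values above 1.0 mean faster decoding."""
    return baseline_tpot_s / optimized_tpot_s

# Illustrative numbers only (seconds, seconds, tokens).
baseline = tpot(total_latency_s=8.4, ttft_s=0.4, num_output_tokens=101)
optimized = tpot(total_latency_s=6.4, ttft_s=0.4, num_output_tokens=101)
print(f"baseline={baseline:.3f}s optimized={optimized:.3f}s speedup={speedup(baseline, optimized):.2f}x")
```

Subtracting time to first token keeps the prefill stage out of the metric.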
Datasets
- FineWeb-Edu
- C4
- WIKI
- OpenCompass (benchmark suite)
Benchmarks
- CMMLU
- BoolQ
- BBH
- Math
- GSM8K
- Math401
- HumanEval
- MBPP

