Overview
The method needs only a precomputed similarity matrix and a CUDA kernel to deliver practical latency gains in batched MoE decoding; experiments on three MoE models support its effectiveness.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
SERE lowers batched decoding latency and inference cost for MoE-based LLMs with minimal accuracy loss. It integrates into vLLM with a one-line change, which makes it a practical way to reduce serving bills and improve responsiveness.
Summary TLDR
SERE is a runtime method for Mixture-of-Experts (MoE) models that reduces the number of active experts during batched decoding by re-routing tokens from low-impact experts to their most similar retained experts. The similarity matrix is precomputed on a small calibration set, so no retraining is needed. With a CUDA kernel and a one-line integration into vLLM, SERE speeds up decoding by up to 2.0× in experiments while retaining most model quality. It is best suited to latency- and cost-sensitive MoE serving; it does not reduce FLOPs and gives little benefit in the compute-bound prefill stage.
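
The re-routing idea can be illustrated with a minimal Python sketch. It is not the authors' CUDA kernel or vLLM integration. The function name `reroute_batch`, the parameter names, the way the retained set is built (union of each token's first `num_primary` experts), and the toy similarity values are all assumptions made for illustration.

```python
def reroute_batch(topk_per_token, sim, num_primary=2, rho=0.5):
    """Re-route secondary experts to their most similar retained expert.

    topk_per_token: per token, expert ids ordered by router score (best first).
    sim: precomputed expert-by-expert similarity matrix for one layer.
    ASSUMPTIONS (not confirmed by the source): the retained set is the union
    of every token's first `num_primary` experts, and a secondary expert is
    replaced only when its best retained match has similarity >= rho.
    """
    retained = {e for experts in topk_per_token for e in experts[:num_primary]}
    candidates = sorted(retained)  # sorted for deterministic tie-breaking
    rerouted = []
    for experts in topk_per_token:
        new_experts = []
        for e in experts:
            if e in retained:
                new_experts.append(e)
                continue
            best = max(candidates, key=lambda r: sim[e][r])
            # Below the threshold the expert is kept active rather than replaced.
            new_experts.append(best if sim[e][best] >= rho else e)
        rerouted.append(new_experts)
    return rerouted


# Toy layer with four hypothetical experts and a symmetric similarity matrix.
toy_sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.3],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.3, 0.2, 1.0],
]
toy_topk = [[0, 2], [1, 3]]  # two tokens, top-2 routing
routed = reroute_batch(toy_topk, toy_sim, num_primary=1, rho=0.5)
active_before = {e for experts in toy_topk for e in experts}
active_after = {e for experts in routed for e in experts}
```

In this toy batch, expert 2 is replaced by expert 0 (similarity 0.9), while expert 3 stays active because its best retained match falls below the threshold, so the active set shrinks from four experts to three. A repeated id within one token's list means both routing slots now point at the same expert.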
Problem Statement
MoE models are efficient per token, but batched serving activates many different experts across requests, which inflates memory access and communication during decoding and erodes real-world latency gains. The challenge is to reduce the number of active experts at decode time without retraining or damaging model capability.
Main Contribution
SERE: a similarity-based, dynamic re‑routing method that redirects tokens from secondary experts to the most similar primary experts at decode time, preserving critical experts via a similarity threshold.
A precompute-and-reuse pipeline: compute per-layer expert similarity matrices on a small calibration set and use them at runtime—no model retraining required.
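
The precompute step can be sketched in plain Python as follows. The source does not spell out the similarity metric here, so this sketch assumes a cosine similarity under the Frobenius inner product of expert outputs on shared calibration tokens; the function names and toy values are hypothetical, and the paper's exact definition may differ.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product of two equally shaped matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def expert_similarity(outputs):
    """Build a symmetric similarity matrix from per-expert calibration outputs.

    outputs[e] is a (tokens x hidden) matrix: expert e applied to the same
    calibration tokens. ASSUMPTION: similarity is the cosine under the
    Frobenius inner product, which is one plausible choice, not a confirmed one.
    """
    n = len(outputs)
    norms = [math.sqrt(frobenius_inner(o, o)) for o in outputs]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            denom = norms[i] * norms[j]
            s = frobenius_inner(outputs[i], outputs[j]) / denom if denom else 0.0
            sim[i][j] = sim[j][i] = s
    return sim


# Toy calibration outputs for three hypothetical experts on two tokens.
toy_outputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],   # same direction as expert 0
    [[0.0, 1.0], [-1.0, 0.0]],  # orthogonal to expert 0
]
S = expert_similarity(toy_outputs)
```

The matrix is computed once per layer and then only read at decode time, which is why no retraining is involved.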
Key Findings
SERE reduces decoding latency substantially in batched MoE serving.
Quality loss is small under typical settings (Top-2 primary experts).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Max reported speedup | 2.0× | baseline MoE decode | 2.0× faster | Qwen1.5 at QPS=24 (reported in Sec. 4.3 / Fig. 6) | Sec. 4.3: reports up to 2.0× speedup under evaluated QPS | — |
| TPOT (Qwen3-30B-A3B) | 32.12 ms | 44.40 ms (Qwen3 top-8 baseline) | ≈1.38× speedup | OpenCompass accuracy/TPOT (QPS=16), Table 3 | Table 3: baseline 44.40 ms → SERE top-2 32.12 ms | — |
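
The ≈1.38× figure in the TPOT row follows directly from the two latencies in the table. The short check below uses only those two reported values; the helper names are illustrative.

```python
def speedup(baseline_ms, method_ms):
    """Speedup factor: how many times faster the method is than the baseline."""
    return baseline_ms / method_ms


def latency_reduction(baseline_ms, method_ms):
    """Fraction of per-token latency removed relative to the baseline."""
    return 1.0 - method_ms / baseline_ms


# TPOT values from the table: Qwen3 top-8 baseline vs. SERE top-2.
qwen3_speedup = speedup(44.40, 32.12)
qwen3_reduction = latency_reduction(44.40, 32.12)
print(f"{qwen3_speedup:.2f}x speedup, {qwen3_reduction:.1%} lower TPOT")
```

A 1.38× speedup is the same result as a TPOT reduction of roughly 28%, which is the more useful number when estimating serving cost.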
What To Try In 7 Days
Run a small calibration pass (e.g., 400×128 FineWeb‑Edu) and compute Frobenius similarity matrices.
Integrate the SERE CUDA kernel into your vLLM deployment (single-line change from authors' repo).
Start with Top-2 primary experts and tune similarity threshold ρ to balance TPOT vs. accuracy on your key tasks.
Risks & Boundaries
Limitations
SERE speeds up memory-bound decoding but does not reduce FLOPs; the compute-bound prefill stage sees little benefit.
Effectiveness varies with MoE architecture: models with few specialized experts are more sensitive to aggressive skipping.
When Not To Use
When your bottleneck is compute-bound prefill rather than memory-bound decoding.
When your model has very few or highly specialized experts and cannot tolerate any expert substitution.
Failure Modes
Over-aggressive skipping causes large accuracy drops on reasoning and code tasks.
Incorrect similarity estimates (poor calibration) can re-route tokens to unsuitable experts.

