Cut MoE batch decoding latency by re-routing tokens to similar experts with a one-line vLLM change

February 7, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The method needs only a precomputed similarity matrix and a CUDA kernel to yield practical latency gains for batched MoE decoding; experiments across three MoE models support its effectiveness.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SERE lowers batched decoding latency and inference cost for MoE-based LLMs with minimal accuracy loss and with a one-line vLLM integration, making it practical to reduce serving bills and improve responsiveness.

Who Should Care

Summary TLDR

SERE is a runtime method for Mixture-of-Experts (MoE) models that reduces the number of active experts during batched decoding by re-routing tokens from low-impact experts to their most similar retained experts. The similarity matrix is precomputed on a small calibration set (no retraining). With a CUDA kernel and a one-line integration into vLLM, SERE speeds up decoding by up to 2.0× in experiments while retaining most of the model's quality. It is best suited to latency- and cost-sensitive MoE serving; it does not reduce FLOPs and gives little benefit in compute-bound prefill stages.
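A minimal Python sketch of the re-routing idea summarized above, for a single MoE layer. The function name `sere_reroute`, the parameters `num_primary` and `rho`, the list-based inputs, and the rule that a secondary expert is kept when no primary expert reaches the threshold are all assumptions made for illustration; the authors' implementation is a CUDA kernel and may differ in detail.

```python
def sere_reroute(topk_ids, similarity, num_primary=2, rho=0.5):
    """Re-route one layer's batch (illustrative sketch, not the authors' kernel).

    topk_ids:   per-token list of expert ids, ordered by router score (best first).
    similarity: square matrix (list of lists) with similarity[i][j] for experts i, j.
    Returns the per-token expert ids after re-routing.
    """
    # Primary experts: union over the batch of each token's top `num_primary` picks.
    primary = sorted({e for ids in topk_ids for e in ids[:num_primary]})
    primary_set = set(primary)

    rerouted = []
    for ids in topk_ids:
        new_ids = []
        for e in ids:
            if e in primary_set:
                new_ids.append(e)
                continue
            # Secondary expert: find the most similar primary expert.
            best = max(primary, key=lambda p: similarity[e][p])
            # Assumed rule: keep the expert if nothing retained is similar enough.
            new_ids.append(best if similarity[e][best] >= rho else e)
        rerouted.append(new_ids)
    return rerouted


# Toy example: 4 experts, 2 tokens, top-2 routing, 1 primary expert per token.
sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
]
print(sere_reroute([[0, 3], [1, 2]], sim, num_primary=1, rho=0.5))
# [[0, 1], [1, 0]]  -> active experts drop from 4 to 2
```

The sketch leaves open how routing weights are handled when a token ends up with the same expert twice; that detail would need to be taken from the authors' code.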

Problem Statement

MoE models are efficient per token, but batched serving activates many different experts across requests, which inflates memory access and communication during decoding and erodes real-world latency gains. The challenge: reduce active experts at decode time without retraining or damaging model capability.
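To see why batching erodes MoE sparsity, the following back-of-the-envelope sketch estimates how many distinct experts a batch touches. It assumes each token picks its experts uniformly at random and independently, which is an idealization (real routers are skewed), and the values 128 experts and top-8 are illustrative inputs rather than figures taken from the paper.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch in one layer.

    Assumes uniform, independent routing: each token misses a given expert
    with probability 1 - top_k / num_experts.
    """
    p_miss = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_miss ** batch_tokens)


for batch in (1, 4, 16, 64):
    active = expected_active_experts(num_experts=128, top_k=8, batch_tokens=batch)
    print(f"batch={batch:3d}  expected active experts ~ {active:.1f}")
```

Under these assumptions a single token touches exactly `top_k` experts, while the count climbs toward the full expert pool as the batch grows, which is the effect SERE targets.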

Main Contribution

SERE: a similarity-based, dynamic re‑routing method that redirects tokens from secondary experts to the most similar primary experts at decode time, preserving critical experts via a similarity threshold.

A precompute-and-reuse pipeline: compute per-layer expert similarity matrices on a small calibration set and use them at runtime—no model retraining required.
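One way the precompute step could look, sketched in plain Python. The exact similarity definition is not given here, so the formula below (one minus the Frobenius norm of the output difference, normalized by the sum of the two norms) is an assumption chosen only because it is symmetric and bounded in [0, 1]; `expert_outputs` is a hypothetical structure holding each expert's outputs on the same calibration tokens.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def frobenius_similarity(a, b):
    """Assumed similarity: 1 - ||a - b||_F / (||a||_F + ||b||_F), in [0, 1]."""
    diff = [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    denom = frobenius_norm(a) + frobenius_norm(b)
    if denom == 0.0:
        return 1.0  # both outputs are all zeros, treat as identical
    return 1.0 - frobenius_norm(diff) / denom


def similarity_matrix(expert_outputs):
    """Pairwise similarity for one layer.

    expert_outputs[i] is expert i's output (tokens x hidden) on the
    calibration tokens; every expert sees the same tokens.
    """
    n = len(expert_outputs)
    return [
        [frobenius_similarity(expert_outputs[i], expert_outputs[j]) for j in range(n)]
        for i in range(n)
    ]


# Tiny example: experts 0 and 1 agree closely, expert 2 is the negation of expert 0.
outputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
S = similarity_matrix(outputs)
```

The matrix is computed once per layer and stored; at decode time only lookups into it are needed, which is what makes the approach retraining-free.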

Key Findings

SERE reduces decoding latency substantially in batched MoE serving.

Numbers: Up to 2.0× speedup (reported for Qwen1.5 at QPS=24); Qwen3 TPOT 44.40 → 32.12 ms (≈1.38×)

Practical Use: Deploy SERE in batched vLLM inference to lower decoding latency and cost without retraining.

Evidence Ref: Sec. 4.3, Fig. 6, Table 3

Quality loss is small under typical settings (Top-2 primary experts).

Numbers: Examples: Qwen1.5 avg acc 48.52 → 47.25 (~97.4% retained); DeepSeekV2 54.76 → 55.48 (no loss); Qwen3 82.24 → 80.37 (~97.7%).

Practical Use: You can trade a few percent of accuracy for meaningful latency gains in many tasks.

Evidence Ref: Sec. 4.2, Tables 1–3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence / Evidence Ref |
|---|---|---|---|---|---|
| Max reported speedup | 2.0× | baseline MoE decode | 2.0× faster | Qwen1.5 at QPS=24 (reported in Sec. 4.3 / Fig. 6) | Sec. 4.3: reports up to 2.0× speedup under evaluated QPS |
| TPOT (Qwen3-30B-A3B) | 32.12 ms | 44.40 ms (Qwen3 top-8 baseline) | ≈1.38× speedup | OpenCompass accuracy/TPOT (QPS=16), Table 3 | Table 3: baseline 44.40 → SERE top-2 32.12 ms |
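The ratios quoted in the findings and results can be rechecked from the raw numbers with a few lines of Python. The helper names `speedup` and `retained_pct` are illustrative; the inputs are the figures reported above.

```python
def speedup(baseline_ms: float, new_ms: float) -> float:
    """Latency speedup factor: baseline time divided by new time."""
    return baseline_ms / new_ms


def retained_pct(baseline_acc: float, new_acc: float) -> float:
    """Share of baseline accuracy that is retained, in percent."""
    return 100.0 * new_acc / baseline_acc


print(f"Qwen3 TPOT speedup:  {speedup(44.40, 32.12):.2f}x")        # 1.38x
print(f"Qwen1.5 retained:    {retained_pct(48.52, 47.25):.1f}%")   # 97.4%
print(f"DeepSeekV2 retained: {retained_pct(54.76, 55.48):.1f}%")   # 101.3%
print(f"Qwen3 retained:      {retained_pct(82.24, 80.37):.1f}%")   # 97.7%
```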

What To Try In 7 Days

Run a small calibration pass (e.g., 400×128 FineWeb‑Edu) and compute Frobenius similarity matrices.

Integrate the SERE CUDA kernel into your vLLM deployment (single-line change from authors' repo).

Start with Top-2 primary experts and tune similarity threshold ρ to balance TPOT vs. accuracy on your key tasks.
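The tuning step above amounts to picking the fastest configuration that stays within an accuracy budget. A small sketch of that selection rule follows; the function `pick_config`, the dictionary layout, and the `min_retention` parameter are hypothetical, and the two entries reuse the Qwen3 numbers reported in the results (a real sweep would add one entry per measured ρ and Top-S setting).

```python
def pick_config(baseline, candidates, min_retention=0.97):
    """Return the lowest-TPOT candidate that retains enough accuracy.

    Each config is a dict with 'name', 'accuracy', and 'tpot_ms'.
    Falls back to the baseline when no candidate meets the budget.
    """
    acceptable = [
        c for c in candidates
        if c["accuracy"] / baseline["accuracy"] >= min_retention
    ]
    return min(acceptable, key=lambda c: c["tpot_ms"], default=baseline)


baseline = {"name": "Qwen3 top-8 baseline", "accuracy": 82.24, "tpot_ms": 44.40}
candidates = [
    {"name": "SERE top-2", "accuracy": 80.37, "tpot_ms": 32.12},
]

print(pick_config(baseline, candidates, min_retention=0.97)["name"])  # SERE top-2
print(pick_config(baseline, candidates, min_retention=0.99)["name"])  # Qwen3 top-8 baseline
```

Measuring accuracy on your own key tasks rather than on a benchmark average is the safer basis for the budget, since sensitivity varies by task.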

Optimization Features

Token Efficiency
batch-aware reduction of active experts
Infra Optimization
reduces memory-access and communication during decoding
Model Optimization
preserve model weights (no retraining), preserve critical experts at runtime
System Optimization
custom CUDA kernel for re-routing, plug-and-play vLLM integration
Inference Optimization
dynamic expert skipping, similarity-based re-routing, primary-expert selection (Top-S union), precomputed activation-based similarity matrices

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

SERE speeds up memory-bound decoding but does not reduce FLOPs; the compute-bound prefill stage sees little benefit.

Effectiveness varies with MoE architecture: models with few specialized experts are more sensitive to aggressive skipping.
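The first limitation can be made concrete with a toy count. In the sketch below, the number of distinct experts stands in for the expert weights that must be read in a decode step, and the number of token-expert pairs stands in for compute; both proxies, the helper `step_cost`, and the example assignments are simplifying assumptions for illustration.

```python
def step_cost(assignments):
    """Return (distinct_experts, token_expert_pairs) for one layer and one step.

    distinct_experts   ~ proxy for expert weights loaded from memory
    token_expert_pairs ~ proxy for FLOPs (each pair is one expert forward pass)
    """
    distinct = len({e for ids in assignments for e in ids})
    pairs = sum(len(ids) for ids in assignments)
    return distinct, pairs


before = [[0, 3], [1, 2]]  # original routing: 4 distinct experts
after = [[0, 1], [1, 0]]   # after re-routing secondaries to retained experts

print(step_cost(before))  # (4, 4)
print(step_cost(after))   # (2, 4)
```

Re-routing halves the experts touched in this toy case while the pair count stays the same, which matches the claim that the gain comes from memory access rather than from fewer FLOPs.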

When Not To Use

When your bottleneck is compute-bound prefill rather than memory-bound decoding.

If your model has very few or highly specialized experts and cannot tolerate any expert substitution.

Failure Modes

Over-aggressive skipping causes large accuracy drops on reasoning and code tasks.

Incorrect similarity estimates (poor calibration) can re-route tokens to unsuitable experts.

Core Entities

Models

Qwen1.5-MoE-A2.7B-Chat, DeepSeekV2-Lite, Qwen3-30B-A3B

Metrics

Time per Output Token (TPOT), Accuracy

Datasets

FineWeb-Edu, C4, WIKI, OpenCompass (benchmark suite)

Benchmarks

CMMLU, BoolQ, BBH, Math, GSM8K, Math401, HumanEval, MBPP