Cut MoE batch decoding latency by re-routing tokens to similar experts with a one-line vLLM change

Overview

Decision Snapshot: Needs Validation

The method needs only a precomputed similarity matrix and a custom CUDA kernel to deliver practical latency gains in batched MoE decoding; experiments on three MoE models support its effectiveness.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SERE lowers batched decoding latency and inference cost for MoE-based LLMs with minimal accuracy loss and with a one-line vLLM integration, making it practical to reduce serving bills and improve responsiveness.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Founder

Summary TLDR

SERE is a runtime method for Mixture-of-Experts (MoE) models that reduces the number of active experts during batched decoding by re-routing tokens from low-impact experts to their most similar retained experts. The similarity matrix is precomputed on a small calibration set, so no retraining is needed. With a custom CUDA kernel and a one-line integration into vLLM, SERE speeds up decoding by up to 2.0× in the reported experiments while retaining most model quality. It suits latency- and cost-sensitive MoE serving best; it does not reduce FLOPs and gives little benefit in the compute-bound prefill stage.

Problem Statement

MoE models are efficient per token, but batched serving activates many different experts across requests, which inflates memory access and communication during decoding and erodes real-world latency gains. The challenge is to reduce the number of active experts at decode time without retraining or damaging model capability.
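A back-of-envelope sketch of why batching erodes MoE sparsity. It assumes, purely for illustration, that each token picks its top-k experts uniformly at random and independently of other tokens (real routers are not uniform), and the 64-expert, top-4 configuration is hypothetical rather than taken from any of the evaluated models.

```python
def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts touched in one MoE layer by a decode batch.

    Illustrative assumption: every token selects `top_k` experts uniformly at
    random, independently of the other tokens in the batch.
    """
    # Probability that one token does NOT select a given expert.
    p_miss_one_token = 1.0 - top_k / num_experts
    # Probability that no token in the batch selects that expert.
    p_never_hit = p_miss_one_token ** batch_tokens
    return num_experts * (1.0 - p_never_hit)


if __name__ == "__main__":
    # Hypothetical layer: 64 experts, top-4 routing.
    for batch in (1, 8, 32):
        print(batch, round(expected_active_experts(64, 4, batch), 1))
```

Under this simplified model a single token touches only top-k experts, while the number of distinct experts whose weights must be loaded grows quickly with batch size, which is the effect SERE targets.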

Main Contribution

SERE: a similarity-based, dynamic re‑routing method that redirects tokens from secondary experts to the most similar primary experts at decode time, preserving critical experts via a similarity threshold.

A precompute-and-reuse pipeline: compute per-layer expert similarity matrices on a small calibration set and reuse them at runtime, with no model retraining required.
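The re-routing idea described above can be sketched in a few lines of plain Python. This is an illustrative reading of the summary, not the authors' CUDA kernel: the function name `reroute_batch`, the direction of the threshold test (re-route only when similarity is at least `rho`), and the choice to leave router weights untouched are all assumptions.

```python
def reroute_batch(topk_ids, similarity, num_primary, rho):
    """Re-route secondary experts to their most similar primary expert.

    topk_ids:    per-token expert ids, ordered by router score (best first).
    similarity:  precomputed square matrix, similarity[i][j] for experts i, j.
    num_primary: how many top choices per token count as primary (Top-S).
    rho:         similarity threshold; below it an expert is kept as critical
                 (assumed interpretation of the threshold).
    """
    # Primary experts: union over the batch of each token's top choices.
    primary = sorted({e for ids in topk_ids for e in ids[:num_primary]})
    mapping = {}
    for ids in topk_ids:
        for e in ids[num_primary:]:
            if e in primary or e in mapping:
                continue
            best = max(primary, key=lambda p: similarity[e][p])
            # Keep the expert when no retained expert is similar enough.
            mapping[e] = best if similarity[e][best] >= rho else e
    # Note: a token may end up with a repeated expert id; how the router
    # weights are merged in that case is not covered by this sketch.
    rerouted = [[mapping.get(e, e) for e in ids] for ids in topk_ids]
    active = {e for ids in rerouted for e in ids}
    return rerouted, active


# Toy example with 5 experts and made-up similarity values.
sim = [[1.0, 0.9, 0.1, 0.0, 0.4],
       [0.9, 1.0, 0.0, 0.2, 0.0],
       [0.1, 0.0, 1.0, 0.3, 0.0],
       [0.0, 0.2, 0.3, 1.0, 0.8],
       [0.4, 0.0, 0.0, 0.8, 1.0]]
rerouted, active = reroute_batch([[0, 1, 2], [3, 1, 4]], sim, num_primary=1, rho=0.5)
print(rerouted, sorted(active))
```

In the toy batch, experts 1 and 4 are folded into primary experts 0 and 3, while expert 2 has no sufficiently similar primary expert and stays active, so the layer loads three experts instead of five.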

Key Findings

SERE reduces decoding latency substantially in batched MoE serving.

Numbers: Up to 2.0× speedup (reported for Qwen1.5 at QPS=24); Qwen3 TPOT 44.40→32.12 ms (≈1.38×)

Practical Use: Deploy SERE in batched vLLM inference to lower decoding latency and cost without retraining.

Evidence Ref: Sec. 4.3, Fig. 6, Table 3

Quality loss is small under typical settings (Top-2 primary experts).

Numbers: Qwen1.5 average accuracy 48.52→47.25 (~97.4% retained); DeepSeekV2 54.76→55.48 (no loss); Qwen3 82.24→80.37 (~97.7% retained).

Practical Use: You can trade a few percent of accuracy for meaningful latency gains in many tasks.

Evidence Ref: Sec. 4.2, Tables 1–3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Max reported speedup	2.0×	baseline MoE decode	2.0× faster	Qwen1.5 at QPS=24 (reported in Sec. 4.3/Fig. 6)	Sec. 4.3: reports up to 2.0× speedup under evaluated QPS	—
TPOT (Qwen3-30B-A3B)	32.12 ms	44.40 ms (Qwen3 top-8 baseline)	≈1.38× speedup	OpenCompass accuracy/TPOT (QPS=16), Table 3	Table 3: baseline 44.40→SERE top-2 32.12 ms	—
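The derived figures quoted in this summary (the ≈1.38× speedup and the retained-accuracy percentages) follow directly from the raw numbers. A small check, with helper names `speedup` and `retained_pct` chosen here for illustration:

```python
def speedup(baseline_ms: float, new_ms: float) -> float:
    """Latency speedup factor: baseline time divided by new time."""
    return baseline_ms / new_ms


def retained_pct(baseline_acc: float, new_acc: float) -> float:
    """Share of baseline accuracy kept, in percent."""
    return 100.0 * new_acc / baseline_acc


# Raw values as quoted in Key Findings and the Results table.
print(round(speedup(44.40, 32.12), 2))       # Qwen3 TPOT -> 1.38
print(round(retained_pct(48.52, 47.25), 1))  # Qwen1.5 accuracy -> 97.4
print(round(retained_pct(82.24, 80.37), 1))  # Qwen3 accuracy -> 97.7
```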

What To Try In 7 Days

Run a small calibration pass (e.g., 400×128 FineWeb‑Edu) and compute Frobenius similarity matrices.

Integrate the SERE CUDA kernel into your vLLM deployment (single-line change from authors' repo).

Start with Top-2 primary experts and tune similarity threshold ρ to balance TPOT vs. accuracy on your key tasks.
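For the calibration step, the sketch below shows one way to turn expert outputs on a shared set of calibration tokens into a similarity matrix. The exact similarity definition is not given in this summary, so the normalized Frobenius distance used here is an assumption, as are the function names; the authors' repo should be treated as the reference.

```python
import math


def frobenius_norm(rows):
    """Frobenius norm of a matrix given as a list of row vectors."""
    return math.sqrt(sum(v * v for row in rows for v in row))


def expert_similarity(outputs):
    """Pairwise similarity between experts of one layer.

    outputs[i] holds the output vectors expert i produced on the SAME
    calibration tokens. Assumed similarity (not necessarily the paper's):
        1 - ||Y_i - Y_j||_F / (||Y_i||_F + ||Y_j||_F)
    which lies in [0, 1] by the triangle inequality.
    """
    n = len(outputs)
    norms = [frobenius_norm(o) for o in outputs]
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [[a - b for a, b in zip(ra, rb)]
                    for ra, rb in zip(outputs[i], outputs[j])]
            denom = norms[i] + norms[j]
            s = 1.0 - frobenius_norm(diff) / denom if denom > 0 else 1.0
            sim[i][j] = sim[j][i] = s
    return sim


# Toy outputs for 3 experts on 2 calibration tokens (made-up values).
toy_outputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
toy_sim = expert_similarity(toy_outputs)
```

The matrix is computed once per layer and stored; at serving time only lookups are needed, and the threshold ρ is then tuned against TPOT and accuracy on your own tasks.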

Optimization Features

Token Efficiency

batch-aware reduction of active experts

Infra Optimization

reduces memory-access and communication during decoding

Model Optimization

preserve model weights (no retraining); preserve critical experts at runtime

System Optimization

custom CUDA kernel for re-routing; plug-and-play vLLM integration

Inference Optimization

dynamic expert skipping; similarity-based re-routing; primary-expert selection (Top-S union); precomputed activation-based similarity matrices

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/JL-Cheng/SERE

Data URLs

https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

https://www.tensorflow.org/datasets/catalog/c4

Risks & Boundaries

Limitations

SERE speeds up memory-bound decoding but does not reduce FLOPs; the compute-bound prefill stage sees little benefit.

Effectiveness varies with MoE architecture: models with few specialized experts are more sensitive to aggressive skipping.

When Not To Use

When your bottleneck is compute-bound prefill rather than memory-bound decoding.

If your model has very few or highly specialized experts and cannot tolerate any expert substitution.

Failure Modes

Over-aggressive skipping causes large accuracy drops on reasoning and code tasks.

Incorrect similarity estimates (poor calibration) can re-route tokens to unsuitable experts.

Core Entities

Models

Qwen1.5-MoE-A2.7B-Chat, DeepSeekV2-Lite, Qwen3-30B-A3B

Metrics

Time per Output Token (TPOT), Accuracy

Datasets

FineWeb-Edu, C4, WIKI, OpenCompass (benchmark suite)

Benchmarks

CMMLU, BoolQ, BBH, Math, GSM8K, Math401, HumanEval, MBPP
