Cut MoE batch decoding latency by re-routing tokens to similar experts with a one-line vLLM change

February 7, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

Links

Abstract / PDF

Why It Matters For Business

SERE lowers batched decoding latency and inference cost for MoE-based LLMs with minimal accuracy loss and with a one-line vLLM integration, making it practical to reduce serving bills and improve responsiveness.

Summary TLDR

SERE is a runtime method for Mixture-of-Experts (MoE) models that reduces the number of active experts during batched decoding by re-routing tokens from low-impact experts to their most similar retained experts. The similarity matrix is precomputed on a small calibration set (no retraining). With a CUDA kernel and a one-line integration into vLLM, SERE speeds up decoding by up to 2.0× in experiments while retaining roughly 97% of baseline accuracy. It is best suited to latency- and cost-sensitive MoE serving; it does not reduce FLOPs and gives little benefit in the compute-bound prefill stage.

Problem Statement

MoE models are efficient per token but batched serving activates many different experts across requests, inflating memory access and communication during decoding and eroding real-world latency gains. The challenge: reduce active experts at decode time without retraining or damaging model capability.

Main Contribution

SERE: a similarity-based, dynamic re‑routing method that redirects tokens from secondary experts to the most similar primary experts at decode time, preserving critical experts via a similarity threshold.
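The re-routing rule described above can be illustrated with a minimal sketch. It is not the authors' kernel: the function name `reroute`, the list-of-lists similarity matrix, and the rule "keep the original expert when no primary expert reaches the threshold" are assumptions made for illustration, based only on the description of a similarity threshold that preserves critical experts.

```python
def reroute(topk_ids, primary, sim, rho):
    """Re-route one token's routed experts onto the primary set.

    topk_ids: expert ids the router selected for this token
    primary:  set of expert ids kept active for the whole batch
    sim:      sim[i][j] is the precomputed similarity of experts i and j
    rho:      similarity threshold (assumed semantics: below it, the
              original expert is treated as critical and kept)
    """
    candidates = sorted(primary)  # fixed order so ties resolve deterministically
    out = []
    for e in topk_ids:
        if e in primary:
            out.append(e)  # already a primary expert, nothing to do
            continue
        best = max(candidates, key=lambda p: sim[e][p])
        # Redirect only if the substitute is similar enough.
        out.append(best if sim[e][best] >= rho else e)
    return out


# Illustrative 4-expert similarity matrix (made-up values).
sim = [
    [1.0, 0.4, 0.9, 0.1],
    [0.4, 1.0, 0.2, 0.3],
    [0.9, 0.2, 1.0, 0.2],
    [0.1, 0.3, 0.2, 1.0],
]
print(reroute([0, 2], {0, 1}, sim, rho=0.5))  # expert 2 is redirected to expert 0
print(reroute([1, 3], {0, 1}, sim, rho=0.5))  # expert 3 has no close match and is kept
```

After re-routing, two slots of a token may point at the same expert; how the routing weights are combined in that case is not covered by this sketch.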

A precompute-and-reuse pipeline: compute per-layer expert similarity matrices on a small calibration set and use them at runtime—no model retraining required.
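A rough sketch of what a precomputed, activation-based similarity matrix could look like for one layer follows. The normalization used here (one minus the Frobenius distance divided by the sum of the two Frobenius norms) is an assumption chosen so that scores fall in [0, 1]; the paper's exact formula may differ. The names `frobenius` and `similarity_matrix` are hypothetical.

```python
import math


def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))


def similarity_matrix(outputs):
    """Pairwise expert similarity for one layer.

    outputs[e] is the (tokens x hidden) output of expert e on the same
    calibration tokens. Assumed score: 1 - ||A - B||_F / (||A||_F + ||B||_F).
    """
    n = len(outputs)
    norms = [frobenius(o) for o in outputs]
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [[x - y for x, y in zip(ra, rb)]
                    for ra, rb in zip(outputs[i], outputs[j])]
            denom = norms[i] + norms[j]
            score = 1.0 - frobenius(diff) / denom if denom else 1.0
            sim[i][j] = sim[j][i] = score  # symmetric by construction
    return sim


# Three toy experts on two calibration tokens (made-up activations).
outputs = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],    # identical to expert 0
    [[-1.0, 0.0], [0.0, -1.0]],  # opposite of expert 0
]
sim_layer = similarity_matrix(outputs)
```

Because the matrix depends only on the calibration activations, it can be computed once per layer and stored alongside the model for reuse at serving time.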

A high-performance CUDA kernel and plug-and-play integration with vLLM (single-line change) to realize practical speedups in production.

Key Findings

SERE reduces decoding latency substantially in batched MoE serving.

Numbers: Up to 2.0× speedup (reported for Qwen1.5 at QPS=24); Qwen3 TPOT 44.40→32.12 ms (≈1.38×)

Quality loss is small under typical settings (Top-2 primary experts).

Numbers: Examples: Qwen1.5 avg acc 48.52→47.25 (~97.4% retained); DeepSeekV2 54.76→55.48 (no loss); Qwen3 82.24→80.37 (~97.7%).
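The ratios quoted in these findings follow directly from the reported numbers. A short check, using only the figures stated above (the helper names `speedup` and `retained_pct` are illustrative):

```python
def speedup(baseline_ms, new_ms):
    """Latency speedup factor: baseline time divided by new time."""
    return baseline_ms / new_ms


def retained_pct(baseline_acc, new_acc):
    """Share of baseline accuracy retained, in percent."""
    return 100.0 * new_acc / baseline_acc


print(round(speedup(44.40, 32.12), 2))       # Qwen3 TPOT -> 1.38
print(round(retained_pct(48.52, 47.25), 1))  # Qwen1.5 -> 97.4
print(round(retained_pct(82.24, 80.37), 1))  # Qwen3 -> 97.7
print(round(retained_pct(54.76, 55.48), 1))  # DeepSeekV2 -> 101.3 (no loss)
```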

SERE is robust to calibration choices and similarity metrics and is cheap to compute.

Numbers: Activation-based Frobenius similarity calibration took ~28 s, whereas CKA variants were much slower; K=2 performance stable across C

A CUDA kernel makes SERE practical and faster than a PyTorch version.

Numbers: CUDA implementation ≈1.5× faster than PyTorch implementation for re-routing overhead

Results

Max reported speedup

Value: 2.0×

Baseline: baseline MoE decode

TPOT (Qwen3-30B-A3B)

Value: 32.12 ms

Baseline: 44.40 ms (Qwen3 top-8 baseline)

Accuracy

Value: ≈97% retained

Baseline: original model accuracy

Who Should Care

What To Try In 7 Days

Run a small calibration pass (e.g., 400×128 FineWeb‑Edu) and compute Frobenius similarity matrices.

Integrate the SERE CUDA kernel into your vLLM deployment (single-line change from authors' repo).

Start with Top-2 primary experts and tune similarity threshold ρ to balance TPOT vs. accuracy on your key tasks.
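One way to organize the threshold sweep from the last step is to record TPOT and accuracy for each ρ and pick the fastest setting that stays above an accuracy floor. The sketch below assumes you already have those measurements from your own benchmark harness; the function name `pick_threshold`, the 97% floor, and every number in `runs` are placeholders, not results from the paper.

```python
def pick_threshold(runs, baseline_acc, min_retained=0.97):
    """Return the fastest run whose accuracy stays above the retention floor.

    runs: list of dicts with keys "rho", "tpot_ms", "acc"
    Returns None when no run meets the floor.
    """
    ok = [r for r in runs if r["acc"] / baseline_acc >= min_retained]
    if not ok:
        return None
    return min(ok, key=lambda r: r["tpot_ms"])


# Placeholder measurements for illustration only (not from the paper).
runs = [
    {"rho": 0.9, "tpot_ms": 40.0, "acc": 81.9},
    {"rho": 0.7, "tpot_ms": 34.0, "acc": 80.5},
    {"rho": 0.5, "tpot_ms": 30.0, "acc": 76.0},
]
best = pick_threshold(runs, baseline_acc=82.0)
print(best["rho"])  # the lowest-latency setting that keeps >= 97% accuracy
```

Running the sweep separately on reasoning-heavy tasks (math, code) is sensible, since those are reported as the most sensitive to aggressive skipping.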

Optimization Features

Token Efficiency

  • batch-aware reduction of active experts

Infra Optimization

  • reduces memory-access and communication during decoding

Model Optimization

  • preserve model weights (no retraining)
  • preserve critical experts at runtime

System Optimization

  • custom CUDA kernel for re-routing
  • plug-and-play vLLM integration

Inference Optimization

  • dynamic expert skipping
  • similarity-based re-routing
  • primary-expert selection (Top-S union)
  • precomputed activation-based similarity matrices
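The "Top-S union" selection listed above can be read as: take each token's S highest-scoring experts and keep the union across the batch as the primary set. That reading is an interpretation of the bullet, and the function name `primary_experts` and the input layout are assumptions; a minimal sketch:

```python
def primary_experts(router_scores, s):
    """Union over the batch of each token's top-s experts.

    router_scores: one list of per-expert router scores per token
    s:             number of top experts each token contributes
    """
    primary = set()
    for scores in router_scores:
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        primary.update(ranked[:s])
    return primary


# Two tokens, four experts (made-up router scores).
batch_scores = [
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
print(primary_experts(batch_scores, s=1))
print(primary_experts(batch_scores, s=2))
```

Under this reading, a larger S grows the primary set, so fewer experts are skipped and more of the original routing is preserved.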

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SERE speeds up memory-bound decoding but does not reduce compute (FLOPs); the compute-bound prefill stage sees little benefit.
  • Effectiveness varies with MoE architecture: models with few specialized experts are more sensitive to aggressive skipping.
  • Requires a calibration dataset to compute activation-based similarity; calibration choices and data volume affect trade-offs.
  • High skipping rates or low similarity thresholds can degrade reasoning-heavy tasks (math/code).

When Not To Use

  • When your bottleneck is compute-bound prefill rather than memory-bound decoding.
  • If your model has very few or highly specialized experts and cannot tolerate any expert substitution.
  • If you cannot run the required calibration pass or install a custom CUDA kernel in production.

Failure Modes

  • Over-aggressive skipping causes large accuracy drops on reasoning and code tasks.
  • Incorrect similarity estimates (poor calibration) can re-route tokens to unsuitable experts.
  • Mismatch between prefill and decode expert sets (inconsistent selection) can introduce distribution shift and reduce quality.

Core Entities

Models

  • Qwen1.5-MoE-A2.7B-Chat
  • DeepSeekV2-Lite
  • Qwen3-30B-A3B

Metrics

  • Time per Output Token (TPOT)
  • Accuracy

Datasets

  • FineWeb-Edu
  • C4
  • WIKI
  • OpenCompass (benchmark suite)

Benchmarks

  • CMMLU
  • BoolQ
  • BBH
  • Math
  • GSM8K
  • Math401
  • HumanEval
  • MBPP