Route each token to a small or large model to cut memory movement and speed up LLM decoding

February 4, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao

Links

Abstract / PDF

Why It Matters For Business

CITER lowers memory-transfer cost and latency during decoding by routing only the important tokens to the large model. That cuts cloud/GPU bills and improves real-time responsiveness while preserving output quality.

Summary TLDR

CITER trains a small router that decides, token by token, whether a cheap small model (SLM) or an expensive large model (LLM) should generate the next token. The router is trained with pairwise preferences and an RL-style objective that rewards keeping accuracy while lowering inference cost (measured as data transfer / KV cache movement). A shortcut uses ground-truth tokens to avoid full rollouts for most tokens, cutting training cost. On five benchmarks, CITER matches LLM quality while reducing memory-transfer-based inference cost by ~27–32% versus prior token-level baselines, and it converges in two training rounds.
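
To make the token-by-token decision concrete, here is a minimal sketch of a routed decoding loop. It is an illustration under stated assumptions, not the paper's implementation: `routed_decode`, the toy models, and the toy router are hypothetical names, and the router score is assumed to be the probability that the SLM suffices (the paper's convention for the score and threshold may differ).

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "model" maps a token prefix to the next token,
# and a "router" maps a prefix to the probability that the SLM suffices.
NextToken = Callable[[List[str]], str]
Router = Callable[[List[str]], float]


def routed_decode(
    prompt: List[str],
    slm: NextToken,
    llm: NextToken,
    router: Router,
    tau: float = 0.5,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Generate tokens, picking the SLM or LLM at each step from the router score."""
    tokens = list(prompt)
    choices: List[str] = []
    for _ in range(max_new_tokens):
        use_slm = router(tokens) >= tau  # assumed convention: high score -> SLM
        model = slm if use_slm else llm
        choices.append("SLM" if use_slm else "LLM")
        next_token = model(tokens)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens[len(prompt):], choices


# Toy setup: the SLM gets every token right except the critical one ("42").
ANSWER = ["the", "answer", "is", "42", "<eos>"]


def toy_llm(prefix: List[str]) -> str:
    return ANSWER[len(prefix) - 1]


def toy_slm(prefix: List[str]) -> str:
    token = ANSWER[len(prefix) - 1]
    return "7" if token == "42" else token


def toy_router(prefix: List[str]) -> float:
    # Low confidence in the SLM only at the critical position.
    return 0.1 if len(prefix) - 1 == 3 else 0.9


generated, choices = routed_decode(["Q"], toy_slm, toy_llm, toy_router)
```

In this toy run only one of five tokens goes to the LLM, yet the output equals what the LLM alone would produce.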

Problem Statement

Large language models give high-quality text but are costly at decode time. Existing routing methods usually pick one model per query, wasting compute when only a few tokens actually need the LLM. The paper asks: can we route at token granularity to send only critical tokens to the LLM, thereby saving inference cost while keeping output quality?

Main Contribution

Token-level routing framework (CITER) that chooses SLM or LLM per token to reduce inference cost.

Preference-based router training using pairwise labels and a surrogate shortcut to avoid full-rollout reward evaluation.

Iterative training and KV-cache reuse to reduce switching overhead and speed convergence.

Empirical evaluation on 5 benchmarks showing sizable cost/latency gains over query-level and prior token-level baselines.

Key Findings

CITER reduces inference data-transfer cost vs prior token-level baseline on evaluated benchmarks.

Numbers: Up to 27% fewer data transfers (Qwen family); up to 32% on Llama3.1

Training the router to account for long-term effects materially improves trade-offs.

Numbers: Compared to the ablation (CITER-S), up to 42% cost reduction or 23% accuracy gain on benchmarks

Router training converges quickly with iterative preference collection.

Numbers: Second iteration gives ~5% extra cost reduction and 2–3% accuracy uplift vs first

Latency vs accuracy trade-offs: CITER hits low-latency points other methods do not.

Numbers: Example: 80.8% accuracy at 4.0s latency (CITER) vs 87.0% at 10.9s (Speculative Decoding)

Results

inference_data_transfer

Value: up to 27% reduction

Baseline: Co-LLM (token-level prior work)

Accuracy

Value: up to +17% accuracy

Baseline: Co-LLM

cost_reduction_vs_ablation

Value: up to 42% reduction

Baseline: CITER-S (no long-term influence)

Accuracy

Value: 80.8% @ 4.0s latency

Baseline: Speculative Decoding example

compatibility_cost_reduction

Value: up to 32% reduction

Baseline: Co-LLM on Llama3.1 series

Who Should Care

What To Try In 7 Days

Measure per-token KV-cache movement on your decoder and confirm decoding is memory-bound (Sec E).

Swap in a small router MLP that reads the SLM hidden state and try a simple threshold τ to route tokens to your SLM/LLM.

Use the paper's shortcut: label tokens as SLM-preferred when SLM next-token equals ground truth to avoid full rollouts for most tokens.
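
The shortcut in the last step can be sketched as a simple comparison. This is a minimal sketch covering only what is stated above: a token is labeled SLM-preferred when the SLM's next-token prediction equals the ground truth. The function name and the `"UNRESOLVED"` label are hypothetical, and how the remaining tokens are resolved (for example with LLM calls or rollouts) is left outside the sketch.

```python
from typing import List


def shortcut_labels(slm_predictions: List[str], ground_truth: List[str]) -> List[str]:
    """Label each position SLM-preferred when the SLM already matches the ground truth."""
    if len(slm_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have the same length")
    labels = []
    for predicted, gold in zip(slm_predictions, ground_truth):
        # A match needs no rollout; anything else is deferred (hypothetical label).
        labels.append("SLM" if predicted == gold else "UNRESOLVED")
    return labels


gold_tokens = ["the", "answer", "is", "42"]
slm_tokens = ["the", "answer", "is", "7"]
labels = shortcut_labels(slm_tokens, gold_tokens)
unresolved_fraction = labels.count("UNRESOLVED") / len(labels)
```

Here three of four tokens are labeled without any extra model call, which is the source of the training-cost saving.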

Agent Features

Memory

  • uses SLM hidden state input to router
  • maintains separate KV caches for SLM and LLM

Tool Use

  • KV-cache reuse
  • threshold-based routing

Architectures

  • router MLP

Collaboration

  • two-model collaboration (SLM + LLM)

Optimization Features

Token Efficiency

  • route non-critical tokens to SLM
  • threshold τ to trade quality vs cost
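
The effect of the threshold τ can be explored offline with a small sweep. The sketch below is illustrative only: it assumes the router score is the probability that the SLM suffices, and it uses the strong simplification that any token sent to the LLM is correct. The scores, correctness flags, and function name are made up for the example.

```python
from typing import List, Tuple


def sweep_threshold(
    scores: List[float], slm_correct: List[bool], taus: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (tau, fraction routed to LLM, proxy token accuracy) for each tau."""
    rows = []
    total = len(scores)
    for tau in taus:
        to_llm = [score < tau for score in scores]
        llm_fraction = sum(to_llm) / total
        # Simplifying assumption: the LLM is always right on tokens it receives.
        correct = sum(1 for routed, ok in zip(to_llm, slm_correct) if routed or ok)
        rows.append((tau, llm_fraction, correct / total))
    return rows


# Hypothetical router scores and per-token SLM correctness.
scores = [0.95, 0.9, 0.2, 0.8, 0.4, 0.99]
slm_correct = [True, True, False, True, False, True]
rows = sweep_threshold(scores, slm_correct, taus=[0.0, 0.3, 0.5, 1.01])
```

A higher τ sends more tokens to the LLM: in this toy data, τ = 0.5 already recovers full proxy accuracy while routing only a third of the tokens to the large model.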

Infra Optimization

  • optimize for memory-bound decode (KV cache movement)
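
A back-of-the-envelope check for whether decoding is memory-bound can be sketched as follows. The model shape and hardware figures are hypothetical placeholders, not numbers from the paper. The estimate uses the common approximation of about 2 FLOPs per parameter per generated token and ignores attention FLOPs and activation traffic.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of keys and values stored per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def decode_step_bytes(
    param_count: int, bytes_per_param: int, context_len: int, kv_bytes: int
) -> int:
    """Rough bytes read per decode step: all weights plus the KV cache so far."""
    return param_count * bytes_per_param + context_len * kv_bytes


def is_memory_bound(
    flops: float, bytes_moved: float, peak_flops: float, peak_bandwidth: float
) -> bool:
    """True when moving the data takes longer than doing the arithmetic."""
    return bytes_moved / peak_bandwidth > flops / peak_flops


# Hypothetical model: 7B parameters in fp16, 32 layers, 8 KV heads of dim 128.
kv_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
step_bytes = decode_step_bytes(7_000_000_000, 2, context_len=2048, kv_bytes=kv_bytes)
step_flops = 2 * 7_000_000_000  # approximation: 2 FLOPs per parameter per token

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
memory_bound = is_memory_bound(step_flops, step_bytes, 100e12, 1e12)
```

With these placeholder numbers the memory time per step is roughly 100 times the compute time, which is the regime where reducing data movement pays off.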

System Optimization

  • measure data-transfer amount (KV cache movement) as the cost metric
  • deploy SLM and LLM on same device to avoid switch cost

Training Optimization

  • preference-based policy optimization
  • iterative preference collection
  • surrogate reward shortcut (use ground-truth next token)
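
As a rough illustration of fitting a router to preference labels, the sketch below trains a plain logistic regression with gradient descent. This is a swapped-in simplification, not the paper's objective or architecture: the paper describes a router MLP trained with preference-based policy optimization, while this example uses a linear model and hypothetical 2-dimensional features standing in for SLM hidden states.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def router_score(w: List[float], b: float, x: List[float]) -> float:
    """Probability that the SLM is preferred for features x (assumed convention)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_router(
    features: List[List[float]],
    prefers_slm: List[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights with batch gradient descent on the logistic loss."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, prefers_slm):
            error = router_score(w, b, x) - y  # gradient of the loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += error * x[i]
            grad_b += error
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical features and preference labels (1 = SLM preferred, 0 = LLM preferred).
features = [[2.0, 0.1], [1.5, 0.3], [-1.0, 0.2], [-2.0, 0.5]]
prefers_slm = [1, 1, 0, 0]
w, b = train_router(features, prefers_slm)
```

In an iterative setup, the trained router would be used to collect a fresh round of preference labels and the fit repeated.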

Inference Optimization

  • token-level routing
  • model switching per token
  • KV-cache reuse to avoid recompute
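
One way to picture KV-cache reuse across model switches is to let each model keep its own cache and, when it is selected, process only the tokens it has not yet seen. The simulation below counts tokens processed under that assumption and compares it with recomputing the full prefix at every step. The class and function names are hypothetical, and the catch-up policy is an assumption for illustration rather than a description of the paper's system.

```python
from typing import List, Tuple


class CachedModel:
    """Tracks how many prefix tokens already have cached keys and values."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cached_len = 0
        self.tokens_processed = 0

    def step(self, prefix_len: int) -> int:
        """Process only the tokens missing from this model's cache."""
        missing = prefix_len - self.cached_len
        self.tokens_processed += missing
        self.cached_len = prefix_len
        return missing


def simulate(prompt_len: int, choices: List[str]) -> Tuple[int, int, int]:
    """Return (SLM tokens processed, LLM tokens processed, no-cache total)."""
    slm, llm = CachedModel("SLM"), CachedModel("LLM")
    prefix_len = prompt_len
    no_cache_total = 0
    for choice in choices:
        model = slm if choice == "SLM" else llm
        model.step(prefix_len)
        no_cache_total += prefix_len  # cost if the whole prefix were recomputed
        prefix_len += 1  # one new token is generated per step
    return slm.tokens_processed, llm.tokens_processed, no_cache_total


result = simulate(prompt_len=4, choices=["SLM", "SLM", "LLM", "SLM"])
```

In this toy trace the two caches together process 13 tokens, against 22 if every step recomputed the prefix from scratch.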

Reproducibility

Data Urls

  • Commonsense QA
  • ARC-Challenge
  • MMLU-Professional Psychology
  • GSM8k
  • MATH

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a reasonably capable SLM; if SLM is very weak, routing gains shrink.
  • Router training needs LLM calls for preference labels, adding training-time cost.
  • Shortcut relies on ground-truth next-token checks; real-world unlabeled deployment needs a different labeling strategy.
  • Evaluation focuses on academic QA/math benchmarks; results on open-ended generation may differ.

When Not To Use

  • When your SLM has much lower token accuracy than the SLMs tested in the paper.
  • When you cannot afford LLM calls during router training.
  • When your workload is fully open-ended and lacks ground-truth tokens for shortcut labeling.

Failure Modes

  • Router misroutes a critical token to SLM, causing irrecoverable downstream errors.
  • Preference labeling bias if both models systematically fail on a class of tokens.
  • Reward hacking or overfitting the router to dataset-specific token patterns.

Core Entities

Models

  • Qwen2-1.5B
  • Qwen2-7B
  • Qwen2-72B
  • Llama3.1-8B
  • Llama3.1-70B

Metrics

  • Accuracy
  • data_transformation_amount
  • latency

Datasets

  • Commonsense QA
  • ARC-Challenge
  • MMLU-Professional Psychology
  • GSM8k
  • MATH

Benchmarks

  • Commonsense QA
  • ARC-Challenge
  • MMLU-Professional Psychology
  • GSM8k
  • MATH