Route each token to a small or large model to cut memory movement and speed up LLM decoding

February 4, 2025 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical: a small router plus an SLM can cut decoding memory transfer and latency on typical benchmarks. The experiments use public datasets and common model families, but adopting the method requires integrating KV-cache reuse and router-training infrastructure.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao


Why It Matters For Business

CITER lowers memory-transfer cost and latency during decoding by routing only the important tokens to the large model. This cuts cloud/GPU bills and improves real-time responsiveness while preserving output quality.

Summary TLDR

CITER trains a small router that decides, token by token, whether a cheap small model (SLM) or an expensive large model (LLM) should generate the next token. The router is trained with pairwise preferences and an RL-style objective that rewards preserving accuracy while lowering inference cost (measured as data transfer, i.e. KV-cache movement). A shortcut uses ground-truth tokens to avoid full rollouts for most tokens, which cuts training cost. On five benchmarks, CITER reaches accuracy comparable to the prior token-level baseline (Co-LLM) while reducing memory-transfer-based inference cost by up to 27% on the Qwen family and up to 32% on Llama3.1, and router training converges in two rounds.
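
To make the token-by-token decision concrete, here is a minimal sketch of a routed decoding loop. It is an illustration under stated assumptions, not the authors' implementation: the `NextToken` and `Router` interfaces, the toy models, and the convention that a score at or above `tau` selects the SLM are all hypothetical. The summary says the real router reads the SLM hidden state; the toy router here looks at raw tokens only to keep the example self-contained.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for illustration only:
#   a "model" maps the tokens so far to its next token;
#   a "router" maps the tokens so far to a score in [0, 1],
#   where a higher score means the SLM is trusted for the next token.
NextToken = Callable[[List[str]], str]
Router = Callable[[List[str]], float]


def routed_decode(
    prompt: List[str],
    slm: NextToken,
    llm: NextToken,
    router: Router,
    tau: float = 0.5,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Generate tokens one at a time, choosing SLM or LLM for each token."""
    tokens = list(prompt)
    picks: List[str] = []
    for _ in range(max_new_tokens):
        # Assumed convention: score >= tau routes to the SLM, otherwise the LLM.
        use_slm = router(tokens) >= tau
        picks.append("SLM" if use_slm else "LLM")
        next_token = (slm if use_slm else llm)(tokens)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[len(prompt):], picks


# Toy stand-ins: the "SLM" gets the arithmetic wrong, the "LLM" gets it right.
def toy_slm(tokens: List[str]) -> str:
    script = ["2", "+", "2", "=", "5", "<eos>"]
    return script[min(len(tokens), len(script) - 1)]


def toy_llm(tokens: List[str]) -> str:
    script = ["2", "+", "2", "=", "4", "<eos>"]
    return script[min(len(tokens), len(script) - 1)]


def toy_router(tokens: List[str]) -> float:
    # Distrust the SLM only for the token right after "=".
    return 0.1 if tokens and tokens[-1] == "=" else 0.9


generated, picks = routed_decode(["2", "+", "2", "="], toy_slm, toy_llm, toy_router)
print(generated, picks)
```

In this toy run only the token after "=" goes to the large model; the end-of-sequence token is left to the small one. A real system would also keep each model's KV cache in sync, which this sketch omits.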

Problem Statement

Large language models give high-quality text but are costly at decode time. Existing routing methods usually pick one model per query, wasting compute when only a few tokens actually need the LLM. The paper asks: can we route at token granularity to send only critical tokens to the LLM, thereby saving inference cost while keeping output quality?

Main Contribution

Token-level routing framework (CITER) that chooses SLM or LLM per token to reduce inference cost.

Preference-based router training using pairwise labels and a surrogate shortcut to avoid full-rollout reward evaluation.
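
The labeling shortcut can be sketched as a small function. Only the first branch (prefer the SLM when its next token equals the ground truth) is stated in this summary. The second branch (prefer the LLM when only the LLM matches) and the `NEEDS_ROLLOUT` fallback are assumptions added for illustration, and `rollout_compare` is a hypothetical callback standing in for a full-response comparison.

```python
from typing import Callable, Optional

# Hypothetical callback: compares full rollouts that start from each
# candidate token and returns "SLM" or "LLM".
RolloutCompare = Callable[[str, str], str]


def preference_label(
    slm_token: str,
    llm_token: str,
    gold_token: str,
    rollout_compare: Optional[RolloutCompare] = None,
) -> str:
    """Return 'SLM', 'LLM', or 'NEEDS_ROLLOUT' for one token position."""
    if slm_token == gold_token:
        # Shortcut described in the summary: no rollout needed.
        return "SLM"
    if llm_token == gold_token:
        # Assumption: the LLM is preferred when only it matches the ground truth.
        return "LLM"
    if rollout_compare is None:
        # Assumption: neither model matches, so a full rollout would be required.
        return "NEEDS_ROLLOUT"
    return rollout_compare(slm_token, llm_token)


# Toy example: three positions, only the last one would need a rollout.
examples = [("the", "the", "the"), ("5", "4", "4"), ("cat", "dog", "bird")]
labels = [preference_label(s, l, g) for s, l, g in examples]
print(labels)
```

Counting how many positions come back as `NEEDS_ROLLOUT` on a sample of your own data gives a rough sense of how much rollout cost the shortcut avoids.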

Key Findings

CITER reduces inference data-transfer cost vs prior token-level baseline on evaluated benchmarks.

Numbers: Up to 27% fewer data transfers (Qwen family); up to 32% on Llama3.1

Practical Use: Expect roughly 25–30% lower memory-transfer cost when swapping a prior token router for CITER on similar tasks and models; this reduces runtime and hardware pressure where decoding is memory-bound.

Evidence Ref: Sec 3.2 Fig.2; Sec 3.5 Fig.6

Training the router to account for long-term effects materially improves trade-offs.

Numbers: Compared to the ablation (CITER-S), up to 42% cost reduction or 23% accuracy gain on benchmarks

Practical Use: When you train a token router, include rollouts or roll-in logic that captures downstream impact; simpler greedy labels can give much worse cost-accuracy trade-offs.

Evidence Ref: Sec 3.3 Fig.3

Results

Metric: inference_data_transfer
Value: up to 27% reduction
Baseline: Co-LLM (token-level prior work)
Delta: −27% data-transfers (Qwen family)
Split / Dataset: Aggregated across five benchmarks (see Sec 3.2)
Evidence: CITER achieves comparable accuracy with up to 27% fewer inference costs vs Co-LLM
Evidence Ref: Sec 3.2 Fig.2

Metric: Accuracy
Value: up to +17% accuracy
Baseline: Co-LLM
Delta: +17 percentage points (on evaluated benchmarks)
Split / Dataset: Aggregated across five benchmarks (see Sec 3.2)
Evidence: CITER can deliver up to 17% higher accuracy at the same cost vs Co-LLM
Evidence Ref: Sec 3.2

What To Try In 7 Days

Measure per-token KV-cache movement on your decoder and confirm decoding is memory-bound (Sec E).

Add a small router MLP that reads the SLM hidden state, and apply a simple threshold τ to its output to route each token to either your SLM or your LLM.

Use the paper's shortcut: label tokens as SLM-preferred when SLM next-token equals ground truth to avoid full rollouts for most tokens.
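
The router step above can be prototyped in a few lines. This is a minimal sketch under assumptions, not the paper's architecture: the single hidden layer, the ReLU and sigmoid choices, the hand-picked weights, and the convention that a score at or above τ selects the SLM are all hypothetical. In practice the input `h` would be the SLM's hidden state for the current position and the weights would be learned.

```python
import math
from typing import List, Sequence, Tuple

# (w1, b1, w2, b2): hypothetical parameters of a one-hidden-layer MLP.
Params = Tuple[List[List[float]], List[float], List[float], float]


def router_score(h: Sequence[float], params: Params) -> float:
    """Hidden layer with ReLU, then a sigmoid; read as P(SLM is sufficient)."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * x for w, x in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def route(h: Sequence[float], params: Params, tau: float) -> str:
    # Assumed convention: score >= tau keeps the token on the SLM.
    return "SLM" if router_score(h, params) >= tau else "LLM"


# Hand-picked toy weights, for illustration only.
toy_params: Params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, -1.0], 0.0)

easy_state = [2.0, 0.0]   # logit = +2, score is about 0.88
hard_state = [0.0, 2.0]   # logit = -2, score is about 0.12

for tau in (0.5, 0.9):
    print(tau, route(easy_state, toy_params, tau), route(hard_state, toy_params, tau))
```

Raising τ sends more tokens to the LLM (higher cost, typically higher quality); lowering it keeps more tokens on the SLM. Sweeping τ on a held-out set is one way to pick an operating point.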

Agent Features

Memory
uses SLM hidden state input to router; maintains separate KV caches for SLM and LLM
Tool Use
KV-cache reuse; threshold-based routing
Architectures
router MLP
Collaboration
two-model collaboration (SLM + LLM)

Optimization Features

Token Efficiency
route non-critical tokens to SLM; threshold τ to trade quality vs cost
Infra Optimization
optimize for memory-bound decode (KV cache movement)
System Optimization
measure data-transfer amount as cost; deploy SLM and LLM on same device to avoid switch cost
Training Optimization
preference-based policy optimization; iterative preference collection; surrogate reward shortcut (use ground-truth next token)
Inference Optimization
token-level routing; model switching per token; KV-cache reuse to avoid recompute
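
To see why memory movement is a reasonable cost proxy for memory-bound decoding, here is a rough first-order estimate of bytes read per decode step. It is a sketch under assumptions, not the paper's cost definition: it assumes each step reads the model weights once plus the KV cache for the current context, and it ignores the extra work the LLM needs to catch up on tokens the SLM generated. The `ModelShape` fields and the two example shapes are hypothetical, illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    n_params: int            # total parameter count
    n_layers: int            # transformer layers
    n_kv_heads: int          # key/value heads per layer
    head_dim: int            # dimension per head
    bytes_per_elem: int = 2  # e.g. fp16 or bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer and KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def step_bytes(m: ModelShape, context_len: int) -> int:
    # First-order estimate: read all weights once plus the whole KV cache.
    return m.n_params * m.bytes_per_elem + context_len * kv_bytes_per_token(m)


def routed_step_bytes(slm: ModelShape, llm: ModelShape,
                      context_len: int, llm_fraction: float) -> float:
    # Expected bytes per step when a fraction of tokens goes to the LLM.
    return ((1.0 - llm_fraction) * step_bytes(slm, context_len)
            + llm_fraction * step_bytes(llm, context_len))


# Illustrative shapes only; not taken from any model card.
small = ModelShape(n_params=1_500_000_000, n_layers=28, n_kv_heads=2, head_dim=128)
large = ModelShape(n_params=72_000_000_000, n_layers=80, n_kv_heads=8, head_dim=128)

for frac in (0.0, 0.2, 1.0):
    gib = routed_step_bytes(small, large, context_len=1024, llm_fraction=frac) / 2**30
    print(f"LLM fraction {frac:.1f}: about {gib:.1f} GiB read per step")
```

Under these assumptions the estimate is dominated by how often the large model runs, which is why routing fewer tokens to it lowers the cost.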

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH

Risks & Boundaries

Limitations

Requires a reasonably capable SLM; if SLM is very weak, routing gains shrink.

Router training needs LLM calls for preference labels, adding training-time cost.

When Not To Use

When an SLM has much lower token accuracy than tested SLMs.

When you cannot afford LLM calls during router training.

Failure Modes

Router misroutes a critical token to SLM, causing irrecoverable downstream errors.

Preference labeling bias if both models systematically fail on a class of tokens.

Core Entities

Models

Qwen2-1.5B, Qwen2-7B, Qwen2-72B, Llama3.1-8B, Llama3.1-70B

Metrics

Accuracy, data_transformation_amount, latency

Datasets

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH

Benchmarks

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH