Route each token to a small or large model to cut memory movement and speed up LLM decoding

Overview

Decision Snapshot: Ready For Pilot

The method is practical: a small router plus an SLM can cut decoding memory transfer and latency on typical benchmarks. The experiments use public datasets and common model families, but adoption requires integrating KV-cache reuse and router-training infrastructure.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.75

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao

Why It Matters For Business

CITER lowers memory-transfer cost and latency during decoding by routing only the important tokens to the large model. This cuts cloud/GPU bills and improves real-time responsiveness while preserving output quality.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

CITER trains a small router that decides, token by token, whether a cheap small model (SLM) or an expensive large model (LLM) should generate the next token. The router is trained with pairwise preferences and an RL-style objective that rewards keeping accuracy while lowering inference cost (measured as data transfer / KV cache movement). A shortcut uses ground-truth tokens to avoid full rollouts for most tokens, cutting training cost. On five benchmarks, CITER matches LLM quality while reducing memory-transfer-based inference cost by ~27–32% versus prior token-level baselines, and it converges in two training rounds.
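The token-by-token decision described above can be pictured with a minimal sketch. The interfaces here (`NextToken`, `Router`, `routed_decode`) and the toy models are hypothetical and exist only for illustration. In the summary the router reads the SLM hidden state; this sketch simplifies that to a function of the token context, and it assumes the router score is the probability that the LLM is needed.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for illustration (not taken from the paper).
NextToken = Callable[[List[str]], str]  # model: context -> next token
Router = Callable[[List[str]], float]   # router: context -> P(LLM is needed)


def routed_decode(
    prompt: List[str],
    slm: NextToken,
    llm: NextToken,
    router: Router,
    tau: float = 0.5,
    max_new_tokens: int = 16,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Generate tokens, choosing SLM or LLM per token via a thresholded router score."""
    tokens = list(prompt)
    choices: List[str] = []
    for _ in range(max_new_tokens):
        # Assumption: scores at or above tau send the token to the LLM.
        use_llm = router(tokens) >= tau
        next_token = llm(tokens) if use_llm else slm(tokens)
        choices.append("LLM" if use_llm else "SLM")
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[len(prompt):], choices


# Toy stand-ins: the "critical" token is the one right after the word "is".
def toy_slm(ctx: List[str]) -> str:
    return "<eos>" if ctx[-1] == "42" else "the"


def toy_llm(ctx: List[str]) -> str:
    return "42"


def toy_router(ctx: List[str]) -> float:
    return 0.9 if ctx[-1] == "is" else 0.1


generated, choices = routed_decode(["answer", "is"], toy_slm, toy_llm, toy_router)
```

In this toy run only the token after "is" goes to the LLM. Raising `tau` sends fewer tokens to the LLM, which lowers cost at the risk of lower quality.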

Problem Statement

Large language models give high-quality text but are costly at decode time. Existing routing methods usually pick one model per query, wasting compute when only a few tokens actually need the LLM. The paper asks: can we route at token granularity to send only critical tokens to the LLM, thereby saving inference cost while keeping output quality?

Main Contribution

Token-level routing framework (CITER) that chooses SLM or LLM per token to reduce inference cost.

Preference-based router training using pairwise labels and a surrogate shortcut to avoid full-rollout reward evaluation.
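As a rough illustration of how pairwise preference labels can become a training signal, the sketch below uses a logistic (Bradley-Terry style) loss on a single router logit. This is an assumption chosen for clarity: the summary does not give the paper's exact objective, and the function names `preference_loss` and `preference_grad` are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Logistic function, written with tanh to avoid overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def preference_loss(router_logit: float, llm_preferred: bool) -> float:
    """Negative log-likelihood that the router picks the preferred model.

    Assumption: a positive logit means the router favors the LLM.
    """
    signed = router_logit if llm_preferred else -router_logit
    return softplus(-signed)  # equals -log(sigmoid(signed))


def preference_grad(router_logit: float, llm_preferred: bool) -> float:
    """Derivative of preference_loss with respect to the router logit."""
    target = 1.0 if llm_preferred else 0.0
    return sigmoid(router_logit) - target


# One gradient step on a single logit, with an illustrative learning rate.
logit = 0.0
updated_logit = logit - 0.5 * preference_grad(logit, llm_preferred=True)
```

The loss shrinks as the logit moves toward the preferred model, so a gradient step on an LLM-preferred token raises the logit.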

Key Findings

CITER reduces inference data-transfer cost versus the prior token-level baseline (Co-LLM) on the evaluated benchmarks.

Numbers: Up to 27% fewer data transfers (Qwen family); up to 32% on Llama3.1

Practical Use: Expect roughly 25–30% lower memory-transfer cost when swapping a prior token router for CITER on similar tasks and models; this reduces runtime and hardware pressure where decoding is memory-bound.

Evidence Ref: Sec 3.2 Fig.2; Sec 3.5 Fig.6

Training the router to account for long-term effects materially improves trade-offs.

Numbers: Compared to the ablation (CITER-S), up to 42% cost reduction or 23% accuracy gain on the benchmarks

Practical Use: When you train a token router, include rollouts or roll-in logic that captures downstream impact; simpler greedy labels can give much worse cost-accuracy trade-offs.

Evidence Ref: Sec 3.3 Fig.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
inference_data_transfer	up to 27% reduction	Co-LLM (token-level prior work)	−27% data transfers (Qwen family)	Aggregated across five benchmarks (see Sec 3.2)	CITER achieves comparable accuracy at up to 27% lower inference cost than Co-LLM	Sec 3.2 Fig.2
Accuracy	up to +17% accuracy	Co-LLM	+17 percentage points (on evaluated benchmarks)	Aggregated across five benchmarks (see Sec 3.2)	CITER can deliver up to 17% higher accuracy at the same cost vs Co-LLM	Sec 3.2

What To Try In 7 Days

Measure per-token KV-cache movement on your decoder and confirm decoding is memory-bound (Sec E).

Swap in a small router MLP that reads the SLM hidden state and try a simple threshold τ to route tokens to your SLM/LLM.

Use the paper's shortcut: label tokens as SLM-preferred when SLM next-token equals ground truth to avoid full rollouts for most tokens.
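The shortcut in the last step can be sketched as follows. Only the rule "SLM token equals ground truth, so prefer the SLM" comes from the summary. What happens for the remaining tokens is delegated to a `rollout_label` callback, a hypothetical placeholder for whatever full-rollout comparison you implement.

```python
from typing import Callable, List, Sequence, Tuple


def label_with_shortcut(
    slm_tokens: Sequence[str],
    truth_tokens: Sequence[str],
    rollout_label: Callable[[int], str],
) -> Tuple[List[str], int]:
    """Label each position as 'SLM' or 'LLM' preferred, counting expensive rollouts."""
    labels: List[str] = []
    rollouts = 0
    for pos, (pred, gold) in enumerate(zip(slm_tokens, truth_tokens)):
        if pred == gold:
            # Shortcut: the SLM already predicts the ground-truth token, so no rollout is run.
            labels.append("SLM")
        else:
            # Fallback: defer to the (hypothetical) full-rollout comparison.
            labels.append(rollout_label(pos))
            rollouts += 1
    return labels, rollouts


# Toy example: the stub rollout always prefers the LLM.
labels, rollouts = label_with_shortcut(
    ["The", "cat", "sat", "in"],
    ["The", "cat", "sat", "on"],
    rollout_label=lambda pos: "LLM",
)
```

In the toy example three of four positions are labeled without a rollout, which is the source of the training-cost savings.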

Agent Features

Memory

uses SLM hidden state input to router

maintains separate KV caches for SLM and LLM

Tool Use

KV-cache reuse

threshold-based routing

Architectures

router MLP
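A router MLP of this kind can be sketched in a few lines. The layer sizes, weights, and the convention that a high score means "send to the LLM" are all assumptions for illustration; the summary only states that the router is an MLP reading the SLM hidden state and that a threshold τ controls routing.

```python
import math
from typing import List, Sequence, Tuple

# (w1, b1, w2, b2) for a one-hidden-layer MLP; hypothetical layout.
Params = Tuple[List[List[float]], List[float], List[float], float]


def router_prob(hidden_state: Sequence[float], params: Params) -> float:
    """One-hidden-layer MLP: ReLU hidden layer, sigmoid output in [0, 1]."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, hidden_state)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * a for w, a in zip(w2, hidden)) + b2
    return 0.5 * (1.0 + math.tanh(0.5 * logit))  # overflow-safe sigmoid


def route(hidden_state: Sequence[float], params: Params, tau: float = 0.5) -> str:
    """Assumption: scores at or above tau route the token to the LLM."""
    return "LLM" if router_prob(hidden_state, params) >= tau else "SLM"


# Hand-picked illustrative weights for a 2-dimensional "hidden state".
toy_params: Params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [2.0, -2.0], 0.0)
decision_a = route([1.0, 0.0], toy_params)  # logit = 2.0
decision_b = route([0.0, 1.0], toy_params)  # logit = -2.0
```

In practice the input would be the SLM's last-layer hidden state for the current position, and the weights would come from router training rather than being set by hand.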

Collaboration

two-model collaboration (SLM + LLM)

Optimization Features

Token Efficiency

route non-critical tokens to SLM

threshold τ to trade quality vs cost

Infra Optimization

optimize for memory-bound decode (KV cache movement)
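To check whether decoding is memory-bound on your own setup, a back-of-the-envelope roofline estimate is a reasonable first step before profiling. The sketch below uses the standard KV-cache size formula and the common approximation of about 2 FLOPs per parameter per generated token. The model shape and hardware figures are illustrative placeholders, not values from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Size of the KV cache read per decode step (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def decode_intensity(num_params: float, kv_bytes: float,
                     bytes_per_param: int = 2, batch: int = 1) -> float:
    """Approximate FLOPs per byte moved for one decode step.

    Assumes ~2 FLOPs per parameter per token and ignores attention FLOPs.
    """
    flops = 2.0 * num_params * batch
    bytes_moved = num_params * bytes_per_param + kv_bytes
    return flops / bytes_moved


def is_memory_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Memory-bound if arithmetic intensity is below the hardware ridge point."""
    return intensity < peak_flops / mem_bandwidth


# Illustrative numbers only: a 7B-parameter fp16 model at 4096 tokens of context,
# on hypothetical hardware with 300 TFLOP/s compute and 2 TB/s memory bandwidth.
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
intensity = decode_intensity(num_params=7e9, kv_bytes=kv)
memory_bound = is_memory_bound(intensity, peak_flops=300e12, mem_bandwidth=2e12)
```

At batch size 1 the intensity comes out near 1 FLOP per byte, far below the ridge point of such hardware, which is why bytes moved is a sensible cost proxy for decoding. Larger batches raise the intensity, so rerun the estimate with your real serving batch size.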

System Optimization

measure data-transformation amount as cost

deploy SLM and LLM on same device to avoid switch cost

Training Optimization

preference-based policy optimization

iterative preference collection

surrogate reward shortcut (use ground-truth next token)

Inference Optimization

token-level routing

model switching per token

KV-cache reuse to avoid recompute
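One plausible way to reason about KV-cache reuse under per-token switching is to track how many positions each model's cache already covers, so a model only processes the tokens it has not yet seen when it is selected. The bookkeeping below is a hypothetical sketch, not the paper's implementation. For simplicity it ignores that the SLM may run on every step to feed the router.

```python
from typing import Dict, List, Tuple


def tokens_processed(prompt_len: int, choices: List[str]) -> Tuple[int, int]:
    """Count tokens each selected model must process, with and without cache reuse.

    choices[i] names the model ('SLM' or 'LLM') that generates the i-th new token.
    """
    cached: Dict[str, int] = {"SLM": 0, "LLM": 0}  # positions covered per model
    with_reuse = 0
    without_reuse = 0
    for step, model in enumerate(choices):
        seq_len = prompt_len + step  # context length before this token is generated
        # With reuse: only catch up on positions missing from this model's cache.
        with_reuse += seq_len - cached[model]
        cached[model] = seq_len
        # Without reuse: the whole context is recomputed on every step.
        without_reuse += seq_len
    return with_reuse, without_reuse


# Toy trace: a 4-token prompt, then SLM, SLM, LLM, SLM.
reuse_cost, recompute_cost = tokens_processed(4, ["SLM", "SLM", "LLM", "SLM"])
```

The gap between the two counts grows with sequence length, which is why separate per-model caches matter when the router switches models often.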

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH

Risks & Boundaries

Limitations

Requires a reasonably capable SLM; if SLM is very weak, routing gains shrink.

Router training needs LLM calls for preference labels, adding training-time cost.

When Not To Use

When your SLM has much lower token accuracy than the SLMs tested in the paper.

When you cannot afford LLM calls during router training.

Failure Modes

The router misroutes a critical token to the SLM, causing irrecoverable downstream errors.

Preference labeling bias if both models systematically fail on a class of tokens.

Core Entities

Models

Qwen2-1.5B, Qwen2-7B, Qwen2-72B, Llama3.1-8B, Llama3.1-70B

Metrics

Accuracy, data_transformation_amount, latency

Datasets

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH

Benchmarks

Commonsense QA, ARC-Challenge, MMLU-Professional Psychology, GSM8k, MATH
