Overview
The method is practical: a small router plus an SLM can cut decoding memory transfer and latency on typical benchmarks. Experiments use public datasets and common model families, but adopting the method requires integrating KV-cache reuse and router-training infrastructure.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.75
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
CITER routes only the important tokens to the large model, which lowers memory-transfer cost and latency during decoding. That cuts cloud/GPU bills and improves real-time responsiveness while preserving output quality.
Who Should Care
Summary TLDR
CITER trains a small router that decides, token by token, whether a cheap small model (SLM) or an expensive large model (LLM) should generate the next token. The router is trained with pairwise preferences and an RL-style objective that rewards preserving accuracy while lowering inference cost, measured as data transfer (KV-cache movement). A shortcut uses ground-truth tokens to avoid full rollouts for most tokens, cutting training cost. On five benchmarks, CITER matches LLM quality while reducing memory-transfer-based inference cost by ~27–32% versus prior token-level baselines, and it converges in two training rounds.
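The per-token decision described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: `routed_decode`, the toy models, and the toy router are hypothetical names, and the router here scores the token prefix rather than a real SLM hidden state.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]  # hypothetical model interface
RouterFn = Callable[[Sequence[Token]], float]     # returns a score for "send to LLM"


def routed_decode(
    prompt: Sequence[Token],
    slm_next: NextTokenFn,
    llm_next: NextTokenFn,
    router_score: RouterFn,
    tau: float = 0.5,
    max_new_tokens: int = 8,
    eos: Optional[Token] = None,
) -> Tuple[List[Token], List[str]]:
    """Greedy decode where each token is produced by the SLM or the LLM."""
    tokens = list(prompt)
    routes: List[str] = []
    for _ in range(max_new_tokens):
        # Route to the LLM only when the router score reaches the threshold.
        use_llm = router_score(tokens) >= tau
        nxt = llm_next(tokens) if use_llm else slm_next(tokens)
        tokens.append(nxt)
        routes.append("LLM" if use_llm else "SLM")
        if eos is not None and nxt == eos:
            break
    return tokens[len(prompt):], routes


# Toy stand-ins (assumptions for illustration only).
def toy_slm(prefix: Sequence[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_llm(prefix: Sequence[Token]) -> Token:
    return (prefix[-1] + 2) % 10


def toy_router(prefix: Sequence[Token]) -> float:
    # Pretend tokens divisible by 3 mark "critical" positions.
    return 0.9 if prefix[-1] % 3 == 0 else 0.1


generated, routes = routed_decode([3], toy_slm, toy_llm, toy_router, tau=0.5, max_new_tokens=4)
llm_fraction = routes.count("LLM") / len(routes)
```

In this toy run half of the tokens go to the LLM. Raising `tau` sends fewer tokens to the LLM, which is the knob that trades quality against cost.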
Problem Statement
Large language models give high-quality text but are costly at decode time. Existing routing methods usually pick one model per query, wasting compute when only a few tokens actually need the LLM. The paper asks: can we route at token granularity to send only critical tokens to the LLM, thereby saving inference cost while keeping output quality?
Main Contribution
Token-level routing framework (CITER) that chooses SLM or LLM per token to reduce inference cost.
Preference-based router training using pairwise labels and a surrogate shortcut to avoid full-rollout reward evaluation.
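One simple way to consume per-token preference labels is to fit a binary classifier on them. The sketch below does that with plain logistic regression and a binary cross-entropy gradient. This loss is an assumption chosen for brevity; the paper's RL-style objective may differ. The single standardized feature and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def router_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that the LLM is preferred for this token."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_router(
    features: Sequence[Sequence[float]],
    prefers_llm: Sequence[int],  # 1 = LLM preferred, 0 = SLM preferred
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on binary cross-entropy."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, prefers_llm):
            err = router_score(w, b, x) - y  # d(BCE)/d(logit)
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: one standardized "SLM uncertainty" feature per token (hypothetical).
features = [[-1.0], [-0.8], [0.8], [1.0]]
prefers_llm = [0, 0, 1, 1]
w, b = train_router(features, prefers_llm)
```

After fitting, `router_score(w, b, x)` can be compared against a threshold at decode time. A real router would read model hidden states rather than a single hand-made feature.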
Key Findings
CITER reduces inference data-transfer cost versus the prior token-level baseline (Co-LLM) on the evaluated benchmarks.
Training the router to account for the long-term effects of each routing decision materially improves the accuracy-cost trade-off.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Inference data transfer | up to 27% reduction | Co-LLM (token-level prior work) | −27% data transfer (Qwen family) | Aggregated across five benchmarks (see Sec 3.2) | CITER achieves comparable accuracy at up to 27% lower inference cost vs Co-LLM | Sec 3.2 Fig.2 |
| Accuracy | up to +17% accuracy | Co-LLM | +17% accuracy at the same cost (on evaluated benchmarks) | Aggregated across five benchmarks (see Sec 3.2) | CITER can deliver up to 17% higher accuracy at the same cost vs Co-LLM | Sec 3.2 |
What To Try In 7 Days
Measure per-token KV-cache movement on your decoder and confirm decoding is memory-bound (Sec E).
Add a small router MLP that reads the SLM hidden state, and try a simple threshold τ to route each token to your SLM or LLM.
Use the paper's shortcut: label a token as SLM-preferred when the SLM's next-token prediction equals the ground truth, which avoids full rollouts for most tokens.
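Two of the steps above reduce to a few lines of Python. The sketch below estimates per-token KV-cache size with the standard formula (keys plus values, per layer) and applies the ground-truth shortcut to a batch of positions. Function names and the model configuration are hypothetical, and what happens to the positions left unlabeled is not described in this summary.

```python
from typing import List, Optional, Sequence


def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Bytes cached per token: one K and one V tensor in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def shortcut_labels(
    slm_pred: Sequence[int],
    ground_truth: Sequence[int],
) -> List[Optional[str]]:
    """Label a position 'SLM' when the SLM's next token matches ground truth.

    Positions returned as None still need a fuller comparison
    (for example a rollout with each model).
    """
    if len(slm_pred) != len(ground_truth):
        raise ValueError("slm_pred and ground_truth must have equal length")
    return ["SLM" if p == g else None for p, g in zip(slm_pred, ground_truth)]


# Illustrative configuration, not taken from the paper.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

labels = shortcut_labels(slm_pred=[5, 7, 2, 9], ground_truth=[5, 1, 2, 9])
needs_rollout = [i for i, lab in enumerate(labels) if lab is None]
```

With these toy inputs only one of four positions needs a rollout, which shows why the shortcut reduces labeling cost when the SLM is often right.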
Risks & Boundaries
Limitations
Requires a reasonably capable SLM; if the SLM is very weak, routing gains shrink.
Router training needs LLM calls for preference labels, adding training-time cost.
When Not To Use
When your SLM has much lower token-level accuracy than the SLMs tested in the paper.
When you cannot afford LLM calls during router training.
Failure Modes
Router misroutes a critical token to SLM, causing irrecoverable downstream errors.
Preference labeling bias if both models systematically fail on a class of tokens.

