Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
CITER lowers memory-transfer cost and latency during decoding by routing only the important tokens to the large model. This cuts cloud/GPU spend and improves real-time responsiveness while preserving output quality.
Summary TLDR
CITER trains a small router that decides, token by token, whether a cheap small model (SLM) or an expensive large model (LLM) should generate the next token. The router is trained with pairwise preferences and an RL-style objective that rewards keeping accuracy while lowering inference cost (measured as data transfer / KV cache movement). A shortcut uses ground-truth tokens to avoid full rollouts for most tokens, cutting training cost. On five benchmarks, CITER matches LLM quality while reducing memory-transfer-based inference cost by ~27–32% versus prior token-level baselines, and it converges in two training rounds.
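The token-by-token decision described above can be sketched as a plain decode loop. This is a minimal illustration, not the paper's implementation: `routed_decode`, the toy models, and the convention that the router returns an estimated probability that the SLM is good enough (with the SLM used when that score is at least `tau`) are all assumptions made for the example. A real router would read the SLM hidden state rather than the raw context.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: each "model" maps a token context to one next token.
NextToken = Callable[[List[str]], str]
# Hypothetical router: returns an estimated probability that the SLM suffices.
RouterScore = Callable[[List[str]], float]


def routed_decode(
    prompt: List[str],
    slm: NextToken,
    llm: NextToken,
    router: RouterScore,
    tau: float,
    max_new_tokens: int,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Generate tokens one at a time, picking SLM or LLM per token."""
    tokens = list(prompt)
    choices: List[str] = []
    for _ in range(max_new_tokens):
        # Assumed convention: a score at or above tau keeps the token on the SLM.
        use_slm = router(tokens) >= tau
        model = slm if use_slm else llm
        next_token = model(tokens)
        choices.append("slm" if use_slm else "llm")
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[len(prompt):], choices


# Toy components, purely illustrative.
def toy_slm(context: List[str]) -> str:
    return "easy"


def toy_llm(context: List[str]) -> str:
    return "hard"


def toy_router(context: List[str]) -> float:
    # Pretend every other position is a "critical" token.
    return 0.9 if len(context) % 2 == 0 else 0.1


generated, choices = routed_decode(
    ["a", "b"], toy_slm, toy_llm, toy_router, tau=0.5, max_new_tokens=4
)
```

Raising `tau` in this sketch sends more tokens to the LLM, and lowering it keeps more on the SLM, which is the quality-versus-cost knob the summary refers to.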
Problem Statement
Large language models give high-quality text but are costly at decode time. Existing routing methods usually pick one model per query, wasting compute when only a few tokens actually need the LLM. The paper asks: can we route at token granularity to send only critical tokens to the LLM, thereby saving inference cost while keeping output quality?
Main Contribution
Token-level routing framework (CITER) that chooses SLM or LLM per token to reduce inference cost.
Preference-based router training using pairwise labels and a surrogate shortcut to avoid full-rollout reward evaluation.
Iterative training and KV-cache reuse to reduce switching overhead and speed convergence.
Empirical evaluation on 5 benchmarks showing sizable cost/latency gains over query-level and prior token-level baselines.
Key Findings
CITER reduces inference data-transfer cost by roughly 27–32% versus the prior token-level baseline on the evaluated benchmarks.
Training the router to account for the long-term effects of each routing decision improves the accuracy-cost trade-off.
Router training converges in two rounds of iterative preference collection.
Latency vs accuracy trade-offs: CITER hits low-latency points other methods do not.
Results
inference_data_transfer
Accuracy
cost_reduction_vs_ablation
Accuracy
compatibility_cost_reduction
Who Should Care
What To Try In 7 Days
Measure per-token KV-cache movement on your decoder and confirm decoding is memory-bound (Sec E).
Add a small router MLP that reads the SLM hidden state, and use a simple threshold τ to route each token to your SLM or LLM.
Use the paper's shortcut: label a token as SLM-preferred when the SLM's next-token prediction equals the ground-truth token, which avoids full rollouts for most tokens.
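The shortcut in the last step can be sketched as a simple comparison against ground truth. The function name `shortcut_labels` and the use of `None` for tokens the shortcut cannot resolve are assumptions for this example. How unresolved tokens are labeled (for instance by a fuller comparison of SLM and LLM continuations) is left out here.

```python
from typing import List, Optional


def shortcut_labels(
    slm_predictions: List[str], ground_truth: List[str]
) -> List[Optional[str]]:
    """Label a position "slm" when the SLM's next token matches ground truth.

    Positions where the SLM is wrong are returned as None, meaning the
    shortcut does not decide them and another labeling step is needed.
    """
    if len(slm_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align by position")
    return [
        "slm" if predicted == truth else None
        for predicted, truth in zip(slm_predictions, ground_truth)
    ]


# Toy example: the SLM gets the first two tokens right and the last one wrong.
labels = shortcut_labels(["2", "+", "3"], ["2", "+", "2"])
resolved_fraction = sum(label is not None for label in labels) / len(labels)
```

The share of positions resolved this way (`resolved_fraction` above) gives a quick estimate of how many tokens can skip the expensive labeling path on your own data.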
Agent Features
Memory
- uses SLM hidden state input to router
- maintains separate KV caches for SLM and LLM
Tool Use
- KV-cache reuse
- threshold-based routing
Architectures
- router MLP
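As a rough picture of what a router MLP computes, the sketch below runs one hidden layer with ReLU followed by a sigmoid over a hidden-state vector. The layer count, activation, weights, and the reading of the output as "probability that the SLM suffices" are all assumptions for illustration. The source only states that the router is a small MLP fed with the SLM hidden state.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def router_mlp(
    hidden_state: Sequence[float],
    w1: Sequence[Sequence[float]],
    b1: Sequence[float],
    w2: Sequence[float],
    b2: float,
) -> float:
    """One-hidden-layer MLP: ReLU(W1 h + b1) -> sigmoid(w2 . a + b2)."""
    activations: List[float] = [
        max(0.0, sum(w * x for w, x in zip(row, hidden_state)) + bias)
        for row, bias in zip(w1, b1)
    ]
    logit = sum(w * a for w, a in zip(w2, activations)) + b2
    return sigmoid(logit)


# Hypothetical 2-dimensional hidden state and hand-picked weights.
score = router_mlp(
    hidden_state=[1.0, -1.0],
    w1=[[1.0, 0.0], [0.0, 1.0]],
    b1=[0.0, 0.0],
    w2=[2.0, 2.0],
    b2=-1.0,
)
```

In practice the same forward pass would be a couple of lines in a tensor library. The pure-Python version is used here only to keep the example dependency-free.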
Collaboration
- two-model collaboration (SLM + LLM)
Optimization Features
Token Efficiency
- route non-critical tokens to SLM
- threshold τ to trade quality vs cost
Infra Optimization
- optimize for memory-bound decode (KV cache movement)
System Optimization
- measure data-transfer amount as cost
- deploy SLM and LLM on same device to avoid switch cost
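A back-of-the-envelope way to measure cost as data moved is to count, for each decoded token, the bytes of weights plus KV cache that must be read. The sketch below uses that common memory-bound approximation at batch size 1. It is an assumption, not the paper's exact cost definition, and both `ModelSpec` instances are made-up toy sizes rather than real model configurations.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    n_params: int
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # e.g. 16-bit weights and cache entries


def bytes_moved_per_token(spec: ModelSpec, context_len: int) -> int:
    """Approximate bytes read to decode one token: weights + KV cache."""
    weight_bytes = spec.n_params * spec.bytes_per_value
    # Keys and values (factor 2) for every layer, KV head, and cached position.
    kv_bytes = (
        2 * spec.n_layers * spec.n_kv_heads * spec.head_dim
        * context_len * spec.bytes_per_value
    )
    return weight_bytes + kv_bytes


def routed_cost(
    slm: ModelSpec, llm: ModelSpec, choices: List[str], prompt_len: int
) -> int:
    """Sum per-token bytes moved for a sequence of "slm"/"llm" choices."""
    total = 0
    for step, choice in enumerate(choices):
        spec = slm if choice == "slm" else llm
        total += bytes_moved_per_token(spec, prompt_len + step)
    return total


# Toy sizes, purely illustrative.
small = ModelSpec(n_params=1_000, n_layers=2, n_kv_heads=1, head_dim=4)
large = ModelSpec(n_params=10_000, n_layers=4, n_kv_heads=2, head_dim=4)

mixed = routed_cost(small, large, ["slm", "llm"], prompt_len=10)
all_llm = routed_cost(large, large, ["llm", "llm"], prompt_len=10)
```

This estimate ignores any catch-up work a model does after being skipped for several tokens, so treat it as a lower bound when switches are frequent.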
Training Optimization
- preference-based policy optimization
- iterative preference collection
- surrogate reward shortcut (use ground-truth next token)
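One common way to fit a router to pairwise preference labels is a binary cross-entropy loss on its predicted probability of preferring the SLM. The sketch below shows that loss only as an illustration. The paper's actual objective may differ, and the function name and label strings are assumptions.

```python
import math
from typing import List


def preference_loss(
    p_slm: List[float], labels: List[str], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between router outputs and preferences.

    p_slm[i] is the router's probability that the SLM is preferred at
    position i; labels[i] is "slm" or "llm".
    """
    if len(p_slm) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty inputs")
    total = 0.0
    for p, label in zip(p_slm, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(p) if label == "slm" else -math.log(1.0 - p)
    return total / len(labels)


# An undecided router pays log(2) per token; a confident, correct one pays less.
undecided = preference_loss([0.5, 0.5], ["slm", "llm"])
confident = preference_loss([0.9, 0.1], ["slm", "llm"])
```

In an iterative setup, labels would be re-collected with the current router in the loop and the loss minimized again on the refreshed set.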
Inference Optimization
- token-level routing
- model switching per token
- KV-cache reuse to avoid recompute
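To see why keeping a separate KV cache per model helps when switching per token, the toy simulation below counts how many sequence positions each model has to process with and without reuse. It is a counting model only, with hypothetical function names. It does not touch real attention caches.

```python
from typing import Dict, List


def positions_with_reuse(prompt_len: int, choices: List[str]) -> Dict[str, int]:
    """Each model keeps its cache; on its turn it only fills missing positions."""
    cached = {"slm": 0, "llm": 0}
    computed = {"slm": 0, "llm": 0}
    seq_len = prompt_len
    for choice in choices:
        computed[choice] += seq_len - cached[choice]
        cached[choice] = seq_len
        seq_len += 1  # the chosen model emits one new token
    return computed


def positions_without_reuse(prompt_len: int, choices: List[str]) -> Dict[str, int]:
    """Baseline: a model's cache is discarded whenever routing switches away."""
    computed = {"slm": 0, "llm": 0}
    seq_len = prompt_len
    previous = None
    cached = 0
    for choice in choices:
        if choice != previous:
            cached = 0  # switching forces a full recompute of the sequence
        computed[choice] += seq_len - cached
        cached = seq_len
        previous = choice
        seq_len += 1
    return computed


route = ["slm", "slm", "llm", "slm"]
with_reuse = positions_with_reuse(prompt_len=3, choices=route)
without_reuse = positions_without_reuse(prompt_len=3, choices=route)
```

With reuse, a model returning after a switch only processes the tokens generated while it was idle, so the gap between the two counts grows with sequence length and switch frequency.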
Reproducibility
Data Urls
- Commonsense QA
- ARC-Challenge
- MMLU-Professional Psychology
- GSM8k
- MATH
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a reasonably capable SLM; if SLM is very weak, routing gains shrink.
- Router training needs LLM calls for preference labels, adding training-time cost.
- Shortcut relies on ground-truth next-token checks; real-world unlabeled deployment needs a different labeling strategy.
- Evaluation focuses on academic QA/math benchmarks; results on open-ended generation may differ.
When Not To Use
- When your SLM has much lower token-level accuracy than the SLMs tested in the paper.
- When you cannot afford LLM calls during router training.
- When your workload is fully open-ended and lacks ground-truth tokens for shortcut labeling.
Failure Modes
- Router misroutes a critical token to SLM, causing irrecoverable downstream errors.
- Preference labeling bias if both models systematically fail on a class of tokens.
- Reward hacking or overfitting the router to dataset-specific token patterns.
Core Entities
Models
- Qwen2-1.5B
- Qwen2-7B
- Qwen2-72B
- Llama3.1-8B
- Llama3.1-70B
Metrics
- Accuracy
- data_transformation_amount
- latency
Datasets
- Commonsense QA
- ARC-Challenge
- MMLU-Professional Psychology
- GSM8k
- MATH
Benchmarks
- Commonsense QA
- ARC-Challenge
- MMLU-Professional Psychology
- GSM8k
- MATH

