Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cut API and compute bills by roughly 20–60% (the range reported on MMLU and PopQA) by asking a small model whether it 'knows' the answer and how confident it is in its chosen token, then escalating only the uncertain queries to a bigger model.
Summary TLDR
The paper introduces a practical routing method that asks a small model whether it 'knows' the answer (P(IK)) and how confident it is in its chosen answer token (P(T)). If either score is low, the query is escalated to a larger model. On MMLU this saves about 20–40% compute versus always using the largest model while keeping accuracy nearly the same (for example, the 8B→70B cascade reaches 83.22% versus 83.57% for 70B alone). With GPT-4o as the final stage, a 70B→GPT-4o cascade cut token use by ≈60% with comparable accuracy. P(T) needs only token probabilities; P(IK) is a small classifier built from hidden states.
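To make the P(T) signal concrete, here is a minimal sketch of how a token-probability confidence could be computed for a multiple-choice answer. It assumes you can read log-probabilities for the candidate option tokens from the small model; the function name `token_confidence`, the example values, and the choice to renormalize over the options are illustrative assumptions, not the paper's exact definition.

```python
import math

def token_confidence(option_logprobs):
    """Return (best_option, P(T)) from log-probabilities of candidate answer tokens.

    Assumption: probabilities are renormalized over the candidate options only.
    The paper may instead use the raw probability of the chosen token.
    """
    # Subtract the max log-prob before exponentiating for numerical stability.
    m = max(option_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical log-probabilities for the four option tokens of one question.
logprobs = {"A": -0.05, "B": -3.5, "C": -4.2, "D": -5.0}
answer, p_t = token_confidence(logprobs)
print(answer, round(p_t, 3))
```

A P(T) this close to 1 would normally keep the query on the small model; a flatter distribution over the options would push it toward escalation.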
Problem Statement
Large LLMs give better answers but cost a lot. Running the biggest model for every query is expensive. We need a cheap, reliable way to decide when a small model is good enough and when to call a bigger model.
Main Contribution
A practical two-signal routing rule using P(T) (token-prob confidence) and P(IK) (a classifier on hidden states) to decide escalation between models.
Empirical demonstration on MMLU and PopQA that cascades (e.g., 8B→70B) cut compute or token cost by 20–60% while keeping accuracy near that of the largest model.
Ablations and OOD tests (GPQA) showing P(IK) improves stability and that P(T)-only routing still works when internal states are unavailable.
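The two-signal rule described above can be sketched as a small function: escalate when either P(IK) or P(T) falls below its threshold, and fall back to P(T) alone when hidden states are unavailable. The names `route` and `RoutingDecision` and the default P(IK) threshold of 0.5 are assumptions made for illustration; the 0.9 default for P(T) follows the threshold suggested elsewhere in this summary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoutingDecision:
    escalate: bool
    reason: str

def route(p_ik: Optional[float], p_t: float,
          ik_threshold: float = 0.5, t_threshold: float = 0.9) -> RoutingDecision:
    """Decide whether to keep the small model's answer or escalate.

    p_ik may be None when the model is a black box (P(T)-only routing).
    Both thresholds are illustrative and should be tuned on held-out data.
    """
    if p_ik is not None and p_ik < ik_threshold:
        return RoutingDecision(True, "low P(IK)")
    if p_t < t_threshold:
        return RoutingDecision(True, "low P(T)")
    return RoutingDecision(False, "confident")

print(route(0.8, 0.95))   # both signals high: keep the small model's answer
print(route(0.3, 0.95))   # the probe says the model does not know: escalate
print(route(None, 0.6))   # black-box case with an uncertain token: escalate
```

In a longer cascade the same check would run at each stage, with the final model always answering.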
Key Findings
8B→70B cascade matches 70B accuracy with much less compute
70B→GPT-4o routing reduces API token usage by about 60%
Training a P(IK) classifier improves routing stability
Method generalizes but is conservative on hard OOD data
Open-ended QA shows cost-accuracy trade-offs and reduced hallucination with cascades
Results
Accuracy
Compute reduction (8B→70B)
GPT-4o token savings (70B→GPT-4o)
P(IK) classifier performance (8B)
Who Should Care
What To Try In 7 Days
Measure P(T) with token probabilities for your multiple-choice or formatted responses and set a threshold around 0.9.
Train a small P(IK) classifier using hidden states on held-out examples to stabilize routing decisions.
Pilot one cascade (e.g., 8B→70B or 70B→GPT-4o) and compare token/compute use and end-to-end latency against always calling the large model.
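As a starting point for the P(IK) step, the sketch below trains a logistic-regression probe in plain Python. It is only a stand-in: the feature vectors are synthetic placeholders for real hidden states, the labels mark whether the small model answered correctly, and the probe architecture, learning rate, and helper names (`train_probe`, `predict_p_ik`) are assumptions rather than the paper's setup.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    w, b, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_p_ik(w, b, x):
    """Estimated probability that the small model 'knows' the answer."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic stand-in for (hidden state, was-the-answer-correct) pairs.
rng = random.Random(0)
features, labels = [], []
for _ in range(200):
    y = rng.randint(0, 1)
    center = 1.0 if y else -1.0
    features.append([rng.gauss(center, 0.5) for _ in range(4)])
    labels.append(y)

# Train on the first 150 examples, evaluate on the remaining 50.
w, b = train_probe(features[:150], labels[:150])
held_out_acc = sum(
    (predict_p_ik(w, b, x) >= 0.5) == bool(y)
    for x, y in zip(features[150:], labels[150:])
) / 50
print(held_out_acc)
```

With real hidden states the classes will be far less separable than in this toy data, so measure the probe on held-out and out-of-domain examples before trusting it for routing.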
Optimization Features
Token Efficiency
- Token usage reduction via cascades
Inference Optimization
- Model Cascades
- Model Routing
- Efficient Inference
- Token Efficiency
- Compute Cost Estimation
Reproducibility
Data Urls
- MMLU
- GPQA
- PopQA
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- P(IK) requires labeled examples and may not generalize to very different domains.
- Experiments focus on NLU and multiple-choice; generative tasks need special handling and extra evaluation.
- Cascading can add latency for queries that escalate through multiple models.
When Not To Use
- Real-time systems that cannot accept extra latency from escalations.
- Domains where you cannot obtain training data for a reliable P(IK) classifier and black-box signals are weak.
- Workflows where token pricing or latency makes multi-call cascades more expensive than single large-model calls.
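Whether a cascade pays off can be checked with simple arithmetic. The sketch below assumes every query pays for the small model and escalated queries additionally pay for the large one, so a two-stage cascade is cheaper only while the escalation rate stays below 1 - cost_small / cost_large. The cost units (8 and 70, loosely proportional to parameter counts) and the function names are illustrative assumptions, and latency is ignored.

```python
def cascade_cost(cost_small, cost_large, escalation_rate):
    """Expected per-query cost of a two-stage cascade.

    Assumes every query runs the small model, and a fraction
    `escalation_rate` of queries also runs the large model.
    """
    return cost_small + escalation_rate * cost_large

def break_even_escalation_rate(cost_small, cost_large):
    """Escalation rate at which the cascade costs the same as always using the large model."""
    return 1.0 - cost_small / cost_large

# Hypothetical relative per-query costs for a small and a large model.
small, large = 8.0, 70.0
rate = break_even_escalation_rate(small, large)
cost_at_half = cascade_cost(small, large, 0.5)
saving_at_half = 1.0 - cost_at_half / large
print(round(rate, 3), cost_at_half, round(saving_at_half, 3))
```

When the price gap between the two models is narrow, the break-even rate drops quickly, which is the situation the last bullet above warns about.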
Failure Modes
- The small model is overconfident and keeps wrong answers instead of escalating them.
- P(IK) classifier is conservative on OOD data, routing many queries to large models and reducing savings.
- Error accumulation in long cascades (e.g., 3B→8B→70B) causes larger performance drops.
Core Entities
Models
- Llama-3.2-3B-Instruct
- Meta-Llama-3.1-8B-Instruct
- Llama-3.3-70B-Instruct
- Qwen3-4B
- Qwen3-8B
- Qwen3-32B
- GPT-4o
Metrics
- Accuracy
- Reduced CC (GFLOPs)
- Tokens consumed
- Hallucination rate
- Macro-F1
- AUROC (P(IK) classifier)
Datasets
- MMLU
- GPQA
- PopQA
Benchmarks
- MMLU
- GPQA
- PopQA

