Overview
The method is simple to implement: P(T) needs only token probabilities, and P(IK) is a small classifier trained on hidden states. It reduced compute and token cost in the reported experiments, but it requires per-domain validation and may add latency for escalated queries.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
You can cut API and compute bills substantially by asking a small model if it 'knows' and how confident its token choice is, then only escalate uncertain queries to a bigger model.
Summary TLDR
The paper introduces a practical routing method that asks a small model two things: whether it 'knows' the answer (P(IK)) and how confident it is in its chosen answer token (P(T)). If either score is low, the query is escalated to a larger model. On MMLU this saves about 20–40% of compute versus always using the largest model while keeping accuracy nearly the same (for example, the 8B→70B cascade reaches 83.22% versus 83.57% for 70B alone). With GPT-4o as the large model, a 70B→GPT-4o cascade cut token use by ≈60% with comparable accuracy. P(T) needs only token probabilities; P(IK) is a small classifier built from hidden states.
Problem Statement
Large LLMs give better answers but cost a lot. Running the biggest model for every query is expensive. We need a cheap, reliable way to decide when a small model is good enough and when to call a bigger model.
Main Contribution
A practical two-signal routing rule using P(T) (token-prob confidence) and P(IK) (a classifier on hidden states) to decide escalation between models.
Empirical demonstration on MMLU and PopQA that cascades (e.g., 8B→70B) cut compute or token cost by 20–60% while keeping accuracy near that of the largest model.
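The two-signal rule can be sketched in a few lines. This is a minimal illustration, not the paper's code: `SmallModelOutput`, `should_escalate`, `cascade`, and the P(IK) threshold of 0.5 are hypothetical names and values chosen here, and the two models are passed in as plain callables. The P(T) default of 0.9 follows the threshold suggested later in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class SmallModelOutput:
    """What the small model is assumed to return for one query."""
    answer: str
    p_t: float   # confidence in the chosen answer token
    p_ik: float  # classifier estimate that the model 'knows' the answer


def should_escalate(p_t: float, p_ik: float,
                    t_threshold: float = 0.9,
                    ik_threshold: float = 0.5) -> bool:
    """Escalate when EITHER signal falls below its threshold."""
    return p_t < t_threshold or p_ik < ik_threshold


def cascade(query: str,
            small_model: Callable[[str], SmallModelOutput],
            large_model: Callable[[str], str],
            t_threshold: float = 0.9,
            ik_threshold: float = 0.5) -> Tuple[str, str]:
    """Return (answer, which_model_answered) for a two-stage cascade."""
    out = small_model(query)
    if should_escalate(out.p_t, out.p_ik, t_threshold, ik_threshold):
        return large_model(query), "large"
    return out.answer, "small"
```

Both thresholds would need tuning on held-out data for each domain; the defaults above are placeholders.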
Key Findings
8B→70B cascade matches 70B accuracy with much less compute
70B→GPT-4o routing reduces API token usage by about 60%
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 8B→70B 83.22% | 70B 83.57% | -0.35 pp | MMLU | Table 1 shows 8B→70B 83.22% vs 70B 83.57%, with computational cost reduced by 36.46%. | Table 1 |
| Compute reduction (8B→70B) | 36.46% lower computational cost | 70B compute | n/a | MMLU | Table 1 reports a 36.46% reduction in computational cost for 8B→70B. | Table 1 |
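To reason about compute-reduction figures like the one in the table, a simplified cost model can help. The sketch below assumes the small model runs on every query and the large model runs only on escalated ones, with all costs expressed relative to one large-model call. This is an assumption made here for illustration, not the paper's accounting, and the function names and example numbers are hypothetical.

```python
def cascade_relative_cost(cost_ratio: float, escalation_rate: float) -> float:
    """Expected cascade cost per query, relative to always calling the large model.

    cost_ratio:      cost of one small-model call / cost of one large-model call
    escalation_rate: fraction of queries forwarded to the large model
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    # Every query pays for the small model; escalated ones also pay for the large one.
    return cost_ratio + escalation_rate


def cascade_savings(cost_ratio: float, escalation_rate: float) -> float:
    """Fractional saving versus the large-model-only baseline (negative = more expensive)."""
    return 1.0 - cascade_relative_cost(cost_ratio, escalation_rate)


def break_even_escalation_rate(cost_ratio: float) -> float:
    """Escalation rate above which the cascade costs more than the baseline."""
    return 1.0 - cost_ratio


# Hypothetical numbers: small model at 10% of the large model's cost, half of queries escalated.
print(round(cascade_savings(0.1, 0.5), 4))
```

Under this model, savings shrink linearly as the escalation rate grows, which is why a conservative router erodes the benefit.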
What To Try In 7 Days
Measure P(T) with token probabilities for your multiple-choice or formatted responses and set a threshold around 0.9.
Train a small P(IK) classifier using hidden states on held-out examples to stabilize routing decisions.
Pilot one cascade (e.g., 8B→70B or 70B→GPT-4o) and compare token/compute use and end-to-end latency against always-calling the large model.
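The first two steps above can be prototyped without any ML framework. The sketch below is a hedged illustration, not the paper's implementation: `answer_confidence` assumes you already have log-probabilities for each candidate answer token and renormalizes over those options only, and `train_probe` is a plain logistic-regression probe standing in for the P(IK) classifier. The toy two-dimensional vectors are made-up stand-ins for real hidden states, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def answer_confidence(option_logprobs: Dict[str, float]) -> Tuple[str, float]:
    """Return (best_option, P(T)) by softmax-normalizing over the answer options."""
    peak = max(option_logprobs.values())  # subtract the max for numerical stability
    exps = {opt: math.exp(lp - peak) for opt, lp in option_logprobs.items()}
    total = sum(exps.values())
    probs = {opt: e / total for opt, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features: Sequence[Sequence[float]], labels: Sequence[int],
                lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with plain SGD.

    labels: 1 if the small model answered the example correctly, else 0.
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            grad = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def p_ik(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probe estimate that the model 'knows' the answer for hidden state x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy usage with made-up numbers.
best, p_t = answer_confidence({"A": -0.05, "B": -3.5, "C": -4.0, "D": -5.0})
w, b = train_probe([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]], [1, 1, 0, 0])
print(best, p_t >= 0.9, p_ik(w, b, [1.0, 0.1]) > 0.5)
```

In practice the probe's inputs would be hidden-state vectors from the small model, and a library classifier would likely replace the hand-written SGD loop.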
Optimization Features
Token Efficiency
Inference Optimization
Risks & Boundaries
Limitations
P(IK) requires labeled examples and may not generalize to very different domains.
Experiments focus on NLU and multiple-choice; generative tasks need special handling and extra evaluation.
When Not To Use
Real-time systems that cannot accept extra latency from escalations.
Domains where you cannot obtain training data for a reliable P(IK) classifier and black-box signals are weak.
Failure Modes
The small model is overconfident and keeps wrong answers instead of escalating them.
The P(IK) classifier is conservative on out-of-distribution (OOD) data, routing many queries to the large model and reducing savings.
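Both failure modes can be watched for on a labeled validation set. The following sketch is an illustration under assumptions made here, not a procedure from the paper: it reports the escalation rate (which rises when the router is too conservative) and the error rate among answers the small model kept (which rises when the small model is overconfident). `routing_report` and the record layout are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# One record per validation query: (p_t, p_ik, small_model_was_correct)
Record = Tuple[float, float, bool]


def routing_report(records: Iterable[Record],
                   t_threshold: float = 0.9,
                   ik_threshold: float = 0.5) -> Dict[str, float]:
    """Summarize how a threshold pair behaves on labeled validation data."""
    total = kept = kept_wrong = 0
    for p_t, p_ik, correct in records:
        total += 1
        escalate = p_t < t_threshold or p_ik < ik_threshold
        if not escalate:
            kept += 1
            if not correct:
                kept_wrong += 1
    if total == 0:
        raise ValueError("records must not be empty")
    return {
        # High values mean the cascade saves little.
        "escalation_rate": (total - kept) / total,
        # High values mean wrong answers slip through without escalation.
        "retained_error_rate": kept_wrong / kept if kept else 0.0,
    }


# Made-up validation records for illustration.
report = routing_report([
    (0.95, 0.9, True),
    (0.97, 0.8, False),  # confident but wrong: retained error
    (0.50, 0.9, True),   # low P(T): escalated
    (0.99, 0.3, False),  # low P(IK): escalated
])
print(report)
```

Running this per domain, and again after any distribution shift, gives an early warning before savings or accuracy degrade in production.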

