Use token-level and hidden-state confidence to route queries to smaller models and cut inference cost with little accuracy loss

February 25, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple to implement: P(T) needs only token probabilities, and P(IK) is a small classifier on hidden states. It reduced compute and token cost in the reported experiments, but it requires per-domain validation and may add latency for escalated queries.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen


Why It Matters For Business

You can cut API and compute bills substantially by asking a small model if it 'knows' and how confident its token choice is, then only escalate uncertain queries to a bigger model.

Who Should Care

Summary TLDR

The paper introduces a practical routing method that asks a small model if it 'knows' (P(IK)) and how confident its chosen answer token is (P(T)). If either score is low, the query is escalated to a larger model. On MMLU this saves about 20–40% compute versus always using the largest model while keeping accuracy nearly the same (example: 8B→70B 83.22% vs 70B 83.57%). Applied to GPT-4o, a 70B→GPT-4o cascade cut token use by ≈60% with comparable accuracy. P(T) needs only token probabilities; P(IK) is a small classifier built from hidden states.
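As a rough illustration of the P(T) signal described above, the sketch below turns per-option logits for a multiple-choice question into a confidence score. It assumes P(T) is the softmax probability of the top answer option, renormalised over the candidate options only; the summary does not spell out the exact definition, and the function name and logit values are hypothetical.

```python
import math

def token_confidence(option_logits):
    """Return (best_option, p_t) from a dict mapping answer option -> logit.

    Assumption: P(T) is the softmax probability of the top option,
    computed over the candidate answer options only.
    """
    max_logit = max(option_logits.values())
    # Subtract the max logit before exponentiating for numerical stability.
    exps = {opt: math.exp(logit - max_logit) for opt, logit in option_logits.items()}
    total = sum(exps.values())
    probs = {opt: value / total for opt, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Hypothetical logits for the four answer letters of one question.
logits = {"A": 2.0, "B": 5.5, "C": 1.0, "D": 0.5}
best, p_t = token_confidence(logits)
print(best, round(p_t, 3))
```

With an API that exposes only log-probabilities rather than logits, the same score could be obtained by exponentiating the log-probability of the chosen answer token.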

Problem Statement

Large LLMs give better answers but cost a lot. Running the biggest model for every query is expensive. We need a cheap, reliable way to decide when a small model is good enough and when to call a bigger model.

Main Contribution

A practical two-signal routing rule using P(T) (token-prob confidence) and P(IK) (a classifier on hidden states) to decide escalation between models.

Empirical demonstration on MMLU and PopQA that cascades (e.g., 8B→70B) cut compute or token cost 20–60% while keeping accuracy near the largest model.
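The two-signal rule can be sketched as a small routing function: run the small model, then escalate when either score falls below its threshold. This is a minimal sketch, not the authors' implementation. The `SmallModelOutput` type, the stub models, and the 0.5 threshold for P(IK) are assumptions for illustration; the 0.9 threshold for P(T) follows the starting point suggested elsewhere in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class SmallModelOutput:
    answer: str
    p_t: float   # token-level confidence of the chosen answer
    p_ik: float  # "I know" score from the hidden-state classifier

def should_escalate(out: SmallModelOutput,
                    t_threshold: float = 0.9,
                    ik_threshold: float = 0.5) -> bool:
    """Escalate when either confidence signal is below its threshold."""
    return out.p_t < t_threshold or out.p_ik < ik_threshold

def cascade(query: str,
            small_model: Callable[[str], SmallModelOutput],
            large_model: Callable[[str], str],
            t_threshold: float = 0.9,
            ik_threshold: float = 0.5) -> Tuple[str, str]:
    """Return (answer, which_model_answered)."""
    out = small_model(query)
    if should_escalate(out, t_threshold, ik_threshold):
        return large_model(query), "large"
    return out.answer, "small"

# Hypothetical stand-ins for real model calls.
def stub_small(query: str) -> SmallModelOutput:
    if query == "easy":
        return SmallModelOutput(answer="B", p_t=0.97, p_ik=0.88)
    return SmallModelOutput(answer="C", p_t=0.55, p_ik=0.30)

def stub_large(query: str) -> str:
    return "A"

print(cascade("easy", stub_small, stub_large))
print(cascade("hard", stub_small, stub_large))
```

Using a logical OR keeps the rule conservative: the small model's answer is retained only when both signals agree that it is reliable.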

Key Findings

8B→70B cascade matches 70B accuracy with much less compute

Numbers: Acc: 8B→70B 83.22% vs 70B 83.57%; Reduced CC 36.46%; PD -0.35%

Practical Use: Use an 8B model first and escalate to 70B when confidence is low to save ~36% compute while keeping accuracy effectively unchanged on MMLU.

Evidence Ref: Table 1, Table 15

70B→GPT-4o routing reduces API token usage by about 60%

Numbers: Reduced Tokens ≈59.96%; Acc: 70B→GPT-4o 86.85% vs GPT-4o 86.43%

Practical Use: Put a 70B model as an intermediate step before calling GPT-4o to cut token/API cost by ~60% with no loss in accuracy on the evaluated tasks.

Evidence Ref: Table 4, Table 13
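To reason about where savings of this kind come from, a simple expected-cost model can help. The sketch below assumes the small model runs on every query and the large model runs only on escalated ones, so the per-query cost is additive. This is an illustrative assumption, not the paper's cost accounting, and the cost units and escalation rate are hypothetical.

```python
def cascade_cost_reduction(cost_small, cost_large, escalation_rate):
    """Fraction of cost saved versus always calling the large model.

    Assumes expected cost per query = cost_small + escalation_rate * cost_large.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    if cost_large <= 0:
        raise ValueError("cost_large must be positive")
    expected = cost_small + escalation_rate * cost_large
    return 1.0 - expected / cost_large

def break_even_escalation_rate(cost_small, cost_large):
    """Escalation rate above which the cascade costs more than the large model alone."""
    return 1.0 - cost_small / cost_large

# Hypothetical relative costs: the large model is 10x the small one per query.
reduction = cascade_cost_reduction(cost_small=1.0, cost_large=10.0, escalation_rate=0.4)
limit = break_even_escalation_rate(cost_small=1.0, cost_large=10.0)
print(f"cost reduction: {reduction:.0%}, break-even escalation rate: {limit:.0%}")
```

Under this model the savings shrink linearly as the escalation rate rises, which is why an overly conservative router erodes the benefit.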

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 8B→70B 83.22% | 70B 83.57% | -0.35% PD | MMLU | Table 1 shows 8B→70B 83.22% vs 70B 83.57%; reduced compute 36.46%. | Table 1
Compute reduction (8B→70B) | 36.46% reduced CC | 70B compute | | MMLU | Table 1 reports 36.46% reduced computational cost for 8B→70B. | Table 1

What To Try In 7 Days

Measure P(T) with token probabilities for your multiple-choice or formatted responses and set a threshold around 0.9.

Train a small P(IK) classifier using hidden states on held-out examples to stabilize routing decisions.

Pilot one cascade (e.g., 8B→70B or 70B→GPT-4o) and compare token/compute use and end-to-end latency against always calling the large model.
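For the second step above, a P(IK) classifier can be as small as a logistic regression over hidden-state features, trained on whether the small model answered each held-out example correctly. The sketch below is a standard-library stand-in under stated assumptions: the summary says only that P(IK) is a small classifier on hidden states, so the model choice, the 4-dimensional synthetic features, and all names here are hypothetical. Real hidden states would have thousands of dimensions and come from the small model itself.

```python
import math
import random

def sigmoid(z):
    # Clamp the input to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def p_ik(x, w, b):
    """Predicted probability that the small model 'knows' the answer."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_p_ik(features, labels, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = p_ik(x, w, b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic stand-in for hidden states: label 1 means the small model was correct.
random.seed(0)

def make_example(known):
    center = 1.0 if known else -1.0
    return [random.gauss(center, 1.0) for _ in range(4)]

labels = [i % 2 for i in range(200)]
features = [make_example(bool(y)) for y in labels]

# Train on the first 150 examples and evaluate on the remaining 50.
w, b = train_p_ik(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum((p_ik(x, w, b) >= 0.5) == bool(y) for x, y in held_out) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

In practice the held-out evaluation should use examples from the target domain, since the classifier may not transfer to very different data.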

Optimization Features

Token Efficiency
Token usage reduction via cascades
Inference Optimization
Model Cascades, Model Routing, Efficient Inference, Token Efficiency, Compute Cost Estimation

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

MMLU, GPQA, PopQA

Risks & Boundaries

Limitations

P(IK) requires labeled examples and may not generalize to very different domains.

Experiments focus on NLU and multiple-choice; generative tasks need special handling and extra evaluation.

When Not To Use

Real-time systems that cannot accept extra latency from escalations.

Domains where you cannot obtain training data for a reliable P(IK) classifier and black-box signals are weak.

Failure Modes

The small model is overconfident and keeps wrong answers instead of escalating them.

P(IK) classifier is conservative on OOD data, routing many queries to large models and reducing savings.
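Both failure modes can be checked on a labeled validation set before deployment by sweeping the routing threshold. The sketch below reports, for each threshold, the share of queries escalated and the accuracy of the answers the small model keeps. Low kept accuracy at a high threshold points to overconfidence, and a high escalation rate points to lost savings. The records and thresholds are hypothetical, and the same function works for either a P(T) or a P(IK) score.

```python
def sweep_thresholds(records, thresholds):
    """records: list of (confidence_score, small_model_correct) pairs.

    Returns one (threshold, escalation_rate, kept_accuracy) row per threshold;
    kept_accuracy is None when every query is escalated.
    """
    rows = []
    for t in thresholds:
        kept = [correct for score, correct in records if score >= t]
        escalation_rate = 1.0 - len(kept) / len(records)
        kept_accuracy = sum(kept) / len(kept) if kept else None
        rows.append((t, escalation_rate, kept_accuracy))
    return rows

# Hypothetical validation records: (confidence score, was the small model right?)
records = [
    (0.99, True), (0.95, True), (0.92, False),
    (0.85, True), (0.60, False), (0.40, False),
]
rows = sweep_thresholds(records, [0.5, 0.9, 0.97])
for t, esc, acc in rows:
    print(f"threshold={t:.2f} escalated={esc:.0%} kept_accuracy={acc:.2f}")
```

Repeating the sweep on out-of-domain samples gives an early warning of how far the escalation rate drifts when the input distribution changes.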

Core Entities

Models

Llama-3.2-3B-Instruct, Meta-Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, Qwen3-4B, Qwen3-8B, Qwen3-32B, GPT-4o

Metrics

Accuracy, Reduced CC (GFLOPs), Tokens consumed, Hallucination rate, Macro-F1, AUROC (P(IK) classifier)

Datasets

MMLU, GPQA, PopQA

Benchmarks

MMLU, GPQA, PopQA