Use token-level and hidden-state confidence to route queries to smaller models and cut inference cost with little accuracy loss

Overview

Decision Snapshot: Needs Validation

The method is simple and implementable: P(T) needs only token probabilities, and P(IK) is a small classifier on hidden states. It reduces compute and token cost in the reported experiments, but it requires per-domain validation and may add latency for escalated queries.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can cut API and compute bills substantially by asking a small model if it 'knows' and how confident its token choice is, then only escalate uncertain queries to a bigger model.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, CEO

Summary TLDR

The paper introduces a practical routing method that asks a small model whether it 'knows' the answer (P(IK)) and how confident it is in its chosen answer token (P(T)). If either score is low, the query is escalated to a larger model. On MMLU this saves about 20–40% of compute versus always using the largest model while keeping accuracy nearly the same (for example, the 8B→70B cascade reaches 83.22% versus 83.57% for 70B alone). With GPT-4o as the final stage, a 70B→GPT-4o cascade cut token use by ≈60% with comparable accuracy. P(T) needs only token probabilities; P(IK) is a small classifier built from hidden states.
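The escalation rule described above can be sketched in a few lines. This is an illustration, not the authors' code: the names `SmallModelOutput`, `should_escalate` and `route` are hypothetical, the 0.9 default for P(T) follows the threshold suggested later in this summary, and the 0.5 default for P(IK) is an assumption that would need tuning per domain.

```python
from dataclasses import dataclass


@dataclass
class SmallModelOutput:
    answer: str   # answer produced by the small model
    p_t: float    # P(T): probability of the chosen answer token
    p_ik: float   # P(IK): classifier estimate that the model "knows"


def should_escalate(out, t_threshold=0.9, ik_threshold=0.5):
    """Escalate when EITHER signal is below its threshold.

    Both default thresholds are illustrative assumptions.
    """
    return out.p_t < t_threshold or out.p_ik < ik_threshold


def route(out, call_large_model, t_threshold=0.9, ik_threshold=0.5):
    """Return (answer, escalated).

    call_large_model is a zero-argument callable supplied by the caller;
    it is only invoked when the small model's answer is not trusted.
    """
    if should_escalate(out, t_threshold, ik_threshold):
        return call_large_model(), True
    return out.answer, False
```

Passing the large model as a callable keeps the expensive call lazy, so confident queries never pay for it.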

Problem Statement

Large LLMs give better answers but cost a lot. Running the biggest model for every query is expensive. We need a cheap, reliable way to decide when a small model is good enough and when to call a bigger model.

Main Contribution

A practical two-signal routing rule using P(T) (token-prob confidence) and P(IK) (a classifier on hidden states) to decide escalation between models.

Empirical demonstration on MMLU and PopQA that cascades (e.g., 8B→70B) cut compute or token cost by 20–60% while keeping accuracy near that of the largest model.
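The P(IK) signal named in the first contribution is a classifier on hidden states. This summary does not state the classifier's architecture or which layer it reads, so the sketch below assumes the simplest option: a logistic-regression probe trained on one hidden-state vector per question, labeled 1 when the small model answered correctly. The function names and the toy two-dimensional data are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_ik(x, w, b):
    """Probe output: estimated probability that the model 'knows'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_pik_probe(hidden_states, knew_answer, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent.

    hidden_states: list of equal-length float vectors (one per question)
    knew_answer:   list of 0/1 labels (1 = small model was correct)
    Returns (weights, bias).
    """
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(hidden_states, knew_answer):
            err = p_ik(x, w, b) - y  # gradient of log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy stand-in for real hidden states (which have thousands of dimensions).
states = [[1.0, 0.2], [0.9, -0.1], [0.8, 0.3],
          [-1.0, 0.1], [-0.7, -0.2], [-0.9, 0.4]]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_pik_probe(states, labels)
```

In practice the vectors would come from the small model's forward pass, and the probe should be scored on held-out questions before it is used for routing.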

Key Findings

8B→70B cascade matches 70B accuracy with much less compute

Numbers: Accuracy 8B→70B 83.22% vs 70B 83.57%; reduced computational cost (CC) 36.46%; PD -0.35%

Practical Use: Use an 8B model first and escalate to 70B when confidence is low to save ~36% compute while keeping accuracy effectively unchanged on MMLU.

Evidence Ref: Table 1, Table 15

70B→GPT-4o routing reduces API token usage by about 60%

Numbers: Reduced tokens ≈59.96%; accuracy 70B→GPT-4o 86.85% vs GPT-4o 86.43%

Practical Use: Put a 70B model as an intermediate step before calling GPT-4o to cut token/API cost by ~60% with no loss in accuracy on the evaluated tasks.

Evidence Ref: Table 4, Table 13

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	8B→70B 83.22%	70B 83.57%	-0.35% PD	MMLU	Table 1 shows 8B→70B 83.22% vs 70B 83.57%; reduced compute 36.46%.	Table 1
Compute reduction (8B→70B)	36.46% reduced CC	70B compute	—	MMLU	Table 1 reports 36.46% reduced computational cost for 8B→70B.	Table 1
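The delta column and the cost trade-off can be checked with simple arithmetic. The accuracy figures below are the ones in the table; the cost model is an assumption (every query runs the small model, and a fraction is escalated to the large one), and the paper's own GFLOPs accounting may differ. The costs 1.0 and 10.0 and the 30% escalation rate in the usage lines are hypothetical.

```python
def accuracy_delta(cascade_acc, baseline_acc):
    """Difference in percentage points, rounded to two decimals."""
    return round(cascade_acc - baseline_acc, 2)


def cascade_cost_reduction(small_cost, large_cost, escalation_rate):
    """Fractional saving versus always calling the large model.

    Assumed cost model: every query pays small_cost, and the escalated
    fraction additionally pays large_cost.
    """
    cascade_cost = small_cost + escalation_rate * large_cost
    return 1.0 - cascade_cost / large_cost


def break_even_escalation_rate(small_cost, large_cost):
    """Escalation rate above which the cascade costs more than the baseline."""
    return 1.0 - small_cost / large_cost


delta = accuracy_delta(83.22, 83.57)                 # figures from Table 1
saving = cascade_cost_reduction(1.0, 10.0, 0.3)      # hypothetical costs
limit = break_even_escalation_rate(1.0, 10.0)        # hypothetical costs
```

Under this cost model a cascade only pays off while the escalation rate stays below the break-even value, which is why the escalation rate is worth logging during a pilot.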

What To Try In 7 Days

Measure P(T) with token probabilities for your multiple-choice or formatted responses and set a threshold around 0.9.

Train a small P(IK) classifier using hidden states on held-out examples to stabilize routing decisions.

Pilot one cascade (e.g., 8B→70B or 70B→GPT-4o) and compare token/compute use and end-to-end latency against always-calling the large model.
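For the first step, P(T) on a multiple-choice question can be computed from the logits of the option tokens at the answer position. The sketch below is a minimal version under stated assumptions: the helper names are hypothetical, and normalizing over the option tokens only (rather than the full vocabulary) is a simplification that may not match the paper's exact definition. When an API returns log-probabilities instead of logits, exponentiating the log-probability of the chosen token gives the same kind of score.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_t_multiple_choice(option_logits):
    """option_logits maps each option label (e.g. "A") to the logit of
    that label's token at the answer position.

    Returns (chosen_label, P(T)), with P(T) normalized over the options
    only, which is an assumption of this sketch.
    """
    labels = list(option_logits)
    probs = softmax([option_logits[label] for label in labels])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


def p_t_from_logprob(logprob):
    """Convert a token log-probability, as returned by many APIs, to P(T)."""
    return math.exp(logprob)


# Hypothetical logits for one question.
label, p_t = p_t_multiple_choice({"A": 1.0, "B": 4.0, "C": 0.5, "D": 0.0})
keep_small_answer = p_t >= 0.9  # threshold suggested in the first step
```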

Optimization Features

Token Efficiency

Token usage reduction via cascades

Inference Optimization

Model Cascades, Model Routing, Efficient Inference, Token Efficiency, Compute Cost Estimation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/NYCU-NLP-Lab/ConfDrivenInference

Data URLs

MMLU, GPQA, PopQA

Risks & Boundaries

Limitations

P(IK) requires labeled examples and may not generalize to very different domains.

Experiments focus on NLU and multiple-choice; generative tasks need special handling and extra evaluation.

When Not To Use

Real-time systems that cannot accept extra latency from escalations.

Domains where you cannot obtain training data for a reliable P(IK) classifier and black-box signals are weak.

Failure Modes

The small model is overconfident on wrong answers, so they are kept instead of being escalated.

The P(IK) classifier is conservative on out-of-distribution (OOD) data, routing many queries to large models and reducing savings.
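Both failure modes show up in a threshold sweep on held-out data: overconfident wrong answers pull down the accuracy of retained answers, and an overly cautious threshold pushes the escalation rate up. The sketch below is one hedged way to run that check; `sweep_threshold`, the candidate thresholds, the 0.9 target and the validation records are all hypothetical.

```python
def sweep_threshold(records, candidates, target_retained_acc):
    """Pick the lowest threshold whose retained answers are accurate enough.

    records:    list of (confidence, small_model_was_correct) pairs
                measured on a validation set
    candidates: thresholds to try
    Returns (threshold, retained_accuracy, escalation_rate), or None if
    no candidate meets the target.
    """
    for t in sorted(candidates):
        kept = [ok for conf, ok in records if conf >= t]
        if not kept:
            continue  # everything escalated; nothing to evaluate
        retained_acc = sum(kept) / len(kept)
        if retained_acc >= target_retained_acc:
            escalation_rate = 1.0 - len(kept) / len(records)
            return t, retained_acc, escalation_rate
    return None


# Hypothetical validation records; (0.93, False) is an overconfident error.
records = [(0.99, True), (0.95, True), (0.93, False), (0.85, True),
           (0.70, False), (0.60, False), (0.97, True), (0.91, True)]
result = sweep_threshold(records, [0.5, 0.7, 0.9, 0.95], 0.9)
```

On this toy data the single overconfident error at 0.93 forces the threshold up to 0.95, which escalates five of the eight queries. The same sweep run on a new domain shows quickly whether savings survive there.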

Core Entities

Models

Llama-3.2-3B-Instruct, Meta-Llama-3.1-8B-Instruct, Llama-3.3-70B-Instruct, Qwen3-4B, Qwen3-8B, Qwen3-32B, GPT-4o

Metrics

Accuracy, Reduced CC (GFLOPs), Tokens consumed, Hallucination rate, Macro-F1, AUROC (P(IK) classifier)

Datasets

MMLU, GPQA, PopQA

Benchmarks

MMLU, GPQA, PopQA

