Use token-level and hidden-state confidence to route queries to smaller models and cut inference cost with little accuracy loss

February 25, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen

Links

Abstract / PDF

Why It Matters For Business

You can cut API and compute bills substantially by checking whether a small model 'knows' the answer and how confident it is in its chosen token, then escalating only the uncertain queries to a bigger model.

Summary TLDR

The paper introduces a practical routing method that asks a small model if it 'knows' (P(IK)) and how confident its chosen answer token is (P(T)). If either score is low, the query is escalated to a larger model. On MMLU this saves about 20–40% compute versus always using the largest model while keeping accuracy nearly the same (example: 8B→70B 83.22% vs 70B 83.57%). Applied to GPT-4o, a 70B→GPT-4o cascade cut token use by ≈60% with comparable accuracy. P(T) needs only token probabilities; P(IK) is a small classifier built from hidden states.
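The P(T) signal and the "escalate if either score is low" rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the renormalization over the listed answer options, and the 0.5 default for the P(IK) threshold are all assumptions made for the example. The 0.9 default for P(T) follows the threshold suggested later in this summary.

```python
import math


def p_t_from_logprobs(option_logprobs):
    """Return (best_option, P(T)) from log-probabilities of candidate answer tokens.

    Assumption: probabilities are renormalized over the listed options only,
    e.g. the four letters of a multiple-choice question.
    """
    top = max(option_logprobs.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = {opt: math.exp(lp - top) for opt, lp in option_logprobs.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def should_escalate(p_t, p_ik, t_threshold=0.9, ik_threshold=0.5):
    """Escalate when either confidence signal falls below its threshold.

    ik_threshold=0.5 is a hypothetical default, not a value from the paper.
    """
    return p_t < t_threshold or p_ik < ik_threshold
```

For example, a small model that puts 0.92 of the option mass on "A" and has a P(IK) score of 0.8 keeps its answer, while the same token confidence with a P(IK) score of 0.3 is escalated.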

Problem Statement

Larger LLMs give better answers but cost more to run, so sending every query to the biggest model is expensive. We need a cheap, reliable way to decide when a small model is good enough and when to call a bigger one.

Main Contribution

A practical two-signal routing rule using P(T) (token-prob confidence) and P(IK) (a classifier on hidden states) to decide escalation between models.

Empirical demonstration on MMLU and PopQA that cascades (e.g., 8B→70B) cut compute or token cost 20–60% while keeping accuracy near the largest model.

Ablations and OOD tests (GPQA) showing P(IK) improves stability and that P(T)-only routing still works when internal states are unavailable.
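A cascade built on these two signals might look like the loop below. It is a sketch under stated assumptions: `StageOutput`, `run_cascade`, and the stage callables are hypothetical names, each stage is assumed to return its answer together with its scores, and a stage that cannot expose hidden states reports `p_ik=None`, in which case the decision falls back to P(T) alone.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class StageOutput(NamedTuple):
    answer: str
    p_t: float              # token-probability confidence
    p_ik: Optional[float]   # hidden-state classifier score, None if unavailable


# A stage wraps one model: it takes a query and returns an answer plus scores.
Stage = Callable[[str], StageOutput]


def run_cascade(
    query: str,
    stages: Sequence[Stage],
    t_threshold: float = 0.9,
    ik_threshold: float = 0.5,   # hypothetical default
) -> Tuple[str, int]:
    """Try stages from smallest to largest; return (answer, index of answering stage)."""
    if not stages:
        raise ValueError("need at least one stage")
    for i, stage in enumerate(stages):
        out = stage(query)
        is_last = i == len(stages) - 1
        # P(T)-only routing when no P(IK) score is available for this stage.
        ik_ok = out.p_ik is None or out.p_ik >= ik_threshold
        if is_last or (out.p_t >= t_threshold and ik_ok):
            return out.answer, i
    raise AssertionError("unreachable: the last stage always answers")
```

The last stage always answers, so the loop never escalates past the largest model.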

Key Findings

8B→70B cascade matches 70B accuracy with much less compute

Numbers: Acc: 8B→70B 83.22% vs 70B 83.57%; Reduced CC 36.46%; PD -0.35%

70B→GPT-4o routing reduces API token usage by about 60%

Numbers: Reduced Tokens ≈59.96%; Acc: 70B→GPT-4o 86.85% vs GPT-4o 86.43%

Training a P(IK) classifier improves routing stability

Numbers: 8B P(IK) classifier AUROC 81.78%; Ablation: 8B→70B PD w/ P(IK) -0.35% vs -1.26% w/o

Method generalizes but is conservative on hard OOD data

Numbers: GPQA: 8B→70B Acc 51.79% vs 70B 52.23%; Reduced CC 3.93%

Open-ended QA shows cost-accuracy trade-offs and reduced hallucination with cascades

Numbers: PopQA: 8B→70B Acc 0.6459 (PD -1.91%), Hallucination 0.3422, Reduced Cost 7.11%
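The savings figures above depend on how often the small model's answer is kept. A simple cost model makes the relationship explicit. It assumes every query pays for the small model and escalated queries additionally pay for the large one. The paper's exact accounting for Reduced CC is not described in this summary, so the function names and the per-query costs in the example are illustrative only.

```python
def cascade_cost_reduction(cost_small, cost_large, escalation_rate):
    """Fraction of cost saved versus always calling the large model.

    Assumed cost model: every query pays cost_small, and the escalated
    fraction also pays cost_large.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    cascade_cost = cost_small + escalation_rate * cost_large
    return 1.0 - cascade_cost / cost_large


def break_even_escalation_rate(cost_small, cost_large):
    """Escalation rate above which the cascade costs more than the large model alone."""
    return 1.0 - cost_small / cost_large


# Hypothetical costs: small model 1 unit per query, large model 10 units.
saving = cascade_cost_reduction(1.0, 10.0, 0.5)        # about 0.4
limit = break_even_escalation_rate(1.0, 10.0)          # about 0.9
```

Under this model a very conservative router, like the one observed on GPQA, can push the escalation rate close to the break-even point and leave little saving.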

Results

Accuracy (MMLU)

Value: 8B→70B 83.22%

Baseline: 70B 83.57%

Compute reduction (8B→70B)

Value: 36.46% reduced CC

Baseline: 70B compute

GPT-4o token savings (70B→GPT-4o)

Value: 59.96% reduced tokens

Baseline: GPT-4o alone

P(IK) classifier performance (8B)

Value: Accuracy 76.50%, AUROC 81.78%

PopQA (8B→70B)

Value: Acc 0.6459, Hallucination 0.3422, Reduced Cost 7.11%

Baseline: 70B Acc 0.6585
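The AUROC reported for the P(IK) classifier measures how well its scores rank queries the small model answers correctly above those it gets wrong. For readers who want to reproduce that metric on their own held-out set, here is a small dependency-free sketch using the pairwise definition. The function name and the toy inputs are assumptions, and the quadratic loop is only suitable for small evaluation sets.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative.

    labels: 1 if the small model answered correctly, 0 otherwise.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # 0.75
```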

Who Should Care

What To Try In 7 Days

Measure P(T) with token probabilities for your multiple-choice or formatted responses and set a threshold around 0.9.

Train a small P(IK) classifier using hidden states on held-out examples to stabilize routing decisions.

Pilot one cascade (e.g., 8B→70B or 70B→GPT-4o) and compare token/compute use and end-to-end latency against always-calling the large model.
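For the P(IK) step, a logistic-regression probe is one simple starting point. The sketch below is not the paper's classifier, whose architecture is not specified in this summary. It assumes you have already extracted one fixed-length feature vector per held-out query (for instance a hidden state from the small model) and a 0/1 label for whether the small model answered correctly. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.1, epochs=200, seed=0):
    """Fit logistic regression with plain SGD.

    features: list of equal-length float vectors (assumed hidden-state features).
    labels:   1 if the small model was correct on that query, else 0.
    Returns (weights, bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wj - lr * grad * xj for wj, xj in zip(w, x)]
            b -= lr * grad
    return w, b


def p_ik(w, b, x):
    """Probe's estimate of the probability that the small model 'knows' the answer."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

In practice the feature vectors are high-dimensional, so a vectorized library implementation with regularization would be the more realistic choice; the pure-Python version only shows the shape of the training loop.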

Optimization Features

Token Efficiency

  • Token usage reduction via cascades

Inference Optimization

  • Model Cascades
  • Model Routing
  • Efficient Inference
  • Token Efficiency
  • Compute Cost Estimation

Reproducibility

Data Urls

  • MMLU
  • GPQA
  • PopQA

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • P(IK) requires labeled examples and may not generalize to very different domains.
  • Experiments focus on NLU and multiple-choice; generative tasks need special handling and extra evaluation.
  • Cascading can add latency for queries that escalate through multiple models.

When Not To Use

  • Real-time systems that cannot accept extra latency from escalations.
  • Domains where you cannot obtain training data for a reliable P(IK) classifier and black-box signals are weak.
  • Workflows where token pricing or latency makes multi-call cascades more expensive than single large-model calls.

Failure Modes

  • The small model is overconfident and keeps wrong answers instead of escalating them.
  • P(IK) classifier is conservative on OOD data, routing many queries to large models and reducing savings.
  • Error accumulation in long cascades (e.g., 3B→8B→70B) causes larger performance drops.
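One way to check for the overconfidence and over-escalation failure modes before deployment is to sweep the confidence threshold on held-out data and look at both the escalation rate and the accuracy of the answers the small model keeps. The helper below is an illustrative sketch with hypothetical names and toy inputs, not a procedure taken from the paper.

```python
def sweep_thresholds(confidences, correct, thresholds):
    """For each threshold, return (threshold, escalation_rate, retained_accuracy).

    confidences: small-model confidence per held-out query (e.g. P(T)).
    correct:     1 if the small model's answer was right, else 0.
    retained_accuracy is None when every query is escalated.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one held-out example")
    rows = []
    for t in thresholds:
        # Answers at or above the threshold are kept; the rest are escalated.
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        escalation_rate = 1.0 - len(kept) / n
        retained_accuracy = sum(kept) / len(kept) if kept else None
        rows.append((t, escalation_rate, retained_accuracy))
    return rows


# Toy held-out set: one confident answer (0.92) is wrong.
rows = sweep_thresholds([0.95, 0.92, 0.70, 0.99], [1, 0, 1, 1], [0.5, 0.93])
# rows == [(0.5, 0.0, 0.75), (0.93, 0.5, 1.0)]
```

A low retained accuracy at a high threshold points to overconfidence, while a high escalation rate with little accuracy gain suggests the router is too conservative for that data.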

Core Entities

Models

  • Llama-3.2-3B-Instruct
  • Meta-Llama-3.1-8B-Instruct
  • Llama-3.3-70B-Instruct
  • Qwen3-4B
  • Qwen3-8B
  • Qwen3-32B
  • GPT-4o

Metrics

  • Accuracy
  • Reduced CC (GFLOPs)
  • Tokens consumed
  • Hallucination rate
  • Macro-F1
  • AUROC (P(IK) classifier)

Datasets

  • MMLU
  • GPQA
  • PopQA

Benchmarks

  • MMLU
  • GPQA
  • PopQA