Route queries by model uncertainty (semantic entropy) to cut cloud calls and keep human-preferred quality

February 16, 2025 · 5 min

Overview

Decision Snapshot: Needs Validation

The paper shows measurable cost and judge-score gains on three benchmarks using SE-based labels, but evaluation is text-only and omits router compute/latency costs.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Tuo Zhang, Asal Mehradfar, Dimitrios Dimitriadis, Salman Avestimehr

Why It Matters For Business

Route by model uncertainty to lower cloud API spend while maintaining or improving human-preferred response quality.

Summary TLDR

This paper introduces the Confidence-Driven LLM Router. It computes semantic entropy (SE), an uncertainty score measured over clusters of semantically equivalent outputs, to decide when to keep an answer on a small on-device model and when to call a larger cloud LLM. SE differences are converted into preference labels, which are used to train lightweight routers (kNN, SW, MF, MLP). On MT-Bench, GSM8K and MMLU the method reduces the strong-model calls needed to reach a target quality (lower CPT) and slightly raises LLM-as-a-judge ratings. Evaluations are text-only and do not measure router compute overhead.

Problem Statement

Edge-cloud deployments must balance API/cloud cost against response quality. Human preference labels are costly and noisy; binary accuracy ignores confidence. We need a cheap, reliable signal that tells when to offload to a stronger model.

Main Contribution

Confidence-Driven LLM Router: use semantic entropy (SE) as a routing signal to decide on-device vs cloud calls.

Practical pipeline: cluster outputs with a bidirectional entailment classifier, compute SE, turn SE differences into preference labels, and train lightweight routers.
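The clustering and entropy steps of this pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_entails` is a hypothetical stand-in for a real entailment classifier (the digest lists DeBERTa-large), and cluster probabilities are estimated from sample counts, which is an assumption made here for simplicity.

```python
import math
from typing import Callable, List


def cluster_by_entailment(
    outputs: List[str], entails: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group outputs that entail each other in both directions."""
    clusters: List[List[str]] = []
    for text in outputs:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional check: both texts must entail one another.
            if entails(text, representative) and entails(representative, text):
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def semantic_entropy(outputs: List[str], entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over cluster frequencies of the sampled outputs."""
    clusters = cluster_by_entailment(outputs, entails)
    probs = [len(c) / len(outputs) for c in clusters]
    return -sum(p * math.log(p) for p in probs)


def toy_entails(a: str, b: str) -> bool:
    # Hypothetical stand-in: treats normalised string equality as entailment.
    # A real pipeline would call an NLI model here.
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")


samples = ["Paris", "paris.", "Paris", "Lyon"]  # made-up model samples
se = semantic_entropy(samples, toy_entails)
print(f"semantic entropy = {se:.4f}")
```

Three of the four samples fall into one cluster, so the entropy is low but nonzero. A model that always gives the same meaning scores zero.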

Key Findings

SE-based routing greatly reduces strong-model calls on MT-Bench

Numbers: CPT(50%) = 27.31% (Confidence SW) vs Random 51.29%

Practical Use: Routing by uncertainty can cut cloud usage roughly in half on MT-Bench-style tasks; implement SE+SW to reduce API costs.

Evidence Ref: Table 1

Lower overall API cost for same target improvement

Numbers: MT-Bench CPT(80%) cost $3.74 vs Random $4.06, TO-Router $3.88, RouteLLM $4.04

Practical Use: SE-based routing saved about $0.14–$0.32 per evaluated batch relative to these baselines; useful where API spend is measurable.

Evidence Ref: Section 3.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
CPT(50%) | 27.31% | Random 51.29% | −23.98 pp | MT-Bench | Confidence-Driven (SW) CPT(50%) = 27.31 in Table 1 | Table 1
API cost (MT-Bench, CPT(80%)) | $3.74 | Random $4.06 | −$0.32 | MT-Bench | Reported USD costs in Section 3.2 | Section 3.2
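
The deltas can be rechecked from the figures reported on this page. The short script below is only arithmetic on those numbers; the variable names are chosen here for illustration.

```python
# Figures copied from the findings above (MT-Bench).
cpt50_confidence_sw = 27.31  # % strong-model calls, Confidence-Driven (SW)
cpt50_random = 51.29         # % strong-model calls, Random router

delta_pp = round(cpt50_confidence_sw - cpt50_random, 2)
relative_change = delta_pp / cpt50_random

cost_confidence = 3.74  # USD at CPT(80%)
baseline_costs = {"Random": 4.06, "TO-Router": 3.88, "RouteLLM": 4.04}
savings = {name: round(cost - cost_confidence, 2) for name, cost in baseline_costs.items()}

print(f"CPT(50%) delta: {delta_pp} pp ({relative_change:.1%} relative)")
for name, saved in savings.items():
    print(f"saving vs {name}: ${saved:.2f}")
```

The relative drop in strong-model calls is a little under one half, and the cost saving depends on which baseline is used for comparison.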

What To Try In 7 Days

Compute semantic entropy: cluster model outputs using an entailment classifier and measure cluster probability entropy.

Build SE-based preference labels with a tunable threshold tau to mark ties.

Train a lightweight router (kNN, SW, or small MLP) on embeddings and test CPT(50/80) targets to measure cost trade-offs.
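
The labeling and router steps above can be prototyped in a few lines. The sketch below rests on assumptions not confirmed by this page: the label rule (lower SE wins, differences within tau are ties), the 2-D toy embeddings, the SE values, and the function names are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def se_preference_label(se_weak: float, se_strong: float, tau: float) -> str:
    """Assumed rule: the model with lower SE is preferred; |diff| <= tau is a tie."""
    diff = se_weak - se_strong
    if abs(diff) <= tau:
        return "tie"
    return "strong" if diff > 0 else "weak"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_strong_win_rate(
    query: Sequence[float], train: List[Tuple[Sequence[float], str]], k: int = 3
) -> float:
    """Average 'strong preferred' score over the k most similar training queries."""
    neighbours = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    score = {"strong": 1.0, "tie": 0.5, "weak": 0.0}
    return sum(score[label] for _, label in neighbours) / len(neighbours)


def route(query, train, k: int = 3, threshold: float = 0.5) -> str:
    return "cloud" if knn_strong_win_rate(query, train, k) > threshold else "device"


TAU = 0.2  # hypothetical tie threshold
# (embedding, SE of small model, SE of large model): made-up values
raw = [
    ([1.0, 0.0], 1.2, 0.3),
    ([0.9, 0.1], 0.9, 0.2),
    ([0.0, 1.0], 0.1, 0.15),
    ([0.1, 0.9], 0.1, 0.6),
    ([0.2, 0.8], 0.05, 0.5),
]
train = [(emb, se_preference_label(w, s, TAU)) for emb, w, s in raw]

print(route([1.0, 0.05], train))  # query near examples where the small model was uncertain
print(route([0.05, 1.0], train))  # query near examples where the small model was confident
```

Sweeping `threshold` and recording the fraction of queries sent to the cloud gives the cost side of a CPT-style trade-off curve; the quality side still needs a judge or benchmark score.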

Optimization Features

Infra Optimization: Reduces API call volume
Model Optimization: Model Routing
System Optimization: Edge-cloud offloading decision
Training Optimization: SE-based preference labeling
Inference Optimization: Cost-aware routing, Efficient Inference

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation is limited to text queries; multimodal routing not studied.

Computational overhead and latency of router architectures are not analyzed.

When Not To Use

When inputs are multimodal (images + text) without validating SE for those modalities.

When router compute or latency would negate savings from fewer cloud calls.
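
A rough break-even check makes the second condition concrete. The formula and all prices below are illustrative assumptions, not figures from the paper; only the roughly 24-point reduction in strong-model calls echoes the MT-Bench CPT(50%) result reported above.

```python
def net_saving_per_query(
    calls_avoided_frac: float, cloud_cost_per_call: float, router_cost_per_query: float
) -> float:
    """Expected saving per query once the router's own cost is subtracted."""
    return calls_avoided_frac * cloud_cost_per_call - router_cost_per_query


def break_even_router_cost(calls_avoided_frac: float, cloud_cost_per_call: float) -> float:
    """Router cost per query above which routing loses money."""
    return calls_avoided_frac * cloud_cost_per_call


calls_avoided = 0.24   # share of queries no longer sent to the cloud (illustrative)
cloud_cost = 0.01      # USD per cloud call (hypothetical)
router_cost = 0.001    # USD per query for embedding + router inference (hypothetical)

net = net_saving_per_query(calls_avoided, cloud_cost, router_cost)
limit = break_even_router_cost(calls_avoided, cloud_cost)
print(f"net saving per query: ${net:.4f}; break-even router cost: ${limit:.4f}")
```

With these made-up prices the router pays for itself, but the margin vanishes if its per-query cost approaches the break-even value. Latency needs a separate budget, since this sketch covers cost only.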

Failure Modes

Entailment classifier mis-clustering causes wrong SE and misroutes.

Threshold tau miscalibration leads to too many or too few cloud calls.
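
The sensitivity to tau is easy to see on a handful of examples. The sketch below assumes a simple label rule (lower SE preferred, differences within tau counted as ties) and uses made-up SE pairs; it only shows how the label mix shifts as tau changes.

```python
from collections import Counter
from typing import Iterable, Tuple


def label_counts(se_pairs: Iterable[Tuple[float, float]], tau: float) -> Counter:
    """Count 'strong', 'weak' and 'tie' labels for (SE_small, SE_large) pairs."""
    counts: Counter = Counter()
    for se_small, se_large in se_pairs:
        diff = se_small - se_large
        if abs(diff) <= tau:
            counts["tie"] += 1
        elif diff > 0:
            counts["strong"] += 1  # small model is less certain
        else:
            counts["weak"] += 1    # large model is less certain
    return counts


# Made-up (SE_small, SE_large) values for five queries.
se_pairs = [(1.2, 0.3), (0.9, 0.6), (0.4, 0.35), (0.1, 0.5), (0.7, 0.1)]

for tau in (0.0, 0.25, 1.0):
    counts = label_counts(se_pairs, tau)
    print(f"tau={tau}: strong={counts['strong']} weak={counts['weak']} tie={counts['tie']}")
```

At tau = 0 even tiny SE gaps push labels toward the strong model, which biases the router toward cloud calls. At tau = 1.0 every pair becomes a tie and the labels carry no routing signal.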

Core Entities

Models

GPT-4, Mixtral-8x7B, GPT-o1, DeBERTa-large

Metrics

CPT(50%), CPT(80%), LLM-as-a-Judge score, USD API cost

Datasets

NaturalQA, TriviaQA, PopQA, MAWPS, MMLU, MT-Bench, GSM8K, Chatbot Arena

Benchmarks

MT-Bench, GSM8K, MMLU