Overview
The paper shows measurable cost reductions and judge-score gains on three benchmarks (MT-Bench, GSM8K, MMLU) using SE-based labels, but the evaluation is text-only and omits router compute and latency costs.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Route by model uncertainty to lower cloud API spend while maintaining or improving human-preferred response quality.
Who Should Care
Summary TLDR
This paper introduces the Confidence-Driven LLM Router. It computes semantic entropy (SE), an uncertainty score measured over clusters of semantically equivalent outputs, to decide when to keep an answer on a small on-device model and when to call a larger cloud LLM. SE differences generate the preference labels used to train lightweight routers (kNN, SW, MF, MLP). On MT-Bench, GSM8K and MMLU the method reduces the number of strong-model calls needed (lower CPT) and slightly raises LLM-as-a-judge ratings. Evaluations are text-only and do not measure router compute overhead.
Problem Statement
Edge-cloud deployments must balance API/cloud cost against response quality. Human preference labels are costly and noisy; binary accuracy ignores confidence. We need a cheap, reliable signal that tells when to offload to a stronger model.
Main Contribution
Confidence-Driven LLM Router: use semantic entropy (SE) as a routing signal to decide on-device vs cloud calls.
Practical pipeline: cluster outputs with a bidirectional entailment classifier, compute SE, turn SE differences into preference labels, and train lightweight routers.
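The first two pipeline steps (clustering and SE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `entails` is a hypothetical callable standing in for the entailment classifier, `toy_entails` is a string-matching placeholder, the greedy comparison against each cluster's first member is one simple clustering choice, and the uniform default for `probs` is assumed for cases where sequence likelihoods are unavailable.

```python
import math
from typing import Callable, List, Optional, Sequence


def cluster_by_entailment(
    outputs: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[int]]:
    """Greedily group outputs that entail each other in both directions."""
    clusters: List[List[int]] = []
    for i, text in enumerate(outputs):
        for cluster in clusters:
            representative = outputs[cluster[0]]
            # Bidirectional check: A entails B and B entails A.
            if entails(representative, text) and entails(text, representative):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster matched
    return clusters


def semantic_entropy(
    outputs: Sequence[str],
    entails: Callable[[str, str], bool],
    probs: Optional[Sequence[float]] = None,
) -> float:
    """Entropy over cluster probabilities (natural log)."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    if probs is None:
        # Assumption: treat every sample as equally likely.
        probs = [1.0 / len(outputs)] * len(outputs)
    total = sum(probs)
    entropy = 0.0
    for cluster in cluster_by_entailment(outputs, entails):
        p = sum(probs[i] for i in cluster) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def toy_entails(a: str, b: str) -> bool:
    """Placeholder for a real entailment model: case-insensitive match."""
    return a.strip().lower() == b.strip().lower()


agreeing = semantic_entropy(["Paris", "paris ", "Paris"], toy_entails)
split = semantic_entropy(["Paris", "Lyon"], toy_entails)
```

Samples that all land in one cluster give an SE of zero (high confidence), while samples spread evenly over several clusters give the maximum SE for that cluster count.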
Key Findings
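SE-based routing cuts the strong-model calls needed on MT-Bench: CPT(50%) falls from 51.29% (random routing) to 27.31% with the SW router.
API cost is lower for the same target improvement: $3.74 versus $4.06 for random routing at CPT(80%) on MT-Bench.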
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| CPT(50%) | 27.31% | Random 51.29% | −23.98 pp | MT-Bench | Confidence-Driven (SW) CPT(50%) = 27.31 in Table 1 | Table 1 |
| API cost (MT-Bench, CPT(80%)) | $3.74 | Random $4.06 | −$0.32 | MT-Bench | Reported USD costs in section 3.2 | Section 3.2 |
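The deltas in the table can be rechecked, and expressed as relative changes, with a few lines of arithmetic. The sketch below only reuses the four numbers reported above; the helper name is illustrative. The relative reductions come out at roughly 47% for CPT(50%) and roughly 8% for API cost.

```python
# Values copied from the Results table (MT-Bench).
cpt50_router, cpt50_random = 27.31, 51.29  # percent of strong-model calls
cost_router, cost_random = 3.74, 4.06      # USD at CPT(80%)


def absolute_and_relative_change(value: float, baseline: float):
    """Return (value - baseline, that difference as a fraction of baseline)."""
    delta = value - baseline
    return delta, delta / baseline


cpt_delta, cpt_rel = absolute_and_relative_change(cpt50_router, cpt50_random)
cost_delta, cost_rel = absolute_and_relative_change(cost_router, cost_random)

print(f"CPT(50%): {cpt_delta:+.2f} pp ({cpt_rel:+.1%} relative)")
print(f"API cost: {cost_delta:+.2f} USD ({cost_rel:+.1%} relative)")
```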
What To Try In 7 Days
Compute semantic entropy: cluster model outputs using an entailment classifier, then compute the entropy of the cluster probabilities.
Build SE-based preference labels with a tunable threshold tau to mark ties.
Train a lightweight router (kNN, SW, or small MLP) on embeddings and evaluate it at the CPT(50%) and CPT(80%) targets to measure cost trade-offs.
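The labeling and routing steps above might look like the sketch below. Several details are assumptions for illustration and are not confirmed by the source: lower SE is treated as the preferred (more confident) model, ties count as half a vote for the strong model, the kNN router uses cosine similarity over precomputed embeddings, and all function names and the 0.5 routing threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (query embedding, preference label)


def preference_label(se_weak: float, se_strong: float, tau: float) -> str:
    """Label a query from the SE gap; gaps within tau are ties."""
    diff = se_weak - se_strong
    if abs(diff) <= tau:
        return "tie"
    # Assumption: the model with lower SE (more confident) is preferred.
    return "strong" if diff > 0 else "weak"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_strong_score(query: Sequence[float], train: List[Example], k: int = 3) -> float:
    """Average vote for the strong model among the k most similar queries."""
    if not train:
        raise ValueError("training set is empty")
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = {"strong": 1.0, "tie": 0.5, "weak": 0.0}  # tie weight is an assumption
    top = ranked[:k]
    return sum(votes[label] for _, label in top) / len(top)


def route(query: Sequence[float], train: List[Example], k: int = 3,
          threshold: float = 0.5) -> str:
    """Send the query to the strong model only if the kNN score clears the threshold."""
    return "strong" if knn_strong_score(query, train, k) > threshold else "weak"


train_set: List[Example] = [
    ([1.0, 0.0], preference_label(1.2, 0.3, tau=0.1)),  # strong
    ([0.9, 0.1], preference_label(1.0, 0.2, tau=0.1)),  # strong
    ([0.0, 1.0], preference_label(0.1, 0.9, tau=0.1)),  # weak
    ([0.1, 0.9], preference_label(0.2, 0.8, tau=0.1)),  # weak
]
decision_a = route([1.0, 0.05], train_set, k=2)
decision_b = route([0.0, 1.0], train_set, k=2)
```

Raising `threshold` sends fewer queries to the strong model, which is the knob to vary when tracing out a cost versus quality curve.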
Risks & Boundaries
Limitations
Evaluation is limited to text queries; multimodal routing not studied.
Computational overhead and latency of router architectures are not analyzed.
When Not To Use
When inputs are multimodal (images + text) without validating SE for those modalities.
When router compute or latency would negate savings from fewer cloud calls.
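A rough break-even check for the second condition can be written as a one-line cost model. This is an assumed model, not one taken from the paper: it treats the reduction in the fraction of strong-model calls as the only source of savings, charges a fixed router cost on every query, and uses hypothetical per-call prices. The call fractions reuse the CPT(50%) values from the Results table.

```python
def net_saving_per_query(
    strong_cost: float,
    weak_cost: float,
    baseline_strong_frac: float,
    routed_strong_frac: float,
    router_cost: float,
) -> float:
    """Expected saving per query after paying for the router itself.

    Positive means routing pays off; negative means router overhead
    outweighs the avoided strong-model calls.
    """
    avoided = baseline_strong_frac - routed_strong_frac
    return avoided * (strong_cost - weak_cost) - router_cost


# Hypothetical prices: strong call 0.01 USD, on-device call treated as free.
cheap_router = net_saving_per_query(0.01, 0.0, 0.5129, 0.2731, router_cost=0.001)
costly_router = net_saving_per_query(0.01, 0.0, 0.5129, 0.2731, router_cost=0.005)
```

Under these assumed prices the cheaper router yields a positive net saving and the costlier one a negative saving. The same structure applies to latency if the costs are replaced with time budgets.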
Failure Modes
Entailment classifier mis-clustering causes wrong SE and misroutes.
Threshold tau miscalibration leads to too many or too few cloud calls.
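One way to spot tau miscalibration early is to sweep tau over a held-out set of SE pairs and watch how the label mix shifts. The sketch below assumes the same labeling rule as in the pipeline description (gaps within tau are ties, otherwise the lower-SE model wins, which is an assumption about direction), and the SE pairs are made-up illustrative values.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def label_mix(se_pairs: Sequence[Tuple[float, float]], tau: float) -> Dict[str, float]:
    """Fraction of weak / tie / strong labels for (se_weak, se_strong) pairs."""
    if not se_pairs:
        raise ValueError("need at least one SE pair")
    counts: Counter = Counter()
    for se_weak, se_strong in se_pairs:
        diff = se_weak - se_strong
        if abs(diff) <= tau:
            counts["tie"] += 1
        elif diff > 0:
            counts["strong"] += 1  # weak model is less certain
        else:
            counts["weak"] += 1
    return {name: counts[name] / len(se_pairs) for name in ("weak", "tie", "strong")}


# Illustrative (se_weak, se_strong) pairs, not data from the paper.
pairs = [(1.0, 0.2), (0.3, 0.25), (0.1, 0.6), (0.5, 0.45)]
sweep = {tau: label_mix(pairs, tau) for tau in (0.0, 0.1, 1.0)}
for tau, mix in sweep.items():
    print(tau, mix)
```

A tau that turns nearly every pair into a tie leaves the router with little signal, while a tau near zero turns small, possibly noisy SE gaps into hard preferences.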

