Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Route by model uncertainty to lower cloud API spend while maintaining or improving human-preferred response quality.
Summary TLDR
This paper introduces the Confidence-Driven LLM Router. It computes semantic entropy (SE), an uncertainty score obtained by clustering semantically equivalent outputs, to decide when to keep an answer on a small on-device model and when to call a larger cloud LLM. SE differences are converted into preference labels, which are used to train lightweight routers: kNN, similarity-weighted (SW) ranking, matrix factorization (MF), and MLP. On MT-Bench, GSM8K and MMLU the method reduces the strong-model calls needed (lower CPT) and slightly raises LLM-as-a-judge ratings. Evaluations are text-only and do not measure router compute overhead.
Problem Statement
Edge-cloud deployments must balance cloud API cost against response quality. Human preference labels are costly and noisy, and binary accuracy ignores how confident the model was. What is needed is a cheap, reliable signal that indicates when to offload a query to a stronger model.
Main Contribution
Confidence-Driven LLM Router: use semantic entropy (SE) as a routing signal to decide on-device vs cloud calls.
Practical pipeline: cluster outputs with a bidirectional entailment classifier, compute SE, turn SE differences into preference labels, and train lightweight routers.
Evaluation on MT-Bench, GSM8K and MMLU showing lower cost (CPT/USD) and higher LLM-judge quality than prior routers.
Release of the SE-built dataset on Hugging Face for reproducibility of training data.
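The clustering and entropy steps of the pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cluster_by_entailment`, `semantic_entropy`, and `toy_entails` are hypothetical names, the greedy comparison against each cluster's first member is an assumed strategy, and `toy_entails` is a string-matching stand-in for a real entailment classifier such as DeBERTa-large.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_entailment(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[int]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[int]] = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]  # compare against the cluster's first member
            if entails(rep, answer) and entails(answer, rep):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no existing cluster matched: start a new one
    return clusters


def semantic_entropy(clusters: List[List[int]], probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) over the probability mass of each meaning cluster."""
    total = sum(probs)
    entropy = 0.0
    for cluster in clusters:
        p = sum(probs[i] for i in cluster) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for an entailment model: exact match after normalizing."""
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)


samples = ["Paris", "paris.", "Lyon", "Paris"]
uniform = [1.0] * len(samples)  # assume equally likely samples
clusters = cluster_by_entailment(samples, toy_entails)
se = semantic_entropy(clusters, uniform)
```

With three of four samples in one meaning cluster, the entropy is low; a model that scattered its samples across many clusters would score higher and be a candidate for offloading.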
Key Findings
SE-based routing greatly reduces strong-model calls on MT-Bench
Lower overall API cost for same target improvement
Responses judged more human-preferable under SE routing
Human-preference training data can be data-inefficient and noisy
Results
CPT(50%)
API cost (MT-Bench, CPT(80%))
LLM-as-a-Judge score (CPT(80%))
Who Should Care
What To Try In 7 Days
Compute semantic entropy: cluster sampled model outputs using an entailment classifier, then take the entropy of the cluster probabilities.
Build SE-based preference labels with a tunable threshold tau to mark ties.
Train a lightweight router (kNN, SW, or small MLP) on query embeddings and evaluate it at the CPT(50%) and CPT(80%) targets to measure cost trade-offs.
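The labeling and router steps above might look like the sketch below. The labeling rule is an assumption, since the summary does not spell it out: a gap in SE within `tau` is a tie, otherwise the model with lower SE is preferred. `se_preference`, `knn_route`, and the 2-D embeddings are hypothetical and purely illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (query embedding, preference label)


def se_preference(se_weak: float, se_strong: float, tau: float) -> str:
    """Label a query from the SE gap; gaps within tau count as a tie (assumed rule)."""
    gap = se_weak - se_strong
    if abs(gap) <= tau:
        return "tie"
    return "strong" if gap > 0 else "weak"  # lower entropy = more confident model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_route(query: Sequence[float], train: List[Example], k: int = 3) -> str:
    """Majority vote over the k most similar labeled queries; ties stay on-device."""
    nearest = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return "strong" if votes["strong"] > votes["weak"] else "weak"


tau = 0.05
train: List[Example] = [
    ([1.0, 0.0], se_preference(0.10, 0.10, tau)),  # tie
    ([0.9, 0.1], se_preference(0.20, 0.90, tau)),  # weak model more confident
    ([0.8, 0.2], se_preference(0.15, 0.70, tau)),  # weak model more confident
    ([0.0, 1.0], se_preference(1.20, 0.30, tau)),  # strong model more confident
    ([0.1, 0.9], se_preference(1.00, 0.20, tau)),  # strong model more confident
    ([0.2, 0.8], se_preference(0.90, 0.40, tau)),  # strong model more confident
]

easy = knn_route([0.95, 0.05], train)
hard = knn_route([0.05, 0.95], train)
```

Sending vote ties to the weak model is a cost-saving default chosen for this sketch; a deployment that prioritizes quality could break ties the other way.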
Optimization Features
Infra Optimization
- Reduces API call volume
Model Optimization
- Model Routing
System Optimization
- Edge-cloud offloading decision
Training Optimization
- SE-based preference labeling
Inference Optimization
- Cost-aware routing
- Efficient Inference
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation is limited to text queries; multimodal routing not studied.
- Computational overhead and latency of router architectures are not analyzed.
- SE depends on clustering and an entailment classifier; errors there affect routing.
When Not To Use
- When inputs are multimodal (images plus text) and SE has not been validated for those modalities.
- When router compute or latency would negate savings from fewer cloud calls.
Failure Modes
- Entailment classifier mis-clustering yields incorrect SE values and misrouted queries.
- Threshold tau miscalibration leads to too many or too few cloud calls.
- LLM-as-a-judge scores can be biased and may not match real human preferences in production.
Core Entities
Models
- GPT-4
- Mixtral-8x7B
- GPT-o1
- DeBERTa-large
Metrics
- CPT(50%)
- CPT(80%)
- LLM-as-a-Judge score
- USD API cost
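The summary does not define CPT. The sketch below assumes the definition common in the routing literature: CPT(x%) is the smallest fraction of queries sent to the strong model at which the router recovers x% of the quality gap between the weak and strong models. The function name `cpt` and all numbers are made up for illustration and are not results from the paper.

```python
from typing import Optional, Sequence, Tuple


def cpt(
    points: Sequence[Tuple[float, float]],
    weak_quality: float,
    strong_quality: float,
    target: float,
) -> Optional[float]:
    """Smallest strong-call fraction whose recovered quality gap reaches `target`.

    points: (fraction of calls routed to the strong model, resulting quality).
    Returns None if no measured operating point reaches the target.
    """
    gap = strong_quality - weak_quality
    if gap <= 0:
        raise ValueError("strong model must outperform the weak model")
    for fraction, quality in sorted(points):
        recovered = (quality - weak_quality) / gap
        if recovered >= target:
            return fraction
    return None


# Hypothetical router sweep: quality measured at several routing thresholds.
sweep = [(0.0, 6.0), (0.2, 6.8), (0.4, 7.2), (0.6, 7.7), (1.0, 8.0)]
cpt_50 = cpt(sweep, weak_quality=6.0, strong_quality=8.0, target=0.5)
cpt_80 = cpt(sweep, weak_quality=6.0, strong_quality=8.0, target=0.8)
```

Under this reading, a lower CPT means fewer cloud calls for the same quality target, which is how it ties to the USD API cost metric.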
Datasets
- NaturalQA
- TriviaQA
- PopQA
- MAWPS
- MMLU
- MT-Bench
- GSM8K
- Chatbot Arena
Benchmarks
- MT-Bench
- GSM8K
- MMLU

