Route queries by model uncertainty (semantic entropy) to cut cloud calls and keep human-preferred quality

February 16, 2025 · 5 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Tuo Zhang, Asal Mehradfar, Dimitrios Dimitriadis, Salman Avestimehr

Why It Matters For Business

Route by model uncertainty to lower cloud API spend while maintaining or improving human-preferred response quality.

Summary TLDR

This paper introduces the Confidence-Driven LLM Router. It computes semantic entropy (SE), an uncertainty score measured over clusters of semantically equivalent outputs, to decide when to keep an answer on a small on-device model and when to call a larger cloud LLM. SE differences are turned into preference labels that train lightweight routers (kNN, SW, MF, MLP). On MT-Bench, GSM8K and MMLU the method needs fewer strong-model calls (lower CPT) and slightly raises LLM-as-a-judge ratings. Evaluations are text-only and do not measure router compute overhead.
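For reference, semantic entropy is usually written as the entropy of the distribution over meaning clusters, not over individual strings. The block below gives that standard form; the symbols (x for the query, s for a sampled answer, C for the set of clusters, N for the number of samples) are notation chosen here and may differ from the paper's.

```latex
% Probability mass of a meaning cluster c: sum over the sampled answers it contains
p(c \mid x) = \sum_{s \in c} p(s \mid x)

% Semantic entropy: Shannon entropy over clusters
\mathrm{SE}(x) = - \sum_{c \in C} p(c \mid x) \, \log p(c \mid x)

% Common sample-based estimate when token likelihoods are unavailable
% (assumption: N answers sampled, each weighted equally)
p(c \mid x) \approx \frac{|c|}{N}
```

A query whose samples all fall into one cluster has SE = 0; the more evenly the samples spread across distinct meanings, the higher the score.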

Problem Statement

Edge-cloud deployments must balance API/cloud cost against response quality. Human preference labels are costly and noisy; binary accuracy ignores confidence. We need a cheap, reliable signal that tells when to offload to a stronger model.

Main Contribution

Confidence-Driven LLM Router: use semantic entropy (SE) as a routing signal to decide on-device vs cloud calls.

Practical pipeline: cluster outputs with a bidirectional entailment classifier, compute SE, turn SE differences into preference labels, and train lightweight routers.

Evaluation on MT-Bench, GSM8K and MMLU showing lower cost (CPT/USD) and higher LLM-judge quality than prior routers.

Release of the SE-built dataset on Hugging Face for reproducibility of training data.
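The clustering and entropy steps of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: `cluster_by_entailment`, `semantic_entropy` and `toy_entails` are hypothetical names, the clustering is a simple greedy pass against each cluster's first member, cluster probabilities are estimated from sample counts, and the string-matching `toy_entails` merely stands in for a real entailment classifier such as an NLI model.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_entailment(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[int]]:
    """Greedy clustering: an answer joins a cluster if it and the
    cluster's first member entail each other (bidirectional check)."""
    clusters: List[List[int]] = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            representative = answers[cluster[0]]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no match: start a new meaning cluster
    return clusters


def semantic_entropy(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Entropy over meaning clusters, with p(cluster) estimated as
    cluster size / number of samples (an assumption of this sketch)."""
    if not answers:
        return 0.0
    clusters = cluster_by_entailment(answers, entails)
    n = len(answers)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)


def toy_entails(a: str, b: str) -> bool:
    """Placeholder for an entailment model: case/punctuation-insensitive match."""
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return norm(a) == norm(b)


samples = ["Paris", "paris.", "Paris", "Lyon"]
se = semantic_entropy(samples, toy_entails)
print(f"clusters: {cluster_by_entailment(samples, toy_entails)}  SE: {se:.4f}")
```

In practice `entails` would wrap a classifier call, and the number of sampled answers per query trades SE accuracy against sampling cost.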

Key Findings

SE-based routing greatly reduces strong-model calls on MT-Bench

Numbers: CPT(50%) = 27.31% (Confidence SW) vs Random 51.29%

Lower overall API cost for same target improvement

Numbers: MT-Bench CPT(80%) cost $3.74 vs Random $4.06, TO-Router $3.88, RouteLLM $4.04

Responses judged more human-preferable under SE routing

Numbers: GSM8K LLM-judge score at CPT(80%) = 89.21 vs RouteLLM 88.88 and TO-Router 85.97

Human-preference training data can be data-inefficient and noisy

Results

CPT(50%) (MT-Bench)

Value: 27.31%

Baseline: Random 51.29%

API cost (MT-Bench, CPT(80%))

Value: $3.74

Baseline: Random $4.06

LLM-as-a-Judge score (GSM8K, CPT(80%))

Value: 89.21

Baseline: RouteLLM 88.88
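To make the CPT numbers concrete, here is a small sketch of how such a metric can be computed. It assumes the definition commonly used in routing evaluations, which this summary does not spell out: CPT(x%) is the smallest percentage of queries sent to the strong model at which the router recovers x% of the quality gap between the weak and strong models. The function name `cpt` and the toy data are hypothetical.

```python
from typing import Optional, Sequence


def cpt(
    scores: Sequence[float],    # router score per query; higher = send to strong model first
    weak_q: Sequence[float],    # quality of the weak model's answer per query
    strong_q: Sequence[float],  # quality of the strong model's answer per query
    target: float,              # fraction of the weak-to-strong gap to recover, e.g. 0.5
) -> Optional[float]:
    """Smallest percentage of strong-model calls that recovers `target`
    of the quality gap (assumed definition, see lead-in)."""
    n = len(scores)
    weak_avg = sum(weak_q) / n
    gap = sum(strong_q) / n - weak_avg
    if gap <= 0:
        return None  # strong model is not better on average; metric undefined

    # Route queries to the strong model in order of decreasing router score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    total = sum(weak_q)
    for k in range(n + 1):
        if k > 0:
            i = order[k - 1]
            total += strong_q[i] - weak_q[i]  # swap in the strong answer
        recovered = (total / n - weak_avg) / gap
        if recovered >= target - 1e-12:
            return 100.0 * k / n
    return None


# Toy example: the weak model fails on queries 0 and 2, and the router ranks them highest.
scores = [0.9, 0.1, 0.8, 0.2]
weak_q = [0, 1, 0, 1]
strong_q = [1, 1, 1, 1]
print(cpt(scores, weak_q, strong_q, 0.5))  # 25.0
print(cpt(scores, weak_q, strong_q, 0.8))  # 50.0
```

Under this reading, a lower CPT means the router finds the queries that need the strong model with fewer cloud calls, which is why the 27.31% figure compares favorably with the random baseline's 51.29%.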

Who Should Care

What To Try In 7 Days

Compute semantic entropy: cluster model outputs using an entailment classifier and measure cluster probability entropy.

Build SE-based preference labels with a tunable threshold tau to mark ties.

Train a lightweight router (kNN, SW, or small MLP) on embeddings and test CPT(50/80) targets to measure cost trade-offs.
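The labeling and router steps above can be prototyped in a few lines. The sketch below is illustrative only: the tie rule (|SE_weak - SE_strong| <= tau), the class `KNNRouter`, the tie weight of 0.5, and the 2-D toy embeddings are assumptions made here, not details taken from the paper. A real setup would use sentence embeddings and SE values computed from sampled model outputs.

```python
import math
from typing import List, Sequence, Tuple


def se_preference_label(se_weak: float, se_strong: float, tau: float) -> str:
    """Prefer the model with lower semantic entropy; within tau, call it a tie."""
    diff = se_weak - se_strong
    if abs(diff) <= tau:
        return "tie"
    return "strong" if diff > 0 else "weak"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class KNNRouter:
    """Toy kNN router: estimates how often the strong model was preferred
    among the k training queries most similar to the new query."""

    WIN = {"strong": 1.0, "tie": 0.5, "weak": 0.0}

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.items: List[Tuple[Sequence[float], str]] = []

    def fit(self, embeddings: Sequence[Sequence[float]], labels: Sequence[str]) -> None:
        self.items = list(zip(embeddings, labels))

    def strong_win_prob(self, query: Sequence[float]) -> float:
        if not self.items:
            raise ValueError("router has no training data")
        nearest = sorted(self.items, key=lambda it: cosine(query, it[0]), reverse=True)
        nearest = nearest[: self.k]
        return sum(self.WIN[label] for _, label in nearest) / len(nearest)

    def route(self, query: Sequence[float], threshold: float = 0.5) -> str:
        return "cloud" if self.strong_win_prob(query) >= threshold else "device"


# Toy training set: (embedding, SE of weak model, SE of strong model)
train = [
    ([1.0, 0.0], 1.20, 0.10),
    ([0.9, 0.1], 0.90, 0.20),
    ([0.0, 1.0], 0.05, 0.40),
    ([0.1, 0.9], 0.30, 0.20),
]
tau = 0.15
labels = [se_preference_label(w, s, tau) for _, w, s in train]

router = KNNRouter(k=2)
router.fit([e for e, _, _ in train], labels)
print(labels)
print(router.route([1.0, 0.05]), router.route([0.05, 1.0]))
```

Sweeping `threshold` in `route` moves the operating point along the cost/quality curve, and sweeping `tau` changes how many training pairs are treated as ties. Both are worth tuning against the CPT(50%) and CPT(80%) targets.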

Optimization Features

Infra Optimization

  • Reduces API call volume

Model Optimization

  • Model Routing

System Optimization

  • Edge-cloud offloading decision

Training Optimization

  • SE-based preference labeling

Inference Optimization

  • Cost-aware routing
  • Efficient Inference

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation is limited to text queries; multimodal routing not studied.
  • Computational overhead and latency of router architectures are not analyzed.
  • SE depends on clustering and an entailment classifier; errors there affect routing.

When Not To Use

  • When inputs are multimodal (images + text) and SE has not been validated for those modalities.
  • When router compute or latency would negate savings from fewer cloud calls.

Failure Modes

  • Mis-clustering by the entailment classifier produces wrong SE values and misrouted queries.
  • Threshold tau miscalibration leads to too many or too few cloud calls.
  • LLM-as-a-judge scores may be biased and may not match real human preferences in production.

Core Entities

Models

  • GPT-4
  • Mixtral-8x7B
  • GPT-o1
  • DeBERTa-large

Metrics

  • CPT(50%)
  • CPT(80%)
  • LLM-as-a-Judge score
  • USD API cost

Datasets

  • NaturalQA
  • TriviaQA
  • PopQA
  • MAWPS
  • MMLU
  • MT-Bench
  • GSM8K
  • Chatbot Arena

Benchmarks

  • MT-Bench
  • GSM8K
  • MMLU