Preference-conditioned bandit routing that picks the most cost-effective LLM per query

Overview

Decision Snapshot: Needs Validation

The method uses well-known building blocks (IRT, PPO, SetTransformer) and public benchmarks. Experiments show consistent cost gains across datasets, but improvements are benchmark-scoped and there remains a gap to an oracle policy.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Yang Li

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining routing logic, onboard new models with 20–50 tests, and add routing with only milliseconds of overhead.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper frames per-query LLM selection as a multi-objective bandit: pick models to maximize quality while minimizing cost. It learns compact model identity vectors (via a variational IRT on prompt embeddings), trains a single preference-conditioned stochastic routing policy (PPO + SetTransformer) that generalizes across model sets and user trade-offs, and supports fast cold-starts by “quizzing” a new model on 20–50 selected prompts. On public benchmarks (HELM, AlpacaEval, OpenLLM), the system cuts evaluated inference cost up to ~27% versus prior routing while keeping similar accuracy. Routing adds ≈5ms and <100MB memory.

Problem Statement

Choosing an LLM per query requires trading accuracy against invocation cost, adapting to different user preferences, and onboarding new models quickly. Existing ensembles or cascades either raise cost/latency or need retraining and don’t generalize to changing model pools.

Main Contribution

Formulate LLM selection as a multi-objective contextual bandit conditioned on user preferences.

Learn compact model identity vectors using a variational Item Response Theory (IRT) model over prompt embeddings.
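
A minimal sketch of how an IRT-style score predictor could combine a model identity vector with prompt parameters. It assumes a multidimensional two-parameter logistic form, which this summary does not confirm; the paper's variational model and the way it derives prompt parameters from embeddings may differ. The names `predict_success`, `model_vec`, `prompt_vec`, and `difficulty`, and all numeric values, are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_success(model_vec, prompt_vec, difficulty):
    """Probability that a model answers a prompt correctly.

    Assumed form (not confirmed by the source):
        sigmoid(dot(model_vec, prompt_vec) - difficulty)
    """
    if len(model_vec) != len(prompt_vec):
        raise ValueError("model_vec and prompt_vec must have the same dimension")
    logit = sum(m * p for m, p in zip(model_vec, prompt_vec)) - difficulty
    return sigmoid(logit)


# Hypothetical 3-dimensional identity vectors for two models.
strong_model = [1.2, 0.8, 1.0]
weak_model = [0.3, 0.1, 0.4]
# Stand-in for prompt parameters derived from a prompt embedding.
prompt = [1.0, 0.5, 0.5]

p_strong = predict_success(strong_model, prompt, difficulty=1.0)
p_weak = predict_success(weak_model, prompt, difficulty=1.0)
```

Under these made-up numbers the stronger model gets the higher predicted success probability for the same prompt, which is the signal a router would weigh against cost.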

Key Findings

Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.

Numbers: Up to 27% cost reduction (e.g., MMLU) and 11% reduction on AlpacaEval GPT4/Mixtral setting

Practical Use: Deploy the routing policy to cut LLM spend on mixed workloads while keeping model quality on par with baselines.

Evidence Ref: Results; Fig.2 and text (AlpacaEval, MMLU comparisons)

New models can be integrated with a small quizzing budget and still get near-full performance.

Numbers: Identity vectors estimated from 20–50 prompts; integration overhead reduced ≈90%

Practical Use: Onboard new LLMs in minutes by evaluating only ~20–50 targeted prompts instead of full benchmarks.

Evidence Ref: Section 2.6 and Fig.3 (cold-start experiments)
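
One way to picture the cold-start step is as a small maximum-likelihood fit: given pass/fail outcomes on a few dozen quiz prompts, estimate the new model's identity vector. The sketch below assumes a logistic IRT-style likelihood and plain gradient ascent with L2 regularization; the paper's actual estimator and prompt-selection strategy are not described here and may differ. All names (`fit_identity_vector`, `mean_log_likelihood`, `true_vec`) and the simulated data are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_log_likelihood(vec, quiz):
    """Average Bernoulli log-likelihood of quiz outcomes under vec."""
    total = 0.0
    for prompt_vec, difficulty, outcome in quiz:
        p = sigmoid(sum(v * x for v, x in zip(vec, prompt_vec)) - difficulty)
        total += math.log(p) if outcome == 1 else math.log(1.0 - p)
    return total / len(quiz)


def fit_identity_vector(quiz, dim, steps=500, lr=0.1, l2=0.01):
    """Fit an identity vector by gradient ascent on the regularized likelihood.

    quiz: list of (prompt_vec, difficulty, outcome) with outcome in {0, 1}.
    """
    vec = [0.0] * dim
    for _ in range(steps):
        grad = [-l2 * v for v in vec]  # gradient of the L2 penalty
        for prompt_vec, difficulty, outcome in quiz:
            logit = sum(v * x for v, x in zip(vec, prompt_vec)) - difficulty
            err = outcome - sigmoid(logit)
            for i, x in enumerate(prompt_vec):
                grad[i] += err * x / len(quiz)
        vec = [v + lr * g for v, g in zip(vec, grad)]
    return vec


# Simulate a 40-prompt quiz (within the 20-50 budget) from a made-up model.
rng = random.Random(0)
true_vec = [1.5, -0.5, 1.0]  # hypothetical; used only to generate outcomes
quiz = []
for _ in range(40):
    prompt_vec = [rng.random() for _ in range(3)]
    difficulty = rng.uniform(0.0, 1.0)
    logit = sum(t * x for t, x in zip(true_vec, prompt_vec)) - difficulty
    outcome = 1 if rng.random() < sigmoid(logit) else 0
    quiz.append((prompt_vec, difficulty, outcome))

estimated_vec = fit_identity_vector(quiz, dim=3)
```

With so few observations the estimate is noisy, which is why the regularization term is included and why unrepresentative quiz prompts can hurt routing quality.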

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cost reduction vs RouteLLM	up to 27% lower cost on MMLU	RouteLLM	27% lower cost	MMLU	Main results text and Fig.2	Results section; Fig.2
AlpacaEval GPT4/Mixtral trade-off	46.35% accuracy at $31	RouteLLM $35	11% cost reduction	AlpacaEval 2.0 (GPT4/Mixtral)	Results paragraph	Results section; Fig.2
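
The 11% delta in the AlpacaEval row is consistent with the relative difference between the two dollar figures, assuming both costs are measured at comparable accuracy (the row states accuracy only for the $31 operating point):

```latex
\[
\frac{\$35 - \$31}{\$35} = \frac{4}{35} \approx 0.114 \approx 11\%
\]
```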

What To Try In 7 Days

Compute identity vectors for your current LLM pool using a small benchmark subset (20–50 prompts per model).

Implement a simple preference-conditioned selector that takes predicted scores and per-model cost as input and routes each query to one model.

A/B test routing vs your current policy and measure cost per successful response and latency impact (expect ~5ms overhead).
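
As a starting point for the selector step above, here is a minimal sketch that scores each model by a preference-weighted trade-off between predicted quality and cost. It is a deterministic greedy scalarization, not the paper's stochastic PPO policy, and the function name `select_model`, the linear utility, the cost normalization, and the model names and numbers are all illustrative assumptions.

```python
def select_model(predicted_quality, cost, quality_weight):
    """Pick the model with the best preference-weighted utility.

    predicted_quality: dict of model name -> predicted score in [0, 1]
    cost: dict of model name -> fixed per-query cost (any consistent unit)
    quality_weight: user preference in [0, 1]; 1.0 = quality only, 0.0 = cost only
    """
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must be in [0, 1]")
    if not predicted_quality:
        raise ValueError("need at least one candidate model")

    # Normalize cost to [0, 1] so it is comparable with the quality scores.
    max_cost = max(cost[m] for m in predicted_quality)

    def utility(model):
        normalized_cost = cost[model] / max_cost if max_cost > 0 else 0.0
        return (quality_weight * predicted_quality[model]
                - (1.0 - quality_weight) * normalized_cost)

    return max(predicted_quality, key=utility)


# Placeholder model names, scores, and costs for illustration only.
quality = {"large": 0.90, "medium": 0.80, "small": 0.55}
cost = {"large": 30.0, "medium": 6.0, "small": 0.5}

quality_first = select_model(quality, cost, quality_weight=1.0)
balanced = select_model(quality, cost, quality_weight=0.7)
cost_first = select_model(quality, cost, quality_weight=0.0)
```

With these placeholder numbers the choice moves from the large model to the medium and then the small one as the weight on quality drops, which is the behavior to verify in an A/B test before trusting any cost savings.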

Agent Features

Frameworks

multi-objective PPO

variational IRT score predictor

Architectures

stochastic preference-conditioned policy

SetTransformer for permutation-invariant context

Optimization Features

System Optimization

cheap routing (≈5ms) and small memory footprint (<100MB)

Training Optimization

supervised pretraining on pairwise wins

on-manifold mixup regularization

reward normalization across model sets

Inference Optimization

single-model routing (no cascade) to avoid extra inferences

action-space aware policy for arbitrary model sets

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://crfm.stanford.edu/helm/
https://tatsu-lab.github.io/alpaca_eval/
https://huggingface.co/datasets/berkeley-nest/Nectar
https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k

Risks & Boundaries

Limitations

Assumes fixed per-model cost; real costs vary with input length and compute.

Relies on precomputed evaluation scores and offline training data; online adaptation is not implemented.

When Not To Use

When per-query costs vary widely with input length and you cannot approximate cost by a fixed per-model number.

When you need fully online learning from live user feedback (paper uses offline evaluation scores).
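
To judge whether a fixed per-model cost is a reasonable approximation for your traffic, it can help to compute how much per-query cost actually spreads with length. The sketch below assumes a single blended price per 1M input+output tokens, matching the cost metric listed for this paper; the function names, the price, and the token counts are hypothetical.

```python
def query_cost(input_tokens, output_tokens, price_per_million_tokens):
    """Dollar cost of one query under a single blended per-token price."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens + output_tokens) * price_per_million_tokens / 1_000_000


def cost_spread(token_counts, price_per_million_tokens):
    """Return (min_cost, max_cost, max/min ratio) over a sample of queries."""
    costs = [query_cost(i, o, price_per_million_tokens) for i, o in token_counts]
    low, high = min(costs), max(costs)
    ratio = high / low if low > 0 else float("inf")
    return low, high, ratio


# Hypothetical sample of (input_tokens, output_tokens) pairs and a made-up price.
sample = [(200, 100), (4000, 500), (12000, 1500)]
low, high, ratio = cost_spread(sample, price_per_million_tokens=10.0)
```

A large ratio between the most and least expensive queries suggests that a single per-model cost number will misprice many requests, which is the situation described above.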

Failure Modes

Poor identity vectors if the quizzing prompts are unrepresentative, causing bad routing.

Miscalibrated score predictions leading to systematic selection of suboptimal models.
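
Miscalibration can be checked offline before deployment. The sketch below computes a standard binned expected calibration error (ECE) over predicted success probabilities and observed outcomes; it is a generic diagnostic, not something this summary says the paper uses, and the sample data is made up.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE: weighted mean gap between confidence and observed accuracy.

    probs: predicted success probabilities in [0, 1]
    outcomes: observed results, 1 for success and 0 for failure
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece


# Made-up predictions: the 0.9 bucket is right only half the time.
probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
outcomes = [1, 1, 0, 0, 0, 0]
ece = expected_calibration_error(probs, outcomes)
```

A value near zero means predicted scores track observed success rates; a large value on held-out prompts is a warning that the router may systematically favor the wrong models.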

Core Entities

Models

gpt-4, gpt-3.5-turbo, mixtral-8x7b, claude-3-opus, mistral-7b, llama-3-8b

Metrics

Accuracy, pairwise win rate, cost ($ per 1M input+output tokens)

Datasets

HELM-Lite, HELM-MMLU, AlpacaEval 2.0, OpenLLM Leaderboard, OpenLLMv2, MT-Bench, Nectar, Chatbot Arena

Benchmarks

HELM-Lite, AlpacaEval 2.0, OpenLLM Leaderboard, MMLU, MT-Bench

Context Entities

Models

gpt-4o, claude-2.1, mistral-medium, mixtral-8x22b, llama-3-70b

Datasets

OpenLLM Leaderboard v2, RouteLLM synthetic comparisons

