Preference-conditioned bandit routing that picks the most cost-effective LLM per query

February 4, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method uses well-known building blocks (IRT, PPO, SetTransformer) and public benchmarks. Experiments show consistent cost gains across datasets, but improvements are benchmark-scoped and there remains a gap to an oracle policy.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 70%

Authors

Yang Li

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining routing logic, onboard new models with 20–50 tests, and add routing with only milliseconds of overhead.

Who Should Care

Summary TLDR

This paper frames per-query LLM selection as a multi-objective bandit: pick the model that maximizes quality while minimizing cost. It learns compact model identity vectors (via a variational IRT model on prompt embeddings), trains a single preference-conditioned stochastic routing policy (PPO + SetTransformer) that generalizes across model sets and user trade-offs, and supports fast cold-starts by “quizzing” a new model on 20–50 selected prompts. On public benchmarks (HELM, AlpacaEval, OpenLLM), the system cuts evaluated inference cost by up to ~27% versus prior routing while keeping similar accuracy. Routing adds ≈5 ms of latency and uses <100 MB of memory.

Problem Statement

Choosing an LLM per query requires trading accuracy against invocation cost, adapting to different user preferences, and onboarding new models quickly. Existing ensembles or cascades either raise cost/latency or need retraining and don’t generalize to changing model pools.

Main Contribution

Formulate LLM selection as a multi-objective contextual bandit conditioned on user preferences.

Learn compact model identity vectors using a variational Item Response Theory (IRT) model over prompt embeddings.
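One way to write the two contributions above in symbols, as a sketch under stated assumptions: the linear scalarization with a single weight and the two-parameter logistic IRT form are common textbook choices, not formulas quoted from the paper, and every symbol below is introduced here for illustration only.

```latex
% Assumed notation (not from the paper):
%   x = query, m = model in the candidate set \mathcal{M},
%   s(x, m) = quality score, c(m) = fixed per-model cost,
%   \omega = user preference weight, \theta_m = model identity vector,
%   a(x), b(x) = prompt-side parameters derived from the prompt embedding.
\begin{aligned}
r_{\omega}(x, m) &= \omega \, s(x, m) - (1 - \omega) \, c(m), \qquad \omega \in [0, 1] \\
\pi^{*}_{\omega} &= \arg\max_{\pi} \; \mathbb{E}_{x,\; m \sim \pi(\cdot \mid x, \omega, \mathcal{M})} \left[ r_{\omega}(x, m) \right] \\
\hat{s}(x, m) &= \sigma\!\left( a(x)^{\top} \theta_{m} - b(x) \right)
\end{aligned}
```

Read this way, a larger weight favors quality, a smaller weight favors cost, and the policy sees the candidate set as part of its input rather than as a fixed output layer.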

Key Findings

Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.

Numbers: Up to 27% cost reduction (e.g., MMLU) and 11% reduction on AlpacaEval GPT4/Mixtral setting

Practical Use: Deploy the routing policy to cut LLM spend on mixed workloads while keeping model quality on par with baselines.

Evidence Ref: Results; Fig. 2 and text (AlpacaEval, MMLU comparisons)

New models can be integrated with a small quizzing budget and still get near-full performance.

Numbers: Identity vectors estimated from 20–50 prompts; integration overhead reduced by ≈90%

Practical Use: Onboard new LLMs in minutes by evaluating only ~20–50 targeted prompts instead of full benchmarks.

Evidence Ref: Section 2.6 and Fig. 3 (cold-start experiments)
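The cold-start idea can be illustrated with a small sketch: given prompt-side parameters that are already known, fit a new model's identity vector from pass/fail outcomes on a handful of quiz prompts. This is a point-estimate logistic fit by gradient ascent, used as a simpler stand-in for the paper's variational IRT. The function `fit_identity_vector`, its arguments, and the toy quiz data are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_identity_vector(prompt_vecs, difficulties, outcomes, dim,
                        steps=500, lr=0.5, l2=0.01):
    """Fit theta so that sigmoid(a . theta - b) matches observed quiz outcomes.

    prompt_vecs:  per-prompt vectors a (assumed known from earlier training)
    difficulties: per-prompt scalars b (assumed known)
    outcomes:     1 if the new model answered correctly, else 0
    """
    theta = [0.0] * dim
    n = len(outcomes)
    for _ in range(steps):
        # Gradient of the mean log-likelihood minus an L2 penalty.
        grad = [-l2 * t for t in theta]
        for a, b, y in zip(prompt_vecs, difficulties, outcomes):
            p = sigmoid(sum(ai * ti for ai, ti in zip(a, theta)) - b)
            for k in range(dim):
                grad[k] += (y - p) * a[k] / n
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Toy quiz: 20 prompts, the new model passes the "skill 0" prompts
# and fails the "skill 1" prompts.
quiz_vecs = [[1.0, 0.0], [0.0, 1.0]] * 10
quiz_difficulty = [0.0] * 20
quiz_outcomes = [1, 0] * 10
theta_new = fit_identity_vector(quiz_vecs, quiz_difficulty, quiz_outcomes, dim=2)
```

The L2 penalty keeps the estimate finite when a small quiz is perfectly separable, which is likely with only 20–50 prompts.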

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Cost reduction vs RouteLLM | up to 27% lower cost on MMLU | RouteLLM | 27% lower cost | MMLU | Main results text and Fig. 2 | Results section; Fig. 2
AlpacaEval GPT4/Mixtral trade-off | 46.35% accuracy at $31 | RouteLLM $35 | 11% cost reduction | AlpacaEval 2.0 (GPT4/Mixtral) | Results paragraph | Results section; Fig. 2
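The 11% figure in the AlpacaEval row follows directly from the two dollar amounts reported there ($35 for RouteLLM, $31 for this method). A minimal check, with a helper name chosen here for illustration:

```python
def relative_cost_reduction(baseline_cost: float, routed_cost: float) -> float:
    """Fraction of the baseline cost saved by the routed policy."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - routed_cost) / baseline_cost


reduction = relative_cost_reduction(35.0, 31.0)
print(f"{reduction:.1%}")  # prints "11.4%"
```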

What To Try In 7 Days

Compute identity vectors for your current LLM pool using a small benchmark subset (20–50 prompts per model).

Implement a simple preference-conditioned selector that inputs predicted scores and per-model cost to route queries.

A/B test routing vs your current policy and measure cost per successful response and latency impact (expect ~5ms overhead).
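The second step above can be prototyped in a few lines. The sketch below is a deterministic, greedy stand-in and not the paper's stochastic PPO policy: it scores each candidate with a linear trade-off between predicted quality and cost normalized by the most expensive model in the pool. The `Candidate` class, `select_model`, the weight convention, and the example numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    predicted_score: float  # predicted quality in [0, 1] for this query
    cost: float             # fixed per-query cost in dollars (assumed known)


def select_model(candidates, quality_weight: float) -> Candidate:
    """Pick the candidate with the best quality/cost trade-off.

    quality_weight = 1.0 ignores cost; quality_weight = 0.0 ignores quality.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must be in [0, 1]")
    max_cost = max(c.cost for c in candidates)

    def utility(c: Candidate) -> float:
        # Normalize cost so both terms live on a comparable [0, 1] scale.
        norm_cost = c.cost / max_cost if max_cost > 0 else 0.0
        return quality_weight * c.predicted_score - (1.0 - quality_weight) * norm_cost

    return max(candidates, key=utility)


# Hypothetical two-model pool.
pool = [Candidate("large", 0.9, 10.0), Candidate("small", 0.7, 1.0)]
quality_first = select_model(pool, quality_weight=0.9)
balanced = select_model(pool, quality_weight=0.5)
```

With this pool, a quality-heavy weight selects the large model and a balanced weight selects the small one, which makes the selector easy to sanity-check before wiring it into an A/B test.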

Agent Features

Frameworks: multi-objective PPO; variational IRT score predictor
Architectures: stochastic preference-conditioned policy; SetTransformer for permutation-invariant context
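To make "permutation-invariant context" concrete, the sketch below uses mean pooling over model identity vectors as a much simpler stand-in for the SetTransformer, and a softmax over per-model logits as the stochastic policy. It works for any number of candidate models, and reordering the models only reorders the output probabilities. All function names and the logit form are assumptions, not the paper's architecture.

```python
import math
import random


def pooled_context(identity_vectors):
    """Mean-pool identity vectors; the result does not depend on model order."""
    n = len(identity_vectors)
    dim = len(identity_vectors[0])
    return [sum(v[k] for v in identity_vectors) / n for k in range(dim)]


def routing_probabilities(identity_vectors, query_vec):
    """Softmax over per-model logits, defined for any candidate set size."""
    ctx = pooled_context(identity_vectors)
    # Logit: how well the model matches the query relative to the pool average.
    logits = [
        sum(q * (vk - ck) for q, vk, ck in zip(query_vec, v, ctx))
        for v in identity_vectors
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_model(identity_vectors, query_vec, rng=random):
    """Sample a model index from the stochastic routing policy."""
    probs = routing_probabilities(identity_vectors, query_vec)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical three-model pool and query.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
probs = routing_probabilities(vectors, query)
```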

Optimization Features

System Optimization: cheap routing (≈5 ms) and small memory footprint (<100 MB)
Training Optimization: supervised pretraining on pairwise wins; on-manifold mixup regularization; reward normalization across model sets
Inference Optimization: single-model routing (no cascade) to avoid extra inferences; action-space aware policy for arbitrary model sets

Risks & Boundaries

Limitations

Assumes fixed per-model cost; real costs vary with input length and compute.

Relies on precomputed evaluation scores and offline training data; online adaptation is not implemented.
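To see how far a fixed per-model cost can drift from the real cost, it helps to compute per-query cost from token counts. The sketch below assumes separate input and output prices quoted per 1M tokens; the function name and the prices in the example are hypothetical.

```python
def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one query given token counts and per-1M-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Same hypothetical model and prices, two very different queries.
short_query = query_cost_usd(200, 100, input_price_per_m=10.0, output_price_per_m=30.0)
long_query = query_cost_usd(8000, 1500, input_price_per_m=10.0, output_price_per_m=30.0)
cost_ratio = long_query / short_query
```

If the ratio between long and short queries in your traffic is large, a single per-model cost number is a poor approximation and the routing trade-off should be re-evaluated with length-aware costs.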

When Not To Use

When per-query costs vary widely with input length and you cannot approximate cost by a fixed per-model number.

When you need fully online learning from live user feedback (paper uses offline evaluation scores).

Failure Modes

Poor identity vectors if the quizzing prompts are unrepresentative, causing bad routing.

Miscalibrated score predictions leading to systematic selection of suboptimal models.
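Miscalibration can be checked before deployment by comparing predicted scores with observed outcomes on held-out prompts. The sketch below computes a standard binned expected calibration error (ECE); this diagnostic is a suggestion made here rather than something the paper reports, and the example data is synthetic.

```python
def expected_calibration_error(predicted, outcomes, n_bins: int = 10) -> float:
    """Weighted mean gap between predicted score and observed accuracy per bin."""
    if not predicted or len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    n = len(predicted)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_p - accuracy)
    return ece


# Synthetic examples: the predictor says 0.9 for ten prompts.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A value near zero means predicted scores track observed accuracy; a large value for one model suggests the router may systematically over- or under-select it.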

Core Entities

Models

gpt-4, gpt-3.5-turbo, mixtral-8x7b, claude-3-opus, mistral-7b, llama-3-8b

Metrics

Accuracy, pairwise win rate, cost ($ per 1M input+output tokens)

Datasets

HELM-Lite, HELM-MMLU, AlpacaEval 2.0, OpenLLM Leaderboard, OpenLLMv2, MT-Bench, Nectar, Chatbot Arena

Benchmarks

HELM-Lite, AlpacaEval 2.0, OpenLLM Leaderboard, MMLU, MT-Bench

Context Entities

Models

gpt-4o, claude-2.1, mistral-medium, mixtral-8x22b, llama-3-70b

Datasets

OpenLLM Leaderboard v2, RouteLLM synthetic comparisons