Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining the routing logic, onboard a new model with 20–50 test prompts, and add routing with only milliseconds of overhead.
Summary TLDR
This paper frames per-query LLM selection as a multi-objective contextual bandit: pick a model for each query to maximize quality while minimizing cost. It learns compact model identity vectors (via a variational IRT model on prompt embeddings), trains a single preference-conditioned stochastic routing policy (PPO + SetTransformer) that generalizes across model sets and user trade-offs, and supports fast cold-starts by “quizzing” a new model on 20–50 selected prompts. On public benchmarks (HELM, AlpacaEval, OpenLLM), the system cuts inference cost by up to ~27% relative to prior routing while keeping similar accuracy. Routing adds ≈5 ms of latency and uses <100 MB of memory.
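To make the identity-vector idea concrete, here is a minimal sketch of an IRT-style score predictor. It is an illustration under assumptions, not the paper's variational model: the names `predict_success`, `model_vec`, `prompt_vec`, and `difficulty` are hypothetical, and the prompt-side vector stands in for whatever the paper derives from the prompt embedding.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_success(model_vec, prompt_vec, difficulty=0.0):
    """IRT-style probability that a model handles a prompt well.

    The logit is the dot product of the model identity vector and a
    prompt-side vector, minus a prompt difficulty term (all assumed forms).
    """
    logit = sum(m * p for m, p in zip(model_vec, prompt_vec)) - difficulty
    return sigmoid(logit)

# Hypothetical two-dimensional identity vectors for a strong and a weak model.
strong_model = [1.5, 1.0]
weak_model = [0.2, 0.1]
prompt = [1.0, 1.0]

p_strong = predict_success(strong_model, prompt, difficulty=1.0)
p_weak = predict_success(weak_model, prompt, difficulty=1.0)
```

A router can then compare these predicted scores against per-model cost instead of calling every model.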
Problem Statement
Choosing an LLM per query requires trading accuracy against invocation cost, adapting to different user preferences, and onboarding new models quickly. Existing ensembles or cascades either raise cost/latency or need retraining and don’t generalize to changing model pools.
Main Contribution
Formulate LLM selection as a multi-objective contextual bandit conditioned on user preferences.
Learn compact model identity vectors using a variational Item Response Theory (IRT) model over prompt embeddings.
Train a single preference-conditioned, action-space-aware stochastic policy that generalizes to arbitrary model sets and user trade-offs.
Provide a fast cold-start method: compute a new model's identity vector from 20–50 discriminative prompts, cutting integration overhead by ≈90%.
Give a practical training recipe: supervised pretraining on pairwise comparisons, reward normalization, on-manifold mixup, and multi-objective PPO.
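The core trade-off can be sketched as a stochastic selector over a scalarized utility. This is a hand-written stand-in, not the paper's trained PPO policy: it assumes a scalar preference `omega` in [0, 1] rather than a preference vector, and the function names, model names, costs, and `temperature` value are all hypothetical.

```python
import math
import random

def routing_probs(pred_quality, cost, omega, temperature=0.1):
    """Softmax over omega * quality - (1 - omega) * normalized cost.

    Works for any set of candidate models, since it only reads the dict keys.
    """
    names = list(pred_quality)
    max_cost = max(cost[n] for n in names) or 1.0  # avoid division by zero
    utils = [omega * pred_quality[n] - (1.0 - omega) * cost[n] / max_cost
             for n in names]
    top = max(utils)  # subtract the max for numerical stability
    weights = [math.exp((u - top) / temperature) for u in utils]
    total = sum(weights)
    return {n: w / total for n, w in zip(names, weights)}

def route(pred_quality, cost, omega, rng=random):
    """Sample a single model to call for this query."""
    probs = routing_probs(pred_quality, cost, omega)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]

# Hypothetical predicted quality and cost (USD per 1M tokens).
quality = {"large-model": 0.9, "small-model": 0.7}
cost = {"large-model": 30.0, "small-model": 1.0}

quality_first = routing_probs(quality, cost, omega=1.0)
cost_aware = routing_probs(quality, cost, omega=0.2)
```

With `omega=1.0` the selector favors the stronger model; lowering `omega` shifts probability toward the cheaper one without any retraining.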
Key Findings
Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.
New models can be integrated with a small quizzing budget and still get near-full performance.
Routing adds negligible runtime and memory cost.
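A rough sketch of what quizzing a new model might look like: fit its identity vector from pass/fail outcomes on a small prompt set by regularized logistic regression. This is an assumption-laden illustration, not the paper's procedure; it does not select discriminative prompts, and `fit_identity_vector`, the toy prompt vectors, and the hyperparameters are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_identity_vector(prompt_vecs, outcomes, dim, lr=0.5, steps=500, l2=0.01):
    """Gradient ascent on the L2-regularized logistic log-likelihood."""
    theta = [0.0] * dim
    n = len(prompt_vecs)
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(prompt_vecs, outcomes):
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j in range(dim):
                grad[j] += (y - p) * x[j]
        theta = [t + lr * (g / n - l2 * t) for t, g in zip(theta, grad)]
    return theta

# Toy quiz of 24 prompts: 12 exercise skill A, 12 exercise skill B.
quiz_prompts = [[1.0, 0.0]] * 12 + [[0.0, 1.0]] * 12
# The new model passes 10/12 on skill A and 3/12 on skill B.
quiz_outcomes = [1] * 10 + [0] * 2 + [1] * 3 + [0] * 9

theta = fit_identity_vector(quiz_prompts, quiz_outcomes, dim=2)
```

The fitted vector ends up positive on the skill the model handles well and negative on the other, which is the signal a router needs to place the new model relative to the existing pool.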
Results
Cost reduction vs RouteLLM
AlpacaEval GPT4/Mixtral trade-off
Cold-start quizzing budget
Who Should Care
What To Try In 7 Days
Compute identity vectors for your current LLM pool using a small benchmark subset (20–50 prompts per model).
Implement a simple preference-conditioned selector that inputs predicted scores and per-model cost to route queries.
A/B test routing against your current policy and measure cost per successful response and latency impact (expect ~5 ms of overhead).
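For the A/B comparison, cost per successful response can be computed from logged outcomes as sketched below. The `Outcome` record, its field names, and the dollar figures are hypothetical; how "success" is judged is left to your own evaluation.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    cost_usd: float   # what this call cost
    success: bool     # whether the response met your quality bar

def cost_per_success(outcomes):
    """Total spend divided by the number of successful responses."""
    total = sum(o.cost_usd for o in outcomes)
    wins = sum(1 for o in outcomes if o.success)
    return float("inf") if wins == 0 else total / wins

# Hypothetical logs: the baseline always calls the expensive model,
# while the routed arm sends three of four queries to a cheaper one.
baseline = [Outcome(0.03, True), Outcome(0.03, True),
            Outcome(0.03, True), Outcome(0.03, False)]
routed = [Outcome(0.03, True), Outcome(0.002, True),
          Outcome(0.002, True), Outcome(0.002, False)]

baseline_cps = cost_per_success(baseline)
routed_cps = cost_per_success(routed)
```

Comparing the two values at equal success counts isolates the cost effect of routing from any change in quality.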
Agent Features
Frameworks
- multi-objective PPO
- variational IRT score predictor
Architectures
- stochastic preference-conditioned policy
- SetTransformer for permutation-invariant context
Optimization Features
System Optimization
- cheap routing (≈5 ms) and small memory footprint (<100 MB)
Training Optimization
- supervised pretraining on pairwise wins
- on-manifold mixup regularization
- reward normalization across model sets
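The summary does not say how rewards are normalized across model sets. One plausible reading, shown here purely as an assumption, is min-max scaling within each candidate set so that rewards are comparable when the pool changes. The function name and values are hypothetical.

```python
def normalize_rewards(rewards):
    """Min-max scale raw rewards within one candidate model set to [0, 1]."""
    lo, hi = min(rewards.values()), max(rewards.values())
    if hi == lo:
        # All models tie, so no model should be preferred.
        return {name: 0.0 for name in rewards}
    return {name: (r - lo) / (hi - lo) for name, r in rewards.items()}

# Hypothetical raw rewards for one query over a three-model pool.
raw = {"model-a": 2.0, "model-b": 4.0, "model-c": 3.0}
scaled = normalize_rewards(raw)
```

Scaling per set keeps the reward range stable whether the pool contains two models or twenty, which matters when one policy is trained over many different pools.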
Inference Optimization
- single-model routing (no cascade) to avoid extra inferences
- action-space aware policy for arbitrary model sets
Reproducibility
Data Urls
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Assumes fixed per-model cost; real costs vary with input length and compute.
- Relies on precomputed evaluation scores and offline training data; online adaptation is not implemented.
- Preference vector ω is numeric and may be unintuitive for non-expert users.
- Benchmarks may not capture all real-world query distributions or multi-turn contexts.
When Not To Use
- When per-query costs vary widely with input length and you cannot approximate cost by a fixed per-model number.
- When you need fully online learning from live user feedback (paper uses offline evaluation scores).
- For extremely latency-sensitive systems that cannot tolerate extra routing steps, however small.
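To judge whether a fixed per-model cost is a fair approximation for your traffic, you can estimate per-query cost from token counts and look at the spread. The prices and token counts below are hypothetical placeholders, not figures from the paper.

```python
def query_cost_usd(input_tokens, output_tokens,
                   usd_per_million_input, usd_per_million_output):
    """Cost of one call given token counts and per-million-token prices."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000

# Hypothetical prices: $10 per 1M input tokens, $30 per 1M output tokens.
short_query = query_cost_usd(200, 100, 10.0, 30.0)
long_query = query_cost_usd(2000, 500, 10.0, 30.0)
spread = long_query / short_query
```

If the spread between typical short and long queries is large, a single per-model cost number will misprice many routing decisions.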
Failure Modes
- Poor identity vectors if the quizzing prompts are unrepresentative, causing bad routing.
- Miscalibrated score predictions leading to systematic selection of suboptimal models.
- Routing may amplify biases in the underlying models if it systematically favors cheaper but more biased ones.
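Miscalibration can be checked before deployment with a standard expected calibration error (ECE) computation over held-out predictions. This is a generic diagnostic, not something the paper reports; the bin count and the toy data are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        confidence = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(confidence - accuracy)
    return ece

# Toy data: the predictor always says 0.9.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A high ECE for a particular model suggests its predicted scores should be recalibrated, or its identity vector re-estimated, before the router relies on them.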
Core Entities
Models
- gpt-4
- gpt-3.5-turbo
- mixtral-8x7b
- claude-3-opus
- mistral-7b
- llama-3-8b
Metrics
- Accuracy
- pairwise win rate
- cost ($ per 1M input+output tokens)
Datasets
- HELM-Lite
- HELM-MMLU
- AlpacaEval 2.0
- OpenLLM Leaderboard
- OpenLLMv2
- MT-Bench
- Nectar
- Chatbot Arena
Benchmarks
- HELM-Lite
- AlpacaEval 2.0
- OpenLLM Leaderboard
- MMLU
- MT-Bench
Context Entities
Models
- gpt-4o
- claude-2.1
- mistral-medium
- mixtral-8x22b
- llama-3-70b
Datasets
- OpenLLM Leaderboard v2
- RouteLLM synthetic comparisons

