Overview
The method combines well-known building blocks (IRT, PPO, SetTransformer) and is evaluated on public benchmarks. Experiments show consistent cost reductions across datasets, but the improvements are benchmark-scoped and a gap to an oracle policy remains.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
You can cut model invocation spend substantially (up to ~27% on benchmarks) without retraining the routing logic, onboard new models with 20–50 test prompts, and add routing with only milliseconds of overhead (≈5 ms).
Summary TLDR
This paper frames per-query LLM selection as a multi-objective contextual bandit: pick models to maximize quality while minimizing cost. It learns compact model identity vectors (via a variational IRT model on prompt embeddings), trains a single preference-conditioned stochastic routing policy (PPO + SetTransformer) that generalizes across model sets and user trade-offs, and supports fast cold starts by "quizzing" a new model on 20–50 selected prompts. On public benchmarks (HELM, AlpacaEval, OpenLLM), the system cuts evaluated inference cost by up to ~27% versus prior routing while keeping similar accuracy. Routing adds ≈5 ms of latency and uses <100 MB of memory.
Problem Statement
Choosing an LLM per query requires trading accuracy against invocation cost, adapting to different user preferences, and onboarding new models quickly. Existing ensembles or cascades either raise cost/latency or need retraining and don’t generalize to changing model pools.
Main Contribution
Formulate LLM selection as a multi-objective contextual bandit conditioned on user preferences.
Learn compact model identity vectors using a variational Item Response Theory (IRT) model over prompt embeddings.
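To make the identity-vector idea concrete, here is a minimal sketch of an IRT-style predictor. It assumes a multidimensional two-parameter logistic form, where the success probability is a sigmoid of the dot product between a model vector and a prompt-side vector, minus a prompt difficulty. This is an illustration only: the paper's variational model is not reproduced, and the names `theta`, `discrimination`, and `difficulty` are assumptions rather than the authors' notation.

```python
import numpy as np

def irt_success_prob(theta, discrimination, difficulty):
    """Probability that a model answers a prompt correctly (2PL-style IRT sketch).

    theta: (d,) model identity vector (hypothetical name).
    discrimination: (d,) prompt-side vector, e.g. derived from a prompt embedding.
    difficulty: scalar prompt difficulty.
    """
    logit = float(np.dot(discrimination, theta)) - difficulty
    return 1.0 / (1.0 + np.exp(-logit))

# Two hypothetical models scored on the same hypothetical prompt.
theta_strong = np.array([1.5, 0.5])
theta_weak = np.array([0.2, 0.1])
prompt_discrimination = np.array([1.0, 1.0])
prompt_difficulty = 1.0

p_strong = irt_success_prob(theta_strong, prompt_discrimination, prompt_difficulty)
p_weak = irt_success_prob(theta_weak, prompt_discrimination, prompt_difficulty)
```

Under this form, a router only needs a model's identity vector and the prompt-side parameters to predict quality, which is what makes onboarding a new model from a small set of quiz prompts plausible.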
Key Findings
Routing reduces inference cost on the evaluated benchmarks (up to ~27% versus RouteLLM on MMLU) while keeping similar accuracy.
New models can be integrated with a small quizzing budget (20–50 prompts) and still reach near-full performance.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Cost reduction vs RouteLLM | up to 27% lower cost on MMLU | RouteLLM | 27% lower cost | MMLU | Main results text and Fig.2 | Results section; Fig.2 |
| AlpacaEval GPT4/Mixtral trade-off | 46.35% accuracy at $31 | RouteLLM $35 | 11% cost reduction | AlpacaEval 2.0 (GPT4/Mixtral) | Results paragraph | Results section; Fig.2 |
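
The Delta column can be checked from the reported dollar figures. The short sketch below recomputes the AlpacaEval delta from the $35 baseline and $31 routed cost in the table; the helper name `relative_cost_reduction` is an assumption introduced for illustration.

```python
def relative_cost_reduction(baseline_cost, new_cost):
    """Fractional cost saved relative to the baseline."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost

# AlpacaEval 2.0 (GPT4/Mixtral): RouteLLM at $35 vs. the routed policy at $31.
alpaca_delta = relative_cost_reduction(35.0, 31.0)
alpaca_delta_percent = round(alpaca_delta * 100)
```

The result, 4/35, rounds to the 11% shown in the table.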
What To Try In 7 Days
Compute identity vectors for your current LLM pool using a small benchmark subset (20–50 prompts per model).
Implement a simple preference-conditioned selector that takes predicted scores and per-model cost as input and routes each query accordingly.
A/B test routing against your current policy and measure cost per successful response and latency impact (expect ~5 ms overhead).
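The selector and the A/B metric from the steps above can be sketched as follows. This is a deliberately simple stand-in, assuming a linear trade-off between predicted quality and normalized cost; it is not the paper's learned PPO policy. The model names, scores, and costs are hypothetical.

```python
def route(predicted_scores, costs, quality_weight):
    """Pick the model with the best weighted quality/cost trade-off.

    predicted_scores: dict of model name -> predicted quality in [0, 1].
    costs: dict of model name -> fixed per-call cost.
    quality_weight: user preference in [0, 1]; 1.0 means quality only.
    """
    if not 0.0 <= quality_weight <= 1.0:
        raise ValueError("quality_weight must be in [0, 1]")
    max_cost = max(costs.values())

    def utility(model):
        # Normalize cost so it is comparable to the [0, 1] quality score.
        norm_cost = costs[model] / max_cost if max_cost > 0 else 0.0
        return quality_weight * predicted_scores[model] - (1.0 - quality_weight) * norm_cost

    return max(predicted_scores, key=utility)

def cost_per_success(total_cost, successful_responses):
    """A/B metric: total spend divided by the number of successful responses."""
    if successful_responses <= 0:
        raise ValueError("need at least one successful response")
    return total_cost / successful_responses

# Hypothetical two-model pool.
scores = {"large-model": 0.90, "small-model": 0.70}
costs = {"large-model": 10.0, "small-model": 1.0}
quality_first = route(scores, costs, quality_weight=0.9)
cost_first = route(scores, costs, quality_weight=0.3)
```

With these made-up numbers, a quality-heavy preference selects the larger model and a cost-heavy preference selects the smaller one, which is the behavior to verify before running the A/B test.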
Risks & Boundaries
Limitations
Assumes fixed per-model cost; real costs vary with input length and compute.
Relies on precomputed evaluation scores and offline training data; online adaptation is not implemented.
When Not To Use
When per-query costs vary widely with input length and you cannot approximate cost by a fixed per-model number.
When you need fully online learning from live user feedback (paper uses offline evaluation scores).
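To judge whether the fixed-cost assumption holds for your traffic, it can help to compare a token-based cost estimate across short and long queries. The sketch below assumes simple per-1k-token pricing with separate input and output rates; the prices and token counts are hypothetical and not taken from the paper.

```python
def estimate_query_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Token-based cost estimate for one call (assumed linear pricing)."""
    return (input_tokens / 1000.0) * price_in_per_1k + (output_tokens / 1000.0) * price_out_per_1k

# Hypothetical prices: 0.01 per 1k input tokens, 0.03 per 1k output tokens.
short_query_cost = estimate_query_cost(200, 100, 0.01, 0.03)
long_query_cost = estimate_query_cost(8000, 100, 0.01, 0.03)
cost_spread = long_query_cost / short_query_cost
```

If the spread between typical short and long queries is large for the same model, a single per-model cost number is a poor approximation and the routing results may not transfer.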
Failure Modes
Poor identity vectors if the quizzing prompts are unrepresentative, causing bad routing.
Miscalibrated score predictions leading to systematic selection of suboptimal models.
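One way to watch for the miscalibration failure mode is to compare predicted success probabilities with observed outcomes on held-out prompts. The following is a minimal sketch of expected calibration error (ECE) with equal-width bins. It is a generic diagnostic, not something the paper reports, and the sample data is made up.

```python
import numpy as np

def expected_calibration_error(pred_probs, outcomes, n_bins=10):
    """Weighted average gap between mean predicted probability and observed accuracy per bin."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Equal-width bins over [0, 1]; clip so that p == 1.0 falls in the last bin.
    bin_ids = np.minimum((pred_probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(pred_probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return float(ece)

# Predictions of 0.75 with 3 of 4 correct: well calibrated.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# Predictions of 0.95 with only 2 of 4 correct: overconfident.
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
```

A high ECE for a particular model suggests its predicted scores should be recalibrated before the router relies on them.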

