Preference-conditioned bandit routing that picks the most cost-effective LLM per query

February 4, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

0

Authors

Yang Li

Links

Abstract / PDF

Why It Matters For Business

Routing each query to the most cost-effective model cut inference spend by up to ~27% on benchmarks. New models can be onboarded with 20–50 quiz prompts instead of retraining the router, and routing itself adds only about 5ms of overhead per decision.

Summary TLDR

This paper frames per-query LLM selection as a multi-objective bandit: pick the model that maximizes quality while minimizing cost. It learns compact model identity vectors (via a variational IRT model over prompt embeddings) and trains a single preference-conditioned stochastic routing policy (PPO + SetTransformer) that generalizes across model sets and user trade-offs. It also supports fast cold-starts by “quizzing” a new model on 20–50 selected prompts. On public benchmarks (HELM, AlpacaEval, OpenLLM), the system cuts inference cost by up to ~27% versus prior routing while keeping similar accuracy. Routing adds ≈5ms per decision and uses <100MB of GPU memory.

Problem Statement

Choosing an LLM per query requires trading accuracy against invocation cost, adapting to different user preferences, and onboarding new models quickly. Existing ensembles or cascades either raise cost/latency or need retraining and don’t generalize to changing model pools.

Main Contribution

Formulate LLM selection as a multi-objective contextual bandit conditioned on user preferences.

Learn compact model identity vectors using a variational Item Response Theory (IRT) model over prompt embeddings.

Train a single preference-conditioned, action-space-aware stochastic policy that generalizes to arbitrary model sets and user trade-offs.

Fast cold-start method: compute a new model's identity vector using 20–50 discriminative prompts, cutting integration overhead ≈90%.

Practical training recipe: supervised pretraining on pairwise comparisons, reward normalization, on-manifold mixup, and multi-objective PPO.
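The first two contributions can be written down compactly. The block below is a sketch under assumed notation, not the paper's exact formulation: a standard multidimensional IRT-style score predictor and a linear scalarization of quality and cost are used as stand-ins, and every symbol except ω is introduced here for illustration.

```latex
% Assumed notation (illustrative, not taken from the paper):
%   x          prompt embedding
%   \theta_m   identity vector of model m
%   a(x), b(x) prompt-dependent discrimination vector and difficulty
%   c_m        fixed per-query cost of model m
%   \omega = (\omega_q, \omega_c)  user preference weights
%   \mathcal{M} the set of models currently available
\begin{align}
  \hat{s}_m(x) &= \sigma\!\left( a(x)^\top \theta_m - b(x) \right)
    && \text{predicted quality of model } m \text{ on prompt } x \\
  r_\omega(x, m) &= \omega_q \, \hat{s}_m(x) - \omega_c \, c_m
    && \text{scalarized reward (linear form assumed)} \\
  \pi^\star &= \arg\max_{\pi} \;
    \mathbb{E}_{x,\, \omega,\, m \sim \pi(\cdot \mid x, \omega, \mathcal{M})}
    \left[ r_\omega(x, m) \right]
    && \text{preference-conditioned policy objective}
\end{align}
```

Conditioning the policy on both ω and the model set is what would let a single trained router serve different trade-offs and pools without retraining.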

Key Findings

Routing reduces inference cost on evaluated benchmarks while keeping similar accuracy.

Numbers: Up to 27% cost reduction (e.g., MMLU) and 11% reduction on the AlpacaEval GPT-4/Mixtral setting

New models can be integrated with a small quizzing budget and still get near-full performance.

Numbers: Identity vectors estimated from 20–50 prompts; integration overhead reduced by ≈90%

Routing adds negligible runtime and memory cost.

Numbers: ≈5ms per routing decision and <100MB GPU memory

Results

Cost reduction vs RouteLLM

Value: up to 27% lower cost on MMLU

Baseline: RouteLLM

AlpacaEval GPT4/Mixtral trade-off

Value: 46.35% accuracy at $31

Baseline: RouteLLM at $35

Cold-start quizzing budget

Value: near-full routing performance with 50 prompts

Baseline: full benchmark evaluation
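The cold-start idea can be sketched in a few lines. This is a minimal illustration, not the paper's variational procedure: it assumes prompt discrimination vectors and difficulties are already known, picks quiz prompts by discrimination norm (a heuristic chosen here), and fits the new model's identity vector by MAP estimation in a logistic IRT-style model. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_quiz_prompts(discriminations, budget):
    """Return indices of the `budget` prompts with the largest discrimination norm.

    Assumption: a larger norm means the prompt separates models better.
    """
    norms = [math.sqrt(sum(a * a for a in vec)) for vec in discriminations]
    order = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return order[:budget]


def fit_identity_vector(discriminations, difficulties, outcomes,
                        steps=500, lr=0.1, prior_precision=1.0):
    """MAP estimate of theta under P(correct) = sigmoid(a . theta - b).

    A zero-mean Gaussian prior on theta (assumed) keeps the estimate finite
    when only a few quiz outcomes are available.
    """
    dim = len(discriminations[0])
    theta = [0.0] * dim
    for _ in range(steps):
        # Gradient of the log-posterior: prior term plus one term per quiz prompt.
        grad = [-prior_precision * t for t in theta]
        for a, b, y in zip(discriminations, difficulties, outcomes):
            p = sigmoid(sum(ai * ti for ai, ti in zip(a, theta)) - b)
            for k in range(dim):
                grad[k] += (y - p) * a[k]
        theta = [t + lr * g / len(outcomes) for t, g in zip(theta, grad)]
    return theta
```

In use, one would call `select_quiz_prompts` with a budget of 20–50, run the new model on those prompts, and pass the graded outcomes (1 for correct, 0 for incorrect) to `fit_identity_vector`.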

Who Should Care

What To Try In 7 Days

Compute identity vectors for your current LLM pool using a small benchmark subset (20–50 prompts per model).

Implement a simple preference-conditioned selector that inputs predicted scores and per-model cost to route queries.

A/B test routing vs your current policy and measure cost per successful response and latency impact (expect ~5ms overhead).
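A starting point for the second and third steps might look like the sketch below. It is a hand-written selector, not the paper's learned PPO policy: the linear utility, the single weight `omega` in [0, 1], the cost rescaling, and all names are assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str               # model identifier
    predicted_score: float  # predicted quality in [0, 1], e.g. from a score predictor
    cost: float             # fixed per-query cost in dollars (assumed constant per model)


def utilities(candidates, omega):
    """Scalarize quality and cost with a preference weight omega in [0, 1].

    omega = 1 cares only about quality; omega = 0 cares only about cost.
    Cost is rescaled by the pool maximum so both terms live on a similar scale.
    """
    max_cost = max(c.cost for c in candidates) or 1.0
    return [omega * c.predicted_score - (1.0 - omega) * c.cost / max_cost
            for c in candidates]


def route(candidates, omega, temperature=0.0, rng=None):
    """Pick one model: greedy if temperature <= 0, else sample from a softmax."""
    u = utilities(candidates, omega)
    if temperature <= 0.0:
        return candidates[max(range(len(u)), key=u.__getitem__)]
    rng = rng or random.Random()
    top = max(u)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp((x - top) / temperature) for x in u]
    return rng.choices(candidates, weights=weights, k=1)[0]


def cost_per_success(total_cost, successes):
    """A/B metric: dollars spent per successful response (inf if none succeeded)."""
    return total_cost / successes if successes > 0 else math.inf
```

For example, with a pool of an expensive strong model and a cheap weaker one, `route(pool, omega=1.0)` favors the strong model, while lowering `omega` shifts traffic to the cheap one. Comparing `cost_per_success` between the routed arm and the current policy gives the headline number for the A/B test.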

Agent Features

Frameworks

  • multi-objective PPO
  • variational IRT score predictor

Architectures

  • stochastic preference-conditioned policy
  • SetTransformer for permutation-invariant context

Optimization Features

System Optimization

  • cheap routing (≈5ms) and small memory footprint (<100MB)

Training Optimization

  • supervised pretraining on pairwise wins
  • on-manifold mixup regularization
  • reward normalization across model sets
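Reward normalization across model sets can be illustrated with a small sketch. The summary does not say which normalization is used, so the per-pool z-score below is an assumption, and `normalize_rewards` is a hypothetical helper.

```python
from statistics import mean, pstdev


def normalize_rewards(rewards, eps=1e-8):
    """Z-score the rewards of one model set.

    Different pools can have very different reward scales (for instance when
    one pool contains much more expensive models). Standardizing within each
    pool keeps the policy-gradient signal comparable across pools.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of this pool
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this choice, two pools whose rewards differ only by a constant scale factor produce almost identical normalized rewards, so neither dominates training.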

Inference Optimization

  • single-model routing (no cascade) to avoid extra inferences
  • action-space aware policy for arbitrary model sets

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes fixed per-model cost; real costs vary with input length and compute.
  • Relies on precomputed evaluation scores and offline training data; online adaptation is not implemented.
  • Preference vector ω is numeric and may be unintuitive for non-expert users.
  • Benchmarks may not capture all real-world query distributions or multi-turn contexts.

When Not To Use

  • When per-query costs vary widely with input length and you cannot approximate cost by a fixed per-model number.
  • When you need fully online learning from live user feedback (paper uses offline evaluation scores).
  • For extremely latency-sensitive systems that cannot tolerate extra routing steps, however small.

Failure Modes

  • Poor identity vectors if the quizzing prompts are unrepresentative, causing bad routing.
  • Miscalibrated score predictions leading to systematic selection of suboptimal models.
  • Routing may amplify biases present in underlying models by favoring cheaper but biased models.

Core Entities

Models

  • gpt-4
  • gpt-3.5-turbo
  • mixtral-8x7b
  • claude-3-opus
  • mistral-7b
  • llama-3-8b

Metrics

  • Accuracy
  • pairwise win rate
  • cost ($ per 1M input+output tokens)
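The pairwise win rate and the cost metric could be computed as in the sketch below. The half-credit treatment of ties and the single blended price per 1M tokens are assumptions, and both function names are hypothetical.

```python
def pairwise_win_rate(outcomes):
    """Fraction of comparisons won; a tie counts as half a win (assumed convention).

    `outcomes` is a list of the strings "win", "tie", or "loss".
    """
    credit = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(credit[o] for o in outcomes) / len(outcomes)


def query_cost(input_tokens, output_tokens, price_per_million):
    """Dollar cost of one query at a blended price per 1M input+output tokens."""
    return (input_tokens + output_tokens) * price_per_million / 1_000_000
```

Computing cost per query from token counts, rather than using one fixed number per model, is one way to check how well a fixed per-model cost approximates a real workload.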

Datasets

  • HELM-Lite
  • HELM-MMLU
  • AlpacaEval 2.0
  • OpenLLM Leaderboard
  • OpenLLMv2
  • MT-Bench
  • Nectar
  • Chatbot Arena

Benchmarks

  • HELM-Lite
  • AlpacaEval 2.0
  • OpenLLM Leaderboard
  • MMLU
  • MT-Bench

Context Entities

Models

  • gpt-4o
  • claude-2.1
  • mistral-medium
  • mixtral-8x22b
  • llama-3-70b

Datasets

  • OpenLLM Leaderboard v2
  • RouteLLM synthetic comparisons