Overview
The approach is simple and deployable: embed each input, predict the reward (correctness − p·cost) with a linear model, and use UCB to choose which model to call. Experiments across two datasets and two API providers show consistent cost-accuracy benefits.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
MetaLLM reduces API spend by routing easy queries to cheaper models and hard queries to stronger models. It delivers modest accuracy gains and cost reductions of up to ~60% versus always using the priciest API.
Summary TLDR
MetaLLM is a lightweight wrapper that inspects each input, embeds it with Sentence-BERT, and uses a contextual multi-armed bandit (linear reward + UCB) to pick which LLM to call. The reward trades accuracy against monetary cost (reward = correctness − p·cost). Experiments on SST-2 (OpenAI models) and MMLU (Together AI models) show that MetaLLM can slightly improve accuracy over single-model baselines (≈+0.5–1.3 percentage points) while reducing API spend (up to ~60% vs. the most expensive OpenAI model, ~10% on Together AI). The system supports offline training, online updates, and hybrid modes.
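The routing loop described above can be sketched as a disjoint LinUCB-style bandit with one linear reward model per candidate LLM. This is a minimal illustration, not the paper's implementation: the class name `LinUCBRouter`, the exploration weight `alpha`, and the `ridge` regularizer are assumptions, and the 2-dimensional vector stands in for a real SBERT embedding.

```python
import numpy as np


class LinUCBRouter:
    """Hypothetical sketch: one ridge-regression reward model per LLM."""

    def __init__(self, n_models, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # exploration weight (assumed hyperparameter)
        # Per-model sufficient statistics: A = ridge*I + sum(x x^T), b = sum(r x)
        self.A = [ridge * np.eye(dim) for _ in range(n_models)]
        self.b = [np.zeros(dim) for _ in range(n_models)]

    def select(self, x):
        """Return the index of the model with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # linear reward estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, correct, cost, p):
        """Observe the outcome for the chosen model and update its statistics."""
        reward = float(correct) - p * cost  # reward = correctness - p * cost
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Toy usage: a 2-dimensional stand-in for an SBERT embedding.
router = LinUCBRouter(n_models=2, dim=2)
x = np.array([1.0, 0.0])
first_choice = router.select(x)  # ties are broken toward the first model
router.update(first_choice, x, correct=False, cost=1.0, p=1.0)
second_choice = router.select(x)  # the penalized model now scores lower
```

After one wrong, costly answer from the first model, its estimated reward for this input drops below the untried model's optimistic score, so the router explores the other model next.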
Problem Statement
Users have access to many LLMs with different costs and strengths. Committing to one fixed model wastes either money or accuracy on a per-query basis. The problem: route each query to the LLM that maximizes correctness while respecting a cost budget.
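One way to write this problem down, as a sketch: the symbols below (a routing policy π, per-model correctness a_k, per-model cost c_k, budget B) are notation introduced here for illustration, and the source does not state this derivation. The budgeted problem and its penalized relaxation, which matches the reward form correctness − p·cost used by MetaLLM, would read:

```latex
% Budget-constrained routing (assumed notation)
\max_{\pi} \; \mathbb{E}_{x}\!\left[ a_{\pi(x)}(x) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{x}\!\left[ c_{\pi(x)}(x) \right] \le B

% Penalized form: p >= 0 plays the role of a Lagrange multiplier
\pi_p(x) = \arg\max_{k} \; \bigl( a_k(x) - p \, c_k(x) \bigr)
```

Under this reading, a larger p tightens the effective budget and a smaller p favors accuracy, which is why p can be tuned to hit a target spend.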
Main Contribution
MetaLLM: a general wrapper that routes each query to one LLM chosen from a pool to optimize the accuracy-cost trade-off.
A practical routing algorithm: use SBERT embeddings, linear reward models trained to predict (accuracy − p·cost), and an online contextual bandit with UCB.
Key Findings
MetaLLM outperforms a mid-tier baseline (text-babbage-001) at the same budget: 84.06% vs. 82.80% accuracy on SST-2 (Table 3).
MetaLLM matches or slightly improves on top-model accuracy (91.97% vs. 91.51% for text-davinci-002 on SST-2) while cutting cost by about 58% (2.03 vs. 4.80, Table 3).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | MetaLLM 84.06% | text-babbage-001 82.80% | +1.26 pp | SST-2 test | Table 3: MetaLLM (text-babbage-001 budget) | Table 3 |
| Accuracy / cost | MetaLLM 91.97%; cost 2.03 | text-davinci-002 91.51%; cost 4.80 | +0.46 pp acc; −57.7% cost | SST-2 test | Table 3: MetaLLM (text-davinci-002 budget) | Table 3 |
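The deltas in the table can be re-derived from the reported values. The short check below assumes the accuracy delta is an absolute difference in percentage points and the cost delta is relative to the baseline cost; the helper names are illustrative, and the cost unit is not stated in the source.

```python
def accuracy_gain_pp(acc, baseline_acc):
    """Absolute accuracy difference in percentage points."""
    return round(acc - baseline_acc, 2)


def relative_cost_change_pct(cost, baseline_cost):
    """Cost change relative to the baseline, in percent (negative = cheaper)."""
    return round((cost - baseline_cost) / baseline_cost * 100, 1)


# Values copied from the Results table (SST-2 test, Table 3).
row1_gain = accuracy_gain_pp(84.06, 82.80)
row2_gain = accuracy_gain_pp(91.97, 91.51)
row2_cost = relative_cost_change_pct(2.03, 4.80)
print(row1_gain, row2_gain, row2_cost)
```

The computed values agree with the table, and the 57.7% figure is the basis for the "up to ~60%" cost reduction quoted in the summary.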
What To Try In 7 Days
Run a quick prototype: collect 1–2k labeled samples and compute SBERT embeddings.
Train the linear reward model (reward = correctness − p·cost) and pick p to match your budget.
Deploy MetaLLM in online mode with UCB updates and measure cost/accuracy trade-offs for a week.
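The first two steps above can be prototyped offline along the following lines. This is a sketch under stated assumptions: random vectors stand in for SBERT embeddings, the correctness labels and per-model costs are synthetic, the reward models are fit by ridge regression, and the grid search over p (keeping the smallest p whose validation spend fits the budget) is one plausible way to "pick p", not a procedure taken from the source. All function names are hypothetical.

```python
import numpy as np


def add_bias(X):
    """Append a constant feature so each reward model has an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])


def fit_reward_models(X, correct, cost, p, ridge=1.0):
    """Ridge-regress reward = correctness - p * cost, one column per model."""
    Xb = add_bias(X)
    rewards = correct - p * cost  # (n, K) minus (K,) broadcasts per model
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ rewards)  # weights, shape (d + 1, K)


def route(W, X):
    """Send each query to the model with the highest predicted reward."""
    return np.argmax(add_bias(X) @ W, axis=1)


def evaluate(choice, correct, cost):
    """Mean accuracy and mean per-query cost of a routing decision."""
    idx = np.arange(len(choice))
    return correct[idx, choice].mean(), cost[choice].mean()


def pick_p(X_tr, c_tr, X_val, c_val, cost, budget, grid):
    """Return (p, accuracy, spend) for the smallest p that fits the budget."""
    for p in sorted(grid):
        W = fit_reward_models(X_tr, c_tr, cost, p)
        acc, spend = evaluate(route(W, X_val), c_val, cost)
        if spend <= budget:
            return p, acc, spend
    return None


# Synthetic stand-in data: 400 queries, 8-dim "embeddings", 2 models.
rng = np.random.default_rng(0)
n, d = 400, 8
X = rng.normal(size=(n, d))
cost = np.array([1.0, 5.0])  # cheap model, expensive model
correct = np.column_stack([
    (rng.random(n) < 0.6).astype(float),  # cheap model is right ~60% of the time
    (rng.random(n) < 0.9).astype(float),  # expensive model is right ~90%
])
result = pick_p(X[:300], correct[:300], X[300:], correct[300:],
                cost, budget=3.0, grid=[0.0, 0.05, 0.1, 0.2, 1000.0])
```

The weights found this way could then seed the online stage, which matches the hybrid offline-plus-online mode mentioned in the summary.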
Risks & Boundaries
Limitations
Evaluated only on zero-shot classification and multiple-choice QA, not free-form generation.
Router is a simple linear model on SBERT embeddings; richer features or nonlinear models might improve routing.
When Not To Use
When task output quality needs human judgment or complex free-text evaluation.
When no labeled training or validation data is available and online initialization is undesirable.
Failure Modes
Poor initialization: without good offline data, the online bandit can default to the cheapest model (observed with Gemma).
Dynamic cost penalty: it can push long, complex queries to cheaper models, reducing accuracy.

