Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
MetaLLM reduces API spend by routing easy queries to cheaper models and hard queries to stronger ones, yielding modest accuracy gains and cost reductions of up to ~60% versus always calling the most expensive API.
Summary TLDR
MetaLLM is a lightweight wrapper that inspects each input, embeds it with Sentence-BERT, and uses a contextual multi-armed bandit (linear reward + UCB) to pick which LLM to call. The reward trades accuracy against monetary cost (reward = correctness − p·cost). Experiments on SST-2 (OpenAI models) and MMLU (Together AI models) show MetaLLM can slightly improve accuracy over the best single model (≈+0.5–1.3%) while reducing API spend (up to ~60% vs. the most expensive OpenAI model, ~10% on Together AI). The system supports offline training, online updates, and hybrid modes.
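A minimal sketch of the reward trade-off described above, under stated assumptions: the model names, costs, and correctness flags are hypothetical illustration values, and `best_model` is a helper introduced here, not part of MetaLLM. It shows how the penalty p shifts the preferred model for a single query.

```python
def reward(correct: bool, cost: float, p: float) -> float:
    """Reward from the summary: correctness minus p times cost."""
    return float(correct) - p * cost


def best_model(candidates, p):
    """Return the name of the candidate with the highest reward for one query.

    `candidates` is a list of (name, answered_correctly, normalized_cost) tuples.
    """
    return max(candidates, key=lambda c: reward(c[1], c[2], p))[0]


# Hypothetical "hard" query: only the expensive model answers correctly.
hard_query = [
    ("cheap-model", False, 0.1),
    ("expensive-model", True, 1.0),
]

print(best_model(hard_query, p=0.5))  # small penalty: accuracy dominates
print(best_model(hard_query, p=2.0))  # large penalty: cost dominates
```

With these numbers the expensive model wins whenever 1 − p > −0.1·p, that is, for p below roughly 1.11; above that the cost term outweighs the correct answer.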
Problem Statement
Users have access to many LLMs with different costs and strengths. Picking one fixed model for every query wastes either money or accuracy on a per-query basis. The problem: route each query to the LLM that maximizes correctness while respecting a cost budget.
Main Contribution
MetaLLM: a general wrapper that routes each query to one LLM chosen from a pool to optimize the accuracy-cost trade-off.
A practical routing algorithm: use SBERT embeddings, linear reward models trained to predict (accuracy − p·cost), and an online contextual bandit with UCB.
Empirical validation across OpenAI and Together AI models on SST-2 and MMLU showing modest accuracy gains and substantial cost savings.
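The routing algorithm listed above can be sketched as a LinUCB-style contextual bandit. This is a standard textbook formulation under stated assumptions, not necessarily the authors' exact implementation: the class name `LinUCBRouter`, the `alpha` exploration weight, and the `ridge` regularizer are hypothetical, and a plain Python list stands in for the SBERT embedding of a query. Pure Python keeps the sketch dependency-free; in practice a numerical library would be the usual choice.

```python
import math


class LinUCBRouter:
    """One linear reward model per LLM (arm) plus a UCB exploration bonus."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0, ridge: float = 1.0):
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        # Per arm: inverse of A = ridge*I + sum(x x^T), and b = sum(reward * x).
        self.A_inv = [
            [[(1.0 / ridge if i == j else 0.0) for j in range(dim)] for i in range(dim)]
            for _ in range(n_arms)
        ]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    @staticmethod
    def _dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    def scores(self, x):
        """UCB score per arm: predicted reward + alpha * sqrt(x^T A^-1 x)."""
        out = []
        for A_inv, b in zip(self.A_inv, self.b):
            theta = self._matvec(A_inv, b)  # ridge estimate of the reward weights
            mean = self._dot(theta, x)
            width = self._dot(x, self._matvec(A_inv, x))
            out.append(mean + self.alpha * math.sqrt(max(width, 0.0)))
        return out

    def select(self, x) -> int:
        """Index of the LLM to call for the query with embedding x."""
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)

    def update(self, arm: int, x, reward: float) -> None:
        """Sherman-Morrison rank-one update of A^-1, then accumulate b."""
        A_inv = self.A_inv[arm]
        u = self._matvec(A_inv, x)  # A^-1 x (A^-1 is symmetric)
        denom = 1.0 + self._dot(x, u)
        for i in range(len(x)):
            for j in range(len(x)):
                A_inv[i][j] -= u[i] * u[j] / denom
        self.b[arm] = [bi + reward * xi for bi, xi in zip(self.b[arm], x)]


# Usage: observe a reward (correctness - p * cost) for arm 0, then re-select.
router = LinUCBRouter(n_arms=2, dim=2, alpha=1.0)
x = [1.0, 0.0]  # stand-in for a query embedding
router.update(0, x, reward=0.0)
print(router.select(x))  # the untried arm now has the larger confidence bonus
```

Each call to `update` shrinks the confidence width of the chosen arm for similar queries, so arms that have been tried and found unrewarding are gradually preferred less than arms that are either promising or still uncertain.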
Key Findings
MetaLLM can outperform a mid-tier baseline (text-babbage-001) while keeping the same budget.
MetaLLM matches or slightly improves top-model accuracy while cutting cost versus the priciest model.
MetaLLM improves MMLU accuracy and reduces cost on a heterogeneous provider set.
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run a quick prototype: collect 1–2k labeled samples and compute SBERT embeddings.
Train the linear reward model (reward = correctness − p·cost) and pick p to match your budget.
Deploy MetaLLM in online mode with UCB updates and measure cost/accuracy trade-offs for a week.
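One way to approach the "pick p to match your budget" step is a simple grid sweep on labeled validation data. The sketch below is an assumption-laden illustration: the function names, the grid, and the toy data are hypothetical, and it routes with observed correctness as a stand-in for the predictions of a trained reward model.

```python
def route(correct_row, cost_row, p):
    """Index of the model maximizing correctness - p * cost for one query."""
    rewards = [c - p * k for c, k in zip(correct_row, cost_row)]
    return max(range(len(rewards)), key=rewards.__getitem__)


def evaluate(correct, cost, p):
    """Average accuracy and average cost when every query is routed with penalty p."""
    picks = [route(c_row, k_row, p) for c_row, k_row in zip(correct, cost)]
    n = len(picks)
    acc = sum(correct[i][m] for i, m in enumerate(picks)) / n
    avg_cost = sum(cost[i][m] for i, m in enumerate(picks)) / n
    return acc, avg_cost


def pick_p(correct, cost, budget, grid):
    """Smallest p on the grid whose average cost fits the budget, or None."""
    for p in sorted(grid):
        acc, avg_cost = evaluate(correct, cost, p)
        if avg_cost <= budget:
            return p, acc, avg_cost
    return None


# Hypothetical validation set: 3 queries x 2 models (cheap, expensive).
correct = [[1, 1], [0, 1], [0, 1]]           # 1 = model answered correctly
cost = [[0.1, 1.0], [0.1, 1.0], [0.1, 1.0]]  # normalized cost per call

print(pick_p(correct, cost, budget=0.8, grid=[0.0, 0.5, 2.0]))
print(pick_p(correct, cost, budget=0.5, grid=[0.0, 0.5, 2.0]))
```

Taking the smallest feasible p keeps as much weight on correctness as the budget allows, since a larger penalty can only move each query toward an equally cheap or cheaper model.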
Agent Features
Memory
- online updates (contextual bandit)
Tool Use
- model selection per query
Frameworks
- contextual multi-armed bandit
- UCB
- linear reward model (ridge)
Optimization Features
Token Efficiency
- cost normalization of token-based API pricing in reward
Model Optimization
- reward-guided routing to reduce calls to expensive models
System Optimization
- mix of offline initialization and online updates to adapt to distribution shifts
Training Optimization
- closed-form ridge regression initialization for linear reward models
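As a rough illustration of the closed-form initialization, the sketch below solves the ridge normal equations (XᵀX + λI)w = Xᵀy for one model's reward weights. It assumes the rows of X are query embeddings and y holds that model's rewards (correctness − p·cost); `ridge_fit` and `lam` are hypothetical names, and the small Gauss-Jordan solver is only there to avoid external dependencies.

```python
def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge: solve (X^T X + lam*I) w = X^T y by Gauss-Jordan elimination."""
    d = len(X[0])
    # Augmented system [X^T X + lam*I | X^T y].
    M = [
        [sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0) for j in range(d)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(d)
    ]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]


def predict(w, x):
    """Predicted reward for a query embedding x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data: two 2-dimensional "embeddings" with observed rewards.
w = ridge_fit([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0], lam=1.0)
print(w, predict(w, [1.0, 1.0]))
```

With λ > 0 the system matrix is positive definite, so the solve is always well posed even when there are fewer labeled samples than embedding dimensions.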
Inference Optimization
- calling a single model per query saves token/API cost
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluated only on zero-shot classification and multiple-choice QA, not free-form generation.
- Router is a simple linear model on SBERT embeddings; richer features or nonlinear models might improve routing.
- Reward combines only correctness and monetary cost; latency, robustness, and other metrics are not included.
- Offline training can fail if auxiliary training distribution differs strongly from test distribution.
When Not To Use
- When task output quality needs human judgment or complex free-text evaluation.
- When no labeled training or validation data is available and online initialization is undesirable.
- When you must optimize for latency or other non-monetary trade-offs not in the reward.
Failure Modes
- Poor initialization: without good offline data, online bandit can default to the cheapest model (observed with Gemma).
- Dynamic cost penalty can push long, complex queries to cheaper models and reduce accuracy.
- Distribution mismatch between auxiliary training data and production queries can lead to suboptimal routing.
Core Entities
Models
- text-ada-001
- text-babbage-001
- text-curie-001
- text-davinci-002
- Gemma
- Llama
- Mistral
- Qwen
Metrics
- Accuracy
- api_cost
Datasets
- SST-2
- MMLU
Context Entities
Metrics
- average cost per 10,000 queries
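For reference, the cost metric above is simple arithmetic: the mean per-query cost scaled to 10,000 queries. The sketch assumes per-1,000-token billing; the token counts and price are hypothetical illustration values, not figures from the paper.

```python
def query_cost(n_tokens: int, price_per_1k_tokens: float) -> float:
    """Dollar cost of one call, assuming the provider bills per 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens


def cost_per_10k_queries(per_query_costs) -> float:
    """Mean per-query cost scaled to 10,000 queries."""
    return 10_000 * sum(per_query_costs) / len(per_query_costs)


# Hypothetical usage: two calls at an assumed price of $0.02 per 1,000 tokens.
costs = [query_cost(500, 0.02), query_cost(1500, 0.02)]
print(cost_per_10k_queries(costs))
```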

