Overview
Experiments show consistent gains on synthetic MAF-derived workloads and multiple task mixes, but the results come from simulation on 4 GPUs with specific OPT models; production gains will depend on real traffic and the actual task set.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
A small learned router can cut GPU costs or delay scaling while keeping user-facing LLM services responsive during bursts, increasing quality-per-GPU and availability.
Summary TLDR
This paper builds a lightweight router trained with deep Q-learning that dynamically sends requests to different-sized LLMs (OPT-125M, OPT-1.3B, OPT-6.7B) to trade quality for latency. The learned router runs cheaply on CPU, preserves availability at much higher arrival rates than a single large model, and raises performance per GPU by ~3.9× versus an 8-GPU large-model baseline on evaluated synthetic workloads. The method is robust to shifting task mixes and supports hard or soft latency deadlines. Results are from simulation with 4 GPUs and four standardized NLP tasks.
Problem Statement
Over-provisioning GPUs to handle bursty LLM requests is costly. Static single-model serving either wastes hardware or misses latency deadlines and degrades user experience. We need a low-cost, online policy that routes requests to different model sizes to maximize overall quality while meeting latency constraints.
Main Contribution
A best-effort serving framework that routes requests among multiple model sizes using a small RL router to trade quality vs latency.
Implementation detail: the router is a 2-layer MLP trained with Double DQN; its state consists of the task ID, the per-model batch sizes, and the estimated arrival rate.
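As a rough illustration of the state and training target described above, the sketch below encodes a router state and computes a standard Double DQN target. The feature layout (one-hot task ID, raw batch sizes, raw arrival rate), the function names, and the discount factor are assumptions for illustration; the summary does not specify them.

```python
from typing import List, Sequence

NUM_TASKS = 4  # the summary mentions four NLP tasks
MODELS = ["OPT-125M", "OPT-1.3B", "OPT-6.7B"]  # one routing action per model


def encode_state(task_id: int, batch_sizes: Sequence[int], arrival_rate: float) -> List[float]:
    """Build a flat feature vector (assumed layout, not taken from the paper):
    one-hot task ID, then per-model batch sizes, then the estimated arrival rate."""
    if not 0 <= task_id < NUM_TASKS:
        raise ValueError("task_id out of range")
    if len(batch_sizes) != len(MODELS):
        raise ValueError("need one batch size per model")
    one_hot = [1.0 if i == task_id else 0.0 for i in range(NUM_TASKS)]
    return one_hot + [float(b) for b in batch_sizes] + [float(arrival_rate)]


def double_dqn_target(reward: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float],
                      gamma: float = 0.99,
                      done: bool = False) -> float:
    """Double DQN target: the online network selects the next action,
    the target network evaluates that action."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]
```

Separating action selection (online network) from action evaluation (target network) is what distinguishes Double DQN from plain DQN and reduces overestimation of Q-values.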
Key Findings
The learned router remains available at more than 10× higher arrival rates than serving only OPT-6.7B on the stable workload (Figure 2).
On the first unpredictable workload, the policy produces 1.53× more high-quality windows (≥99%) than static OPT-6.7B serving (Table 3).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Availability under load | Remains available at >10× higher arrival rates vs OPT-6.7B | OPT-6.7B single-model serving | >10× | stable workload | Section 4.3; Figure 2 | Figure 2 discussion |
| High-quality windows (≥99%) | 1.53× more windows ≥99% vs OPT-6.7B | OPT-6.7B | ×1.53 | first unpredictable workload | Table 3; Section 4.4 | Table 3 |
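One plausible way to compute a "high-quality windows" figure like the one in the table is sketched below. It assumes each request has a quality score normalized to the large model (1.0 means equal quality), that windows are fixed-size groups of consecutive requests, and that a window counts as high quality when its mean score reaches the threshold. The summary does not define the metric, so all three are assumptions.

```python
from typing import Sequence


def high_quality_window_fraction(qualities: Sequence[float],
                                 window: int,
                                 threshold: float = 0.99) -> float:
    """Fraction of full windows whose mean quality is at least `threshold`.

    `qualities` holds per-request scores normalized to the large model
    (an assumed convention). A trailing partial window is ignored.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    windows = [qualities[i:i + window] for i in range(0, len(qualities), window)]
    windows = [w for w in windows if len(w) == window]
    if not windows:
        return 0.0
    hits = sum(1 for w in windows if sum(w) / len(w) >= threshold)
    return hits / len(windows)
```

Comparing this fraction for the router against the same fraction for a single-model baseline would give a ratio of the kind reported in the Delta column.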
What To Try In 7 Days
Measure the per-task accuracy of two smaller models and one large model on your own tasks.
Prototype a simple router that switches between two models by arrival rate and deadlines.
Train a small DQN on synthetic arrival traces resembling your traffic and evaluate windowed quality and GPU utilization.
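The second step above can start from a hand-written rule before any learning is involved. The sketch below is a hypothetical two-model threshold router, not the paper's learned policy; `ModelProfile`, its fields, and the example numbers are illustrative assumptions that you would replace with your own measurements.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    latency_s: float  # measured per-request latency (hypothetical value below)
    max_rate: float   # sustainable requests per second (hypothetical value below)


def route(arrival_rate: float, deadline_s: float,
          small: ModelProfile, large: ModelProfile) -> str:
    """Prefer the large model when it can keep up with traffic and meet the
    deadline; otherwise fall back to the small model."""
    if arrival_rate <= large.max_rate and large.latency_s <= deadline_s:
        return large.name
    return small.name


# Example profiles with made-up numbers, for illustration only.
small = ModelProfile("OPT-125M", latency_s=0.05, max_rate=200.0)
large = ModelProfile("OPT-6.7B", latency_s=0.8, max_rate=5.0)
```

A rule like this gives a baseline to compare against once a DQN router is trained on synthetic traces.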
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Evaluations use synthetic workloads derived from MAF traces rather than live production traces.
Testbed is limited to OPT models (125M–6.7B) and 4 GPUs; larger model families or different hardware may change outcomes.
When Not To Use
If you serve a single fixed model or cannot tag requests by task.
If strict worst-case latency guarantees for every request are mandatory (RL is tuned for average cumulative utility).
Failure Modes
Bad arrival-rate estimates can cause suboptimal routing and missed deadlines.
Reward mis-specification (wrong task utilities) can bias routing away from priority users.
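Since routing depends on the arrival-rate estimate, it helps to make the estimator explicit and testable. The sketch below uses an exponentially weighted moving average over per-interval request counts. This is a common smoothing choice assumed here for illustration; the summary does not say how the paper estimates the arrival rate, and `EwmaRateEstimator` and `alpha` are hypothetical names.

```python
from typing import Optional


class EwmaRateEstimator:
    """Smoothed requests-per-second estimate (assumed technique, not the paper's)."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.rate: Optional[float] = None

    def update(self, request_count: int, interval_s: float) -> float:
        """Fold one observation interval into the running estimate."""
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        observed = request_count / interval_s
        if self.rate is None:
            self.rate = observed  # the first observation seeds the estimate
        else:
            self.rate = self.alpha * observed + (1.0 - self.alpha) * self.rate
        return self.rate
```

A larger `alpha` reacts faster to bursts but is noisier; a smaller one is smoother but lags, which is one way an estimate can go stale and lead to missed deadlines.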

