Use a small RL router to pick model sizes per request and keep LLM services fast and cheap under bursty load

January 15, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

Experiments show consistent gains on synthetic MAF-derived workloads and multiple task mixes, but results are simulated on 4 GPUs and specific OPT models; production gains depend on real traffic and task set.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

Why It Matters For Business

A small learned router can cut GPU costs or delay scaling while keeping user-facing LLM services responsive during bursts, increasing quality-per-GPU and availability.

Summary TLDR

This paper builds a lightweight router trained with deep Q-learning that dynamically sends requests to different-sized LLMs (OPT-125M, OPT-1.3B, OPT-6.7B) to trade quality for latency. The learned router runs cheaply on CPU, preserves availability at much higher arrival rates than a single large model, and raises performance per GPU by ~3.9× versus an 8-GPU large-model baseline on evaluated synthetic workloads. The method is robust to shifting task mixes and supports hard or soft latency deadlines. Results are from simulation with 4 GPUs and four standardized NLP tasks.

Problem Statement

Over-provisioning GPUs to handle bursty LLM requests is costly. Static single-model serving either wastes hardware or misses latency deadlines and degrades user experience. We need a low-cost, online policy that routes requests to different model sizes to maximize overall quality while meeting latency constraints.

Main Contribution

A best-effort serving framework that routes requests among multiple model sizes using a small RL router to trade quality vs latency.

Implementation detail: router is a 2-layer MLP trained with Double DQN and uses task id, per-model batch sizes, and estimated arrival rate as state.
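The state and training target described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature layout (one-hot task id, then batch sizes, then arrival rate), the discount factor, and the names `encode_state` and `double_dqn_target` are assumptions made for the example.

```python
from typing import Callable, List, Sequence

MODELS = ["OPT-125M", "OPT-1.3B", "OPT-6.7B"]        # one action per model size
TASKS = ["HellaSwag", "COPA", "PIQA", "OpenBookQA"]  # tasks used in the evaluation

# A Q-function maps a state vector to one Q-value per model (assumed interface).
QFunc = Callable[[List[float]], List[float]]


def encode_state(task: str, batch_sizes: Sequence[int], arrival_rate: float) -> List[float]:
    """Build the router input: one-hot task id + per-model batch sizes + arrival rate."""
    one_hot = [1.0 if t == task else 0.0 for t in TASKS]
    return one_hot + [float(b) for b in batch_sizes] + [float(arrival_rate)]


def double_dqn_target(reward: float, next_state: List[float], done: bool,
                      q_online: QFunc, q_target: QFunc, gamma: float = 0.99) -> float:
    """Double DQN target: the online network picks the action, the target network scores it."""
    if done:
        return reward
    online_q = q_online(next_state)
    best_action = max(range(len(online_q)), key=online_q.__getitem__)
    return reward + gamma * q_target(next_state)[best_action]
```

In a full training loop, `q_online` and `q_target` would be the 2-layer MLP and its periodically synced copy; here they are left as plain callables so the target computation can be checked in isolation.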

Key Findings

Learned router preserves availability at much higher arrival rates than serving only the large model.

Numbers: Remains available at arrival rates more than 10× higher than OPT-6.7B can sustain (stable workload).

Practical Use: If you have bursty traffic, a learned router can avoid buying extra GPUs while keeping your service responsive.

Evidence Ref: Section 4.3; Figure 2 discussion

On unpredictable workloads the policy produces many more high-quality windows than static large-model serving.

Numbers: 1.53× more windows at ≥99% of peak performance, 2.32× more at ≥98%, and 4.11× more at ≥96% (first unpredictable workload).

Practical Use: Your service will hit near-peak quality more often during bursts by routing dynamically.

Evidence Ref: Table 3; Section 4.4
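To make the window counts concrete, here is a minimal sketch of how such a metric could be computed from per-request scores. The summary does not say whether windows overlap, so consecutive non-overlapping windows of 20 requests are assumed; `peak` stands for the reference (peak) performance and is supplied by the caller. Function names are hypothetical.

```python
from typing import List, Sequence


def windowed_means(scores: Sequence[float], window: int = 20) -> List[float]:
    """Mean score of each consecutive, non-overlapping window (assumed windowing)."""
    full = len(scores) - len(scores) % window  # ignore a trailing partial window
    return [sum(scores[i:i + window]) / window for i in range(0, full, window)]


def count_high_quality_windows(scores: Sequence[float], peak: float,
                               threshold: float = 0.99, window: int = 20) -> int:
    """Number of windows whose mean reaches `threshold` of the peak performance."""
    return sum(1 for m in windowed_means(scores, window) if m >= threshold * peak)
```

A ratio such as "1.53× more windows" would then be the count for the routed policy divided by the count for the static baseline on the same trace.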

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Availability under load | Remains available at >10× higher arrival rates vs OPT-6.7B | OPT-6.7B single-model serving | >10× | stable workload | Section 4.3; Figure 2 | Figure 2 discussion
High-quality windows (≥99%) | 1.53× more windows ≥99% vs OPT-6.7B | OPT-6.7B | 1.53× | first unpredictable workload | Table 3; Section 4.4 | Table 3

What To Try In 7 Days

Measure per-task accuracy of two model sizes and a large model for your tasks.

Prototype a simple router that switches between two models by arrival rate and deadlines.

Train a small DQN on synthetic arrival traces resembling your traffic and evaluate windowed quality and GPU utilization.
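Before training anything, the second step above can be prototyped with a fixed rule. The sketch below is a hand-written baseline, not the learned policy from the paper: `ModelProfile`, `route`, and the `rate_threshold` parameter are hypothetical names, and the latency figures would come from your own measurements.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    latency_s: float  # measured per-request latency at the current load (you supply this)


def route(arrival_rate: float, deadline_s: float,
          small: ModelProfile, large: ModelProfile,
          rate_threshold: float) -> str:
    """Prefer the large model; fall back to the small one under load or tight deadlines."""
    overloaded = arrival_rate > rate_threshold
    too_slow = large.latency_s > deadline_s
    return small.name if (overloaded or too_slow) else large.name
```

A rule like this gives a reference point: if a trained DQN cannot beat it on windowed quality for your traces, the added complexity is probably not paying off yet.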

Agent Features

Memory: Keeps per-model batch sizes and a running arrival-rate estimate (last 5 arrivals)
Planning: Reactive per-request routing (no long-horizon planner)
Tool Use: None (router dispatches to models; uses vLLM for serving)
Frameworks: Double Q-learning, LoRA
Is Agentic: Yes
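The running arrival-rate estimate mentioned under Memory could look like the sketch below. Only the window of the last 5 arrivals comes from the summary; the estimator formula (arrivals minus one, divided by the time span) and the class name `ArrivalRateEstimator` are assumptions.

```python
from collections import deque
from typing import Deque


class ArrivalRateEstimator:
    """Estimate requests per second from the most recent arrival timestamps."""

    def __init__(self, history: int = 5) -> None:
        # deque with maxlen drops the oldest timestamp automatically
        self._times: Deque[float] = deque(maxlen=history)

    def observe(self, timestamp_s: float) -> None:
        self._times.append(timestamp_s)

    def rate(self) -> float:
        """Return 0.0 until at least two arrivals have been seen."""
        if len(self._times) < 2:
            return 0.0
        # Clamp the span so simultaneous arrivals do not divide by zero.
        span = max(self._times[-1] - self._times[0], 1e-9)
        return (len(self._times) - 1) / span
```

Such a short window reacts quickly to bursts but is noisy, which is one way the bad arrival-rate estimates listed under Failure Modes could arise.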

Architectures: DQN (Deep Q-Network), 2-layer MLP policy (hidden 256)

Optimization Features

Infra Optimization: Higher performance-per-GPU (hardware utility) vs static replication
Model Optimization: Dynamic model selection across sizes
System Optimization: Co-host smaller models with large model on same GPU memory fractions; per-task reward tuning (hard/soft deadlines)
Training Optimization: Single DQN policy trained with random arrival-rate sampling
Inference Optimization: CPU router inference (≈0.1 ms) to minimize control latency; automatic replica load-balancing to smallest batch
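Load-balancing to the smallest batch amounts to a one-line selection once per-replica batch sizes are tracked. A minimal sketch, assuming a dictionary from replica id to current batch size (the replica ids and the function name `pick_replica` are hypothetical):

```python
from typing import Dict


def pick_replica(batch_sizes: Dict[str, int]) -> str:
    """Return the replica with the smallest current batch; ties go to the first inserted."""
    if not batch_sizes:
        raise ValueError("no replicas available")
    return min(batch_sizes, key=batch_sizes.__getitem__)
```

The router would first choose a model size, then apply this selection among that model's replicas.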

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Evaluations use synthetic workloads derived from MAF traces rather than live production traces.

Testbed is limited to OPT models (125M–6.7B) and 4 GPUs; larger model families or different hardware may change outcomes.

When Not To Use

If you serve a single fixed model or cannot tag requests by task.

If strict worst-case latency guarantees for every request are mandatory (RL is tuned for average cumulative utility).

Failure Modes

Bad arrival-rate estimates can cause suboptimal routing and missed deadlines.

Reward mis-specification (wrong task utilities) can bias routing away from priority users.
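To see how task utilities and deadline type enter the reward, consider the sketch below. The summary does not give the paper's reward function, so this shape is purely an assumption: under a hard deadline a late response earns nothing, and under a soft deadline the reward decays linearly with lateness. All parameter names are hypothetical.

```python
def request_reward(quality: float, latency_s: float, deadline_s: float,
                   task_utility: float = 1.0, hard: bool = True,
                   soft_penalty_per_s: float = 1.0) -> float:
    """Illustrative per-request reward (assumed form, not taken from the paper)."""
    base = task_utility * quality
    if latency_s <= deadline_s:
        return base          # on time: full utility-weighted quality
    if hard:
        return 0.0           # hard deadline: a late answer is worth nothing
    lateness = latency_s - deadline_s
    # soft deadline: linear decay with lateness, floored at zero
    return max(0.0, base * (1.0 - soft_penalty_per_s * lateness))
```

With a reward of this kind, setting `task_utility` too low for a high-priority task would make the router systematically favor smaller models for it, which is the mis-specification risk described above.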

Core Entities

Models

OPT-125M, OPT-1.3B, OPT-6.7B

Metrics

windowed average performance (size 20), percentage of peak performance, hardware utility (performance per GPU), task-specific model selection frequency

Datasets

HellaSwag, COPA, PIQA, OpenBookQA

Benchmarks

stable synthetic workload, unpredictable workload #1, unpredictable workload #2