Use a small RL router to pick model sizes per request and keep LLM services fast and cheap under bursty load

Overview

Decision Snapshot: Needs Validation

Experiments show consistent gains on synthetic MAF-derived workloads and multiple task mixes, but results are simulated on 4 GPUs and specific OPT models; production gains depend on real traffic and task set.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Siddharth Jha, Coleman Hooper, Xiaoxuan Liu, Sehoon Kim, Kurt Keutzer

Links

Abstract / PDF

Why It Matters For Business

A small learned router can cut GPU costs or delay scaling while keeping user-facing LLM services responsive during bursts, increasing quality-per-GPU and availability.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper builds a lightweight router trained with deep Q-learning that dynamically sends requests to different-sized LLMs (OPT-125M, OPT-1.3B, OPT-6.7B) to trade quality for latency. The learned router runs cheaply on CPU, preserves availability at much higher arrival rates than a single large model, and raises performance per GPU by ~3.9× versus an 8-GPU large-model baseline on evaluated synthetic workloads. The method is robust to shifting task mixes and supports hard or soft latency deadlines. Results are from simulation with 4 GPUs and four standardized NLP tasks.

Problem Statement

Over-provisioning GPUs to handle bursty LLM requests is costly. Static single-model serving either wastes hardware or misses latency deadlines and degrades user experience. We need a low-cost, online policy that routes requests to different model sizes to maximize overall quality while meeting latency constraints.

Main Contribution

A best-effort serving framework that routes requests among multiple model sizes using a small RL router to trade quality vs latency.

Implementation detail: router is a 2-layer MLP trained with Double DQN and uses task id, per-model batch sizes, and estimated arrival rate as state.
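A minimal sketch of the state vector and the Double DQN target described above, in plain Python. The source only lists the state components; the one-hot task encoding, the unnormalized features, the discount factor, and the names `encode_state` and `double_dqn_target` are assumptions made for illustration.

```python
from typing import List, Sequence


def encode_state(task_id: int, num_tasks: int,
                 batch_sizes: Sequence[int], arrival_rate: float) -> List[float]:
    """Build the router input: one-hot task id, per-model batch sizes, arrival rate.

    The one-hot layout is an assumption; the source does not say how the task id is encoded.
    """
    one_hot = [1.0 if i == task_id else 0.0 for i in range(num_tasks)]
    return one_hot + [float(b) for b in batch_sizes] + [float(arrival_rate)]


def double_dqn_target(reward: float, done: bool, gamma: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float]) -> float:
    """Double DQN target: the online network picks the next action,
    the target network supplies that action's value."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical numbers: 4 tasks, 3 models (small, medium, large).
state = encode_state(task_id=2, num_tasks=4, batch_sizes=[3, 1, 8], arrival_rate=12.5)
target = double_dqn_target(1.0, False, 0.9, [0.2, 0.7, 0.1], [0.5, 0.4, 0.9])
```

Decoupling action selection (online network) from action evaluation (target network) is what distinguishes Double DQN from plain DQN, which takes the maximum over the target network's own values and tends to overestimate.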

Key Findings

Learned router preserves availability at much higher arrival rates than serving only the large model.

Numbers: Remains available for >10× faster arrival rates than OPT-6.7B (stable workload).

Practical Use: If you have bursty traffic, a learned router can avoid buying extra GPUs while keeping your service responsive.

Evidence Ref: Section 4.3; Figure 2 discussion

On unpredictable workloads the policy produces many more high-quality windows than static large-model serving.

Numbers: 1.53× more windows ≥99%, 2.32× more ≥98%, 4.11× more ≥96% (first unpredictable workload).

Practical Use: Your service will hit near-peak quality more often during bursts by routing dynamically.

Evidence Ref: Table 3; Section 4.4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Availability under load	Remains available at >10× higher arrival rates vs OPT-6.7B	OPT-6.7B single-model serving	>10×	stable workload	Section 4.3; Figure 2	Figure 2 discussion
High-quality windows (≥99%)	1.53× more windows ≥99% vs OPT-6.7B	OPT-6.7B	1.53×	first unpredictable workload	Table 3; Section 4.4	Table 3

What To Try In 7 Days

Measure per-task accuracy of two smaller models and one large model on your own tasks.

Prototype a simple router that switches between two models by arrival rate and deadlines.

Train a small DQN on synthetic arrival traces resembling your traffic and evaluate windowed quality and GPU utilization.
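The two-model prototype suggested above could start as a simple threshold rule before any RL is involved. This is a hedged sketch, not the paper's method: the function `route`, its parameters, and the example numbers are all hypothetical.

```python
def route(arrival_rate: float, deadline_s: float,
          large_latency_s: float, rate_threshold: float) -> str:
    """Pick 'large' only when load is light and the large model fits the deadline.

    arrival_rate    : estimated requests per second
    deadline_s      : latency budget for this request, in seconds
    large_latency_s : expected latency of the large model right now, in seconds
    rate_threshold  : arrival rate above which the large model is assumed to saturate
    """
    if arrival_rate <= rate_threshold and large_latency_s <= deadline_s:
        return "large"
    return "small"


# Hypothetical values: the large model saturates above 5 requests/s.
quiet = route(arrival_rate=2.0, deadline_s=1.0, large_latency_s=0.6, rate_threshold=5.0)
burst = route(arrival_rate=20.0, deadline_s=1.0, large_latency_s=0.6, rate_threshold=5.0)
```

A rule like this gives a baseline to compare a learned policy against: if the DQN router does not beat the threshold rule on your traces, the extra training complexity may not be worth it.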

Agent Features

Memory

Keeps per-model batch sizes and running arrival-rate estimate (last 5 arrivals)
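One plausible way to keep a running arrival-rate estimate over the last 5 arrivals is shown below. The source gives the window size but not the formula, so the choice of "intervals divided by elapsed time" and the class name `ArrivalRateEstimator` are assumptions.

```python
from collections import deque


class ArrivalRateEstimator:
    """Estimate requests per second from the most recent arrival timestamps."""

    def __init__(self, window: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest timestamp when full.
        self.times = deque(maxlen=window)

    def observe(self, timestamp_s: float) -> None:
        self.times.append(timestamp_s)

    def rate(self) -> float:
        # Fewer than two arrivals gives no interval to measure.
        if len(self.times) < 2:
            return 0.0
        span = self.times[-1] - self.times[0]
        if span <= 0:
            return float("inf")
        # n timestamps span n - 1 inter-arrival intervals.
        return (len(self.times) - 1) / span


est = ArrivalRateEstimator(window=5)
for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    est.observe(t)
steady_rate = est.rate()   # 4 intervals over 2.0 s
est.observe(2.1)           # window now covers 0.5 .. 2.1
burst_rate = est.rate()    # 4 intervals over 1.6 s
```

A short window reacts quickly to bursts but is noisy, which connects to the failure mode noted under Risks & Boundaries: bad arrival-rate estimates lead to suboptimal routing.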

Planning

Reactive per-request routing (no long horizon planner)

Tool Use

None (router dispatches to models; uses vLLM for serving)

Frameworks

Double Q-learning, LoRA

Is Agentic

Yes

Architectures

DQN (Deep Q-Network), 2-layer MLP policy (hidden size 256)

Optimization Features

Infra Optimization

Higher performance-per-GPU (hardware utility) vs static replication

Model Optimization

Dynamic model selection across sizes

System Optimization

Co-host smaller models with the large model on the same GPU using memory fractions

Per-task reward tuning (hard/soft deadlines)
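To make the hard/soft deadline distinction concrete, here is one way a per-request reward could be shaped. The source states only that both deadline types are supported; the linear decay for soft deadlines, the `decay_per_s` parameter, and the function name `deadline_reward` are assumptions.

```python
def deadline_reward(quality: float, latency_s: float, deadline_s: float,
                    mode: str = "hard", decay_per_s: float = 1.0) -> float:
    """Reward for one served request.

    On time: full quality. Late with a hard deadline: nothing.
    Late with a soft deadline: quality shrinks linearly with the overshoot
    (linear decay is an illustrative choice, not taken from the source).
    """
    if latency_s <= deadline_s:
        return quality
    if mode == "hard":
        return 0.0
    if mode == "soft":
        overshoot = latency_s - deadline_s
        return max(0.0, quality * (1.0 - decay_per_s * overshoot))
    raise ValueError(f"unknown deadline mode: {mode!r}")


on_time = deadline_reward(0.8, latency_s=1.0, deadline_s=2.0)
hard_miss = deadline_reward(0.8, latency_s=2.5, deadline_s=2.0, mode="hard")
soft_miss = deadline_reward(0.8, latency_s=2.5, deadline_s=2.0, mode="soft")
```

Per-task tuning would then amount to choosing a different `mode`, deadline, or decay for each task id.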

Training Optimization

Single DQN policy trained with random arrival-rate sampling

Inference Optimization

CPU router inference (≈0.1 ms) to minimize control latency

Automatic replica load-balancing to smallest batch
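Load-balancing to the smallest batch can be as simple as an argmin over the replicas' current batch sizes. A minimal sketch; the function name `pick_replica` and the tie-breaking rule (lowest index wins) are assumptions.

```python
from typing import Sequence


def pick_replica(batch_sizes: Sequence[int]) -> int:
    """Return the index of the replica with the smallest current batch.

    Ties go to the lowest index because min() returns the first minimum.
    """
    if not batch_sizes:
        raise ValueError("no replicas available")
    return min(range(len(batch_sizes)), key=lambda i: batch_sizes[i])


chosen = pick_replica([4, 2, 2, 7])
```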

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Evaluations use synthetic workloads derived from MAF traces rather than live production traces.

Testbed is limited to OPT models (125M–6.7B) and 4 GPUs; larger model families or different hardware may change outcomes.

When Not To Use

If you serve a single fixed model or cannot tag requests by task.

If strict worst-case latency guarantees for every request are mandatory (RL is tuned for average cumulative utility).

Failure Modes

Bad arrival-rate estimates can cause suboptimal routing and missed deadlines.

Reward mis-specification (wrong task utilities) can bias routing away from priority users.

Core Entities

Models

OPT-125M, OPT-1.3B, OPT-6.7B

Metrics

windowed average performance (size 20)

percentage of peak performance

hardware utility (performance per GPU)

task-specific model selection frequency
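The first three metrics can be computed from a list of per-request scores roughly as follows. This is a sketch under assumptions: the source does not say whether windows overlap (non-overlapping windows are used here), and all function names are hypothetical.

```python
from typing import List, Sequence


def windowed_averages(scores: Sequence[float], window: int = 20) -> List[float]:
    """Average score over consecutive, non-overlapping windows.

    A trailing partial window is dropped.
    """
    return [sum(scores[i:i + window]) / window
            for i in range(0, len(scores) - window + 1, window)]


def share_of_windows_at_or_above(window_avgs: Sequence[float],
                                 peak: float, fraction: float) -> float:
    """Fraction of windows whose average reaches `fraction` of peak performance."""
    if not window_avgs:
        return 0.0
    hits = sum(1 for w in window_avgs if w >= fraction * peak)
    return hits / len(window_avgs)


def hardware_utility(avg_performance: float, num_gpus: int) -> float:
    """Performance per GPU."""
    return avg_performance / num_gpus


# Hypothetical trace: 20 perfect requests, then 20 at half quality.
scores = [1.0] * 20 + [0.5] * 20
avgs = windowed_averages(scores, window=20)
share_99 = share_of_windows_at_or_above(avgs, peak=1.0, fraction=0.99)
utility = hardware_utility(0.8, num_gpus=4)
```

Comparing `share_of_windows_at_or_above` between a routed system and a single-model baseline yields ratios of the kind reported under Key Findings (for example, 1.53× more windows at ≥99% of peak).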

Datasets

HellaSwag, COPA, PIQA, OpenBookQA

Benchmarks

stable synthetic workload, unpredictable workload #1, unpredictable workload #2

