Train a cheap controller LLM to route queries to expert LLMs via RL so the system meets different cost budgets while keeping high accuracy.

November 4, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is a practical system-level design with clear gains on math benchmarks. Evidence is solid for math tasks and the provided baselines. Limitations include frozen experts, narrow task domain, and vendor-dependent cost numbers.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 40%

Authors

Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang

Why It Matters For Business

CORL enables predictable cost-vs-accuracy trade-offs from one deployed system. You can run a single LLM controller that adapts to customer budget tiers, saving inference spend at scale while keeping acceptable accuracy.

Who Should Care

Summary TLDR

The paper builds CORL, a centralized multi-LLM system where a controller LLM is trained by reinforcement learning to decide when to answer itself and when to call expensive expert LLMs. The RL objective combines task accuracy and a cost penalty so the same system can run in low/medium/high budget modes. On math benchmarks, CORL beats the controller alone in low-budget mode and can exceed the best single expert in high-budget mode, while allowing predictable per-query cost control.
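The budget modes described above could be exposed to the controller as a tag in the prompt. The following is a minimal sketch under assumptions: the tag format `[budget: ...]` and the function name `build_prompt` are hypothetical and not taken from the paper, which only states that the system runs in low/medium/high modes.

```python
# Hypothetical budget tags; the source only names low/medium/high modes.
BUDGET_MODES = ("low", "medium", "high")


def build_prompt(question: str, budget_mode: str) -> str:
    """Prefix the question with a budget tag the controller can condition on."""
    if budget_mode not in BUDGET_MODES:
        raise ValueError(f"unknown budget mode: {budget_mode!r}")
    # The tag syntax is an assumption made for illustration only.
    return f"[budget: {budget_mode}]\n{question}"


if __name__ == "__main__":
    print(build_prompt("What is 12 * 13?", "low"))
```

Keeping the mode in the prompt means the same trained controller can be switched between tiers at inference time without loading different weights.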

Problem Statement

Multi-LLM setups can improve accuracy, but decentralized designs call every expert for each query, which makes inference cost high and hard to control. We need a coordinated system that: (1) selectively dispatches queries to cheaper or stronger experts, (2) adapts behavior to different budget modes at inference time, and (3) lets a cheap controller handle queries itself when possible to avoid needless external calls.

Main Contribution

Formalize cost-controllable centralized multi-LLM coordination: a single controller routes queries to frozen expert models under a budget-aware objective.

CORL: an RL training pipeline (PPO + masked token loss) that optimizes a joint reward of task performance and a budget-based cost penalty.
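To illustrate the masked token loss idea, here is a minimal sketch that averages per-token losses only over tokens the controller generated, so that expert replies spliced into a rollout do not contribute to the controller update. The function name and the toy numbers are assumptions for illustration; the paper's actual PPO loss is not reproduced here.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float],
                     controller_mask: Sequence[int]) -> float:
    """Average per-token losses over controller-generated tokens only.

    controller_mask[i] is 1 if token i was produced by the controller and
    0 if it came from an expert reply inserted into the rollout.
    """
    if len(token_losses) != len(controller_mask):
        raise ValueError("losses and mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, controller_mask) if keep]
    if not kept:
        return 0.0  # nothing to learn from if every token was masked out
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy values: positions 2 and 3 (0-based) are expert tokens and are ignored.
    losses = [0.5, 1.0, 9.0, 9.0, 1.5]
    mask = [1, 1, 0, 0, 1]
    print(masked_mean_loss(losses, mask))  # 1.0
```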

Key Findings

CORL lets one trained controller exceed the best single expert at high budget on evaluated math sets.

Numbers: MATH500, CORL High Pass@1 0.958 vs o3 0.938

Practical Use: If you can afford higher per-query cost, train a controller to route to top experts and get better accuracy than using that expert alone.

Evidence Ref: Table 2

CORL improves low-budget accuracy compared to the controller LLM acting alone.

Numbers: MATH500, CORL Low Pass@1 0.900 vs Qwen2.5-7B 0.708

Practical Use: When cost is tight, train a controller so it solves more queries itself and raises accuracy versus running the base controller alone.

Evidence Ref: Table 2; Figure 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MATH500 Pass@1 | CORL High 0.958 | o3 0.938 | +0.020 | MATH500 | Table 2 (Pass@1 values for MATH500) | Table 2
MATH500 Pass@1 (low-budget) | CORL Low 0.900 | Qwen2.5-7B 0.708 | +0.192 | MATH500 | Table 2 (low-budget vs controller-alone) | Table 2

What To Try In 7 Days

Set up a cheap controller model and two expert models; implement a prompt token for budget levels (low/medium/high).

Implement a simple cost-aware reward (performance_accuracy * cost_penalty) and run a short RL (PPO) training job to learn routing.

Evaluate expert-call ratio and per-query cost on a small representative dataset; tune budget B to match business SLAs.
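The reward and evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the penalty is modeled as a hard indicator (1 if the rollout cost is within budget B, else 0), and the `Rollout` fields and dollar amounts are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    correct: bool      # did the final answer match the reference?
    expert_calls: int  # number of expert LLM calls in this rollout
    cost_usd: float    # total per-query cost (hypothetical values below)


def cost_penalty(cost_usd: float, budget_usd: float) -> float:
    """Assumed form: 1.0 inside the budget, 0.0 once the budget is exceeded."""
    return 1.0 if cost_usd <= budget_usd else 0.0


def reward(rollout: Rollout, budget_usd: float) -> float:
    """performance_accuracy * cost_penalty, as in the step above."""
    return float(rollout.correct) * cost_penalty(rollout.cost_usd, budget_usd)


def expert_call_ratio(rollouts: Sequence[Rollout]) -> float:
    """Fraction of queries that triggered at least one expert call."""
    return sum(r.expert_calls > 0 for r in rollouts) / len(rollouts)


def mean_cost(rollouts: Sequence[Rollout]) -> float:
    """Average per-query cost in dollars."""
    return sum(r.cost_usd for r in rollouts) / len(rollouts)


if __name__ == "__main__":
    batch = [
        Rollout(True, 0, 0.001),
        Rollout(True, 2, 0.050),   # correct but over a 0.03 budget
        Rollout(False, 1, 0.020),
        Rollout(True, 0, 0.001),
    ]
    print([reward(r, budget_usd=0.03) for r in batch])
    print(expert_call_ratio(batch), mean_cost(batch))
```

With a multiplicative indicator penalty, a budget B set below the cost of any useful rollout drives every reward to zero, so it is worth checking the reward distribution on a few sampled rollouts before starting a long training run.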

Agent Features

Memory
short-form rollout history (controller keeps past interactions during a query)
Planning
reasoning and decomposition into subqueries; multi-step decision to call experts
Tool Use
selective expert LLM calls; prompt-based query decoration before external calls
Frameworks
PPO (policy optimization); masked token loss to ignore expert tokens during controller update
Is Agentic

Yes

Architectures
centralized controller + frozen expert pool; iterative multi-round controller-expert interaction
Collaboration
controller orchestrates and aggregates expert responses
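The centralized, multi-round interaction listed above can be sketched as a simple loop in which the controller either answers or calls one expert per round and appends the reply to its history. Everything here is hypothetical scaffolding: the action tuples, the `run_episode` name, and the toy controller and expert stand in for real LLM calls.

```python
from typing import Callable, Dict, List, Tuple

Expert = Callable[[str], str]
# Assumed action format: ("answer", text) or ("call", expert_name, subquery).
Controller = Callable[[List[str]], Tuple[str, ...]]


def run_episode(question: str, controller: Controller,
                experts: Dict[str, Expert], max_rounds: int = 4) -> Tuple[str, int]:
    """Run one query; return (final_answer, number_of_expert_calls)."""
    history: List[str] = [f"question: {question}"]
    calls = 0
    for _ in range(max_rounds):
        action = controller(history)
        if action[0] == "answer":
            return action[1], calls
        _, name, subquery = action
        reply = experts[name](subquery)   # frozen expert, never updated
        calls += 1
        history.append(f"{name}: {reply}")
    return "", calls  # round limit reached without a final answer


def toy_controller(history: List[str]) -> Tuple[str, ...]:
    """Stand-in policy: ask the expert once, then return its reply."""
    if len(history) == 1:
        return ("call", "strong", history[0])
    return ("answer", history[-1].split(": ", 1)[1])


if __name__ == "__main__":
    experts = {"strong": lambda q: "42"}
    print(run_episode("What is 6 * 7?", toy_controller, experts))
```

Returning the call count alongside the answer makes it straightforward to compute per-query cost and expert-call ratio from the same rollouts used for training.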

Optimization Features

Token Efficiency
budget token in prompt to control external calls
Infra Optimization
single-node 8x A100 training; FSDP CPU offloading
Model Optimization
controller-only fine-tuning (experts frozen)
System Optimization
use vLLM for efficient rollouts; FSDP and gradient checkpointing for memory savings
Training Optimization
PPO with KL regularization and token masking; GAE for advantage estimation
Inference Optimization
avoid expert calls when controller can self-solve; budget-conditioned prompts to control inference behavior
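For readers unfamiliar with GAE, the sketch below shows the standard generalized advantage estimation recursion in plain Python. The default `gamma` and `lam` values are common choices and are assumptions here; the source does not state the hyperparameters used.

```python
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must be one element longer than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Sparse terminal reward, zero value estimates, toy discount of 0.5.
    print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0],
                         gamma=0.5, lam=1.0))  # [0.25, 0.5, 1.0]
```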

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Deepscaler, MATH500, AMC2023, AIME2024, AIME2025

Risks & Boundaries

Limitations

Only the controller LLM is trained; experts remain frozen. The effects of joint training are unexplored.

Experiments limited to math reasoning datasets; cross-domain behavior unknown.

When Not To Use

When latency must be minimal and contacting external experts adds unacceptable delay.

When a single, well-tuned LLM already meets accuracy and cost needs.

Failure Modes

Reward mis-specification can lead to excessive expert calls or degenerate zero-reward behavior if budget thresholds are too strict.

Prompt constraints can block exploration (hard constraints cause no expert calls even when beneficial).

Core Entities

Models

Qwen2.5-7B-Instruct, o3 (OpenAI), GPT-4.1, GPT-4.1-nano

Metrics

Pass@1, per-query cost ($), expert-call ratio

Datasets

Deepscaler, MATH500, AMC2023, AIME2024, AIME2025

Benchmarks

math reasoning (MATH500/AMC/AIME)