Use RL to train a cheap controller LLM that routes queries to expert LLMs, so one system can meet different cost budgets while keeping accuracy high.

November 4, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.75

Citation Count

0

Authors

Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang

Why It Matters For Business

CORL enables predictable cost-vs-accuracy trade-offs from one deployed system. You can run a single LLM controller that adapts to customer budget tiers, saving inference spend at scale while keeping acceptable accuracy.

Summary TLDR

The paper builds CORL, a centralized multi-LLM system where a controller LLM is trained by reinforcement learning to decide when to answer itself and when to call expensive expert LLMs. The RL objective combines task accuracy and a cost penalty so the same system can run in low/medium/high budget modes. On math benchmarks, CORL beats the controller alone in low-budget mode and can exceed the best single expert in high-budget mode, while allowing predictable per-query cost control.

Problem Statement

Multi-LLM setups can improve accuracy, but decentralized designs call every expert for each query, which makes inference costs high and uncontrolled. We need a coordinated system that: (1) selectively dispatches queries to cheaper or stronger experts, (2) adapts behavior to different budget modes at inference time, and (3) lets a cheap controller handle queries when possible to avoid needless external calls.

Main Contribution

Formalize cost-controllable centralized multi-LLM coordination: single controller routes to frozen expert models with a budget-aware objective.

CORL: an RL training pipeline (PPO + masked token loss) that optimizes a joint reward of task performance and a budget-based cost penalty.

Multi-budget training: condition controller on budget tokens (low/medium/high) so one trained system can operate at different inference-cost targets.

Empirical study on math reasoning: shows controllable expert-call ratios and performance-cost trade-offs across four datasets.
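
Budget conditioning can be pictured at the prompt level. The sketch below is illustrative only: the token strings, `BUDGET_TOKENS`, and `build_controller_prompt` are hypothetical names, and the paper's actual prompt template is not reproduced here.

```python
# Hypothetical budget tokens; the exact strings used in the paper are not given here.
BUDGET_TOKENS = {
    "low": "<budget:low>",
    "medium": "<budget:medium>",
    "high": "<budget:high>",
}


def build_controller_prompt(question: str, budget: str) -> str:
    """Prefix the question with a budget token so one controller can serve several cost tiers."""
    if budget not in BUDGET_TOKENS:
        raise ValueError(f"unknown budget level: {budget!r}")
    return f"{BUDGET_TOKENS[budget]}\n{question}"


if __name__ == "__main__":
    print(build_controller_prompt("What is 17 * 23?", "low"))
```

The same question can then be sent with a different token per customer tier, and the trained controller is expected to change how often it calls experts.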

Key Findings

CORL lets one trained controller exceed the best single expert at high budget on evaluated math sets.

Numbers: MATH500 Pass@1, CORL High 0.958 vs o3 0.938

CORL improves low-budget accuracy compared to the controller LLM acting alone.

Numbers: MATH500 Pass@1, CORL Low 0.900 vs Qwen2.5-7B 0.708

Budget conditioning produces predictable expert-call ratios learned by RL.

Numbers: expert-call rate ordered low < medium < high (Figure 3)

Reward design drives routing: with tight budget B the controller avoids expensive experts even if they raise accuracy.

Numbers: under B=0.001 the controller avoids o3 even though o3 earns a higher task reward (Figure 4)

Controller-only optimization keeps training simpler and cheaper because the experts stay frozen.

Numbers: only controller parameters updated; experts left frozen (Section 3.2)

Results

MATH500 Pass@1

Value: CORL High 0.958

Baseline: o3 0.938

MATH500 Pass@1 (low-budget)

Value: CORL Low 0.900

Baseline: Qwen2.5-7B 0.708

AIME2024 Pass@1

Value: CORL High 0.877

Baseline: o3 0.871

Per-dataset total cost ($)

Value: MATH500 CORL High $5.87

Baseline: o3 $5.642

Who Should Care

What To Try In 7 Days

Set up a cheap controller model and two expert models; implement a prompt token for budget levels (low/medium/high).

Implement a simple cost-aware reward (task accuracy multiplied by a cost penalty) and run a short RL (PPO) training loop to learn routing.

Evaluate expert-call ratio and per-query cost on a small representative dataset; tune budget B to match business SLAs.
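
The cost-aware reward in the second step could start from something like the sketch below. The multiplicative form follows the description above, but the shape of the penalty (full reward within budget, linear decay to zero at twice the budget) is an assumption made for illustration, not the paper's formula. `cost_penalty` and `reward` are hypothetical names.

```python
def cost_penalty(cost: float, budget: float) -> float:
    """Return 1.0 while cost is within budget, then decay linearly to 0.0 at twice the budget.

    This soft penalty shape is an assumption for illustration, not the paper's formula.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if cost <= budget:
        return 1.0
    return max(0.0, 1.0 - (cost - budget) / budget)


def reward(correct: bool, cost: float, budget: float) -> float:
    """Multiplicative reward: task accuracy (0 or 1) times the cost penalty."""
    return float(correct) * cost_penalty(cost, budget)
```

With a budget of 1.0 (arbitrary units), a correct answer costing 1.5 earns 0.5, and one costing 2.0 or more earns nothing. A very tight budget can therefore zero out the reward for every expert call, which is worth checking before a long training run.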

Agent Features

Memory

  • short-form rollout history (controller keeps past interactions during a query)

Planning

  • reasoning and decomposition into subqueries
  • multi-step decision to call experts

Tool Use

  • selective expert LLM calls
  • prompt-based query decoration before external calls

Frameworks

  • PPO (policy optimization)
  • masked token loss to ignore expert tokens during controller update
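
The masking idea can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (per-token losses already computed, a 0/1 mask marking controller tokens); `masked_mean_loss` is a hypothetical helper, not code from the paper.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float], controller_mask: Sequence[int]) -> float:
    """Average per-token losses over controller-generated tokens only.

    controller_mask[i] is 1 for tokens the controller produced and 0 for
    tokens copied in from an expert response.
    """
    if len(token_losses) != len(controller_mask):
        raise ValueError("token_losses and controller_mask must have the same length")
    kept = sum(controller_mask)
    if kept == 0:
        return 0.0  # no controller tokens, so nothing contributes to the update
    return sum(loss * m for loss, m in zip(token_losses, controller_mask)) / kept
```

Dividing by the number of kept tokens, rather than the full sequence length, keeps the loss scale independent of how much expert text appears in the rollout.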

Is Agentic

true

Architectures

  • centralized controller + frozen expert pool
  • iterative multi-round controller-expert interaction
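
A rough sketch of the multi-round interaction, assuming the controller returns either an answer or a request to call a named expert. The action format, `run_episode`, and the toy controller and expert are all hypothetical stand-ins; real models would replace the callables.

```python
from typing import Callable, Dict, List, Tuple

# An action is ("answer", text) or ("call", expert_name, query). Format is hypothetical.
Action = Tuple[str, ...]


def run_episode(
    question: str,
    controller: Callable[[List[str]], Action],
    experts: Dict[str, Callable[[str], str]],
    max_rounds: int = 4,
) -> Tuple[str, int]:
    """Loop until the controller answers or the round limit is hit.

    Returns the final answer and the number of expert calls made.
    """
    history = [question]
    calls = 0
    for _ in range(max_rounds):
        action = controller(history)
        if action[0] == "answer":
            return action[1], calls
        _, expert_name, query = action
        history.append(experts[expert_name](query))  # expert output joins the context
        calls += 1
    return "", calls  # no answer within the round limit


def toy_controller(history: List[str]) -> Action:
    """Toy policy: call the expert once, then repeat its reply as the answer."""
    if len(history) == 1:
        return ("call", "strong", history[0])
    return ("answer", history[-1])


toy_experts = {"strong": lambda query: "42"}
```

Calling `run_episode("What is 6 * 7?", toy_controller, toy_experts)` makes one expert call and then answers with the expert's reply.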

Collaboration

  • controller orchestrates and aggregates expert responses

Optimization Features

Token Efficiency

  • budget token in prompt to control external calls

Infra Optimization

  • single-node 8x A100 training; FSDP CPU offloading

Model Optimization

  • controller-only fine-tuning (experts frozen)

System Optimization

  • use vLLM for efficient rollouts; FSDP and gradient checkpointing for memory savings

Training Optimization

  • PPO with KL regularization and token masking
  • GAE for advantage estimation
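
For reference, GAE on a single finished episode can be computed with a short backward pass. The sketch below uses the standard recursion; the default `gamma` and `lam` values are placeholders, not the paper's hyperparameters.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate at step t; the value after the last
    step is taken to be 0 because the episode has ended.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running  # discounted sum of TD errors
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD error, while `lam=1` gives the full return minus the value estimate.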

Inference Optimization

  • avoid expert calls when controller can self-solve
  • budget-conditioned prompts to control inference behavior

Reproducibility

Data Urls

  • Deepscaler
  • MATH500
  • AMC2023
  • AIME2024
  • AIME2025

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only the controller LLM is trained; experts remain frozen. Joint training effects unexplored.
  • Experiments limited to math reasoning datasets; cross-domain behavior unknown.
  • Some evaluation sets are small (AMC/AIME with 30–40 items), increasing variance in results.
  • Cost numbers depend on vendor prices cited; actual savings vary by provider and contract.
  • Prompt design strongly affects exploration and learned policies (sensitivity to system prompts).
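
To put the small-set caveat in numbers, a binomial standard error gives a rough sense of the noise. The sketch treats each item as one independent trial, which is a simplification (averaging over several samples per item would shrink the error); `pass_at_1_std_error` is a hypothetical helper.

```python
import math


def pass_at_1_std_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimate over n_items independent problems."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


if __name__ == "__main__":
    # 30 items is the lower end of the AMC/AIME set sizes quoted above.
    for n in (30, 500):
        print(n, round(pass_at_1_std_error(0.877, n), 3))
```

At 30 items and an accuracy near 0.88, the standard error is roughly 0.06, so differences of a point or two between systems on such sets should be read with caution.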

When Not To Use

  • When latency must be minimal and contacting external experts adds unacceptable delay.
  • When a single, well-tuned LLM already meets accuracy and cost needs.
  • When you cannot call external models due to privacy, compliance, or availability constraints.

Failure Modes

  • Reward mis-specification can lead to excessive expert calls or degenerate zero-reward behavior if budget thresholds are too strict.
  • Prompt constraints can block exploration (hard constraints cause no expert calls even when beneficial).
  • Small evaluation sets may hide overfitting to dataset idiosyncrasies; real-world tasks may behave differently.

Core Entities

Models

  • Qwen2.5-7B-Instruct
  • o3 (OpenAI)
  • GPT-4.1
  • GPT-4.1-nano

Metrics

  • Pass@1
  • per-query cost ($)
  • expert-call ratio
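
These three metrics can be computed from per-query logs along the lines below. The record fields are hypothetical, and the expert-call ratio is taken here to mean the share of queries with at least one expert call, which is an assumption about the definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryRecord:
    """One evaluated query; field names are hypothetical."""
    correct: bool
    cost_usd: float
    expert_calls: int


def summarize(records: List[QueryRecord]) -> Dict[str, float]:
    """Compute Pass@1, mean per-query cost, and the share of queries with an expert call."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    return {
        "pass_at_1": sum(r.correct for r in records) / n,
        "cost_per_query_usd": sum(r.cost_usd for r in records) / n,
        "expert_call_ratio": sum(r.expert_calls > 0 for r in records) / n,
    }
```

Running `summarize` separately for each budget mode makes it easy to check that call ratio and cost rise together from low to high.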

Datasets

  • Deepscaler
  • MATH500
  • AMC2023
  • AIME2024
  • AIME2025

Benchmarks

  • math reasoning (MATH500/AMC/AIME)