Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.75
Citation Count
0
Why It Matters For Business
CORL enables predictable cost-vs-accuracy trade-offs from one deployed system. You can run a single LLM controller that adapts to customer budget tiers, saving inference spend at scale while keeping acceptable accuracy.
Summary TLDR
The paper builds CORL, a centralized multi-LLM system where a controller LLM is trained by reinforcement learning to decide when to answer itself and when to call expensive expert LLMs. The RL objective combines task accuracy and a cost penalty so the same system can run in low/medium/high budget modes. On math benchmarks, CORL beats the controller alone in low-budget mode and can exceed the best single expert in high-budget mode, while allowing predictable per-query cost control.
Problem Statement
Multi-LLM setups can improve accuracy, but decentralized designs call every expert for each query, which makes inference costs high and uncontrolled. We need a coordinated system that: (1) selectively dispatches queries to cheaper or stronger experts, (2) adapts its behavior to different budget modes at inference time, and (3) lets a cheap controller handle queries itself when possible to avoid needless external calls.
Main Contribution
Formalize cost-controllable centralized multi-LLM coordination: a single controller routes queries to frozen expert models under a budget-aware objective.
CORL: an RL training pipeline (PPO + masked token loss) that optimizes a joint reward of task performance and a budget-based cost penalty.
Multi-budget training: condition controller on budget tokens (low/medium/high) so one trained system can operate at different inference-cost targets.
Empirical study on math reasoning: shows controllable expert-call ratios and performance-cost trade-offs across four datasets.
Key Findings
In high-budget mode, CORL's single trained controller exceeds the best single expert on the evaluated math datasets.
CORL improves low-budget accuracy compared to the controller LLM acting alone.
Budget conditioning produces predictable expert-call ratios learned by RL.
Reward design drives routing: with a tight budget B, the controller avoids expensive experts even if they would raise accuracy.
Controller-only optimization keeps training simpler and cheaper because the experts stay frozen.
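To make the reward-design finding concrete, here is a minimal sketch of a joint reward that multiplies task accuracy by a budget-based penalty. The function name `cost_aware_reward` and the penalty shape (no penalty within budget B, linear decay to zero at twice the budget) are assumptions for illustration; the source does not give the exact formula.

```python
def cost_aware_reward(correct: bool, cost: float, budget: float) -> float:
    """Hypothetical joint reward: task accuracy times a budget-based penalty."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    accuracy = 1.0 if correct else 0.0
    # Assumed penalty shape: 1.0 while cost <= budget, then linear decay
    # that reaches 0.0 once the cost is twice the budget.
    overshoot = max(0.0, cost - budget) / budget
    penalty = max(0.0, 1.0 - overshoot)
    return accuracy * penalty
```

Under this assumed shape, a correct answer that overshoots a tight budget earns less than a correct answer produced cheaply, which is the pressure that would push a controller away from expensive experts.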
Results
MATH500 Pass@1
MATH500 Pass@1 (low-budget)
AIME2024 Pass@1
Per-dataset total cost ($)
Who Should Care
What To Try In 7 Days
Set up a cheap controller model and two expert models; implement a prompt token for budget levels (low/medium/high).
Implement a simple cost-aware reward, such as performance_accuracy * cost_penalty, and run a short RL (PPO) training pass to learn routing.
Evaluate expert-call ratio and per-query cost on a small representative dataset; tune budget B to match business SLAs.
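The evaluation step above could be sketched as a small summary over logged queries. The `QueryLog` record, its field names, and the definition of expert-call ratio used here (the fraction of queries with at least one expert call) are assumptions for illustration, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryLog:
    """Hypothetical per-query record collected during evaluation."""
    budget_mode: str   # "low", "medium", or "high"
    expert_calls: int  # number of expert calls made for this query
    cost_usd: float    # total inference cost of this query in dollars
    correct: bool      # whether the final answer was correct


def summarize(logs: List[QueryLog], mode: str) -> Dict[str, float]:
    """Aggregate routing and cost metrics for one budget mode."""
    rows = [r for r in logs if r.budget_mode == mode]
    if not rows:
        raise ValueError(f"no logged queries for budget mode {mode!r}")
    n = len(rows)
    return {
        # Assumed definition: share of queries that used any expert.
        "expert_call_ratio": sum(r.expert_calls > 0 for r in rows) / n,
        "mean_cost_usd": sum(r.cost_usd for r in rows) / n,
        "pass_at_1": sum(r.correct for r in rows) / n,
    }
```

Comparing these summaries across the low, medium, and high modes gives a quick check of whether the learned routing tracks the intended budget tiers.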
Agent Features
Memory
- short-form rollout history (controller keeps past interactions during a query)
Planning
- reasoning and decomposition into subqueries
- multi-step decision to call experts
Tool Use
- selective expert LLM calls
- prompt-based query decoration before external calls
Frameworks
- PPO (policy optimization)
- masked token loss to ignore expert tokens during controller update
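The masked token loss can be illustrated with a plain-Python sketch that averages per-token losses over controller-generated tokens only, so tokens returned by experts contribute nothing to the controller update. The function name and the list-based inputs are assumptions; a real pipeline would apply the same mask to tensors inside the PPO loss.

```python
from typing import List


def masked_mean_loss(token_losses: List[float], controller_mask: List[int]) -> float:
    """Average the loss over tokens where the mask is 1 (controller tokens).

    Tokens with mask 0 (assumed here to be expert-generated text inserted
    into the rollout) are excluded from both numerator and denominator.
    """
    if len(token_losses) != len(controller_mask):
        raise ValueError("token_losses and controller_mask must have equal length")
    kept = [loss for loss, keep in zip(token_losses, controller_mask) if keep]
    if not kept:
        return 0.0  # nothing to learn from if the controller produced no tokens
    return sum(kept) / len(kept)
```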
Is Agentic
true
Architectures
- centralized controller + frozen expert pool
- iterative multi-round controller-expert interaction
Collaboration
- controller orchestrates and aggregates expert responses
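A rough sketch of the multi-round controller-expert loop described above follows. The action format `(kind, target, text)`, the `run_episode` name, and the `max_rounds` cap are hypothetical; the controller and experts are passed in as plain callables so the sketch stays independent of any model API.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical action format: kind is "answer" or "call"; target names an expert.
Action = Tuple[str, str, str]
History = List[Tuple[str, str]]  # (speaker, text) pairs kept for the current query


def run_episode(
    question: str,
    controller: Callable[[History], Action],
    experts: Dict[str, Callable[[str], str]],
    max_rounds: int = 4,
) -> Tuple[Optional[str], int]:
    """Let the controller either answer or query an expert, for up to max_rounds."""
    history: History = [("user", question)]
    expert_calls = 0
    for _ in range(max_rounds):
        kind, target, text = controller(history)
        if kind == "answer":
            return text, expert_calls
        if kind != "call" or target not in experts:
            raise ValueError(f"unsupported action: {(kind, target)!r}")
        reply = experts[target](text)  # expert stays frozen; we only read its reply
        expert_calls += 1
        history.append((target, reply))
    return None, expert_calls  # no final answer within the round limit
```

Returning the call count alongside the answer makes it straightforward to compute cost and routing statistics per query.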
Optimization Features
Token Efficiency
- budget token in prompt to control external calls
Infra Optimization
- single-node 8x A100 training; FSDP CPU offloading
Model Optimization
- controller-only fine-tuning (experts frozen)
System Optimization
- use vLLM for efficient rollouts; FSDP and gradient checkpointing for memory savings
Training Optimization
- PPO with KL regularization and token masking
- GAE for advantage estimation
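For reference, generalized advantage estimation can be written in a few lines. This is the standard GAE recursion in plain Python; the default `gamma` and `lam` values are placeholders, since the source does not state the hyperparameters used.

```python
from typing import List


def gae(rewards: List[float], values: List[float],
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```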
Inference Optimization
- avoid expert calls when controller can self-solve
- budget-conditioned prompts to control inference behavior
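Budget-conditioned prompting can be sketched as prepending a budget marker to the controller's input. The marker strings in `BUDGET_TOKENS` and the `build_controller_prompt` helper are hypothetical; the source only says that a budget token (low/medium/high) appears in the prompt.

```python
# Hypothetical budget markers; the source does not give the exact token strings.
BUDGET_TOKENS = {
    "low": "[BUDGET=LOW]",
    "medium": "[BUDGET=MEDIUM]",
    "high": "[BUDGET=HIGH]",
}


def build_controller_prompt(question: str, budget_mode: str) -> str:
    """Prefix the question with the marker for the requested budget mode."""
    try:
        token = BUDGET_TOKENS[budget_mode]
    except KeyError:
        raise ValueError(f"unknown budget mode: {budget_mode!r}") from None
    return f"{token}\n{question.strip()}"
```

Because the mode is just part of the prompt, the same trained controller can be switched between cost targets per request without reloading weights.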
Reproducibility
Data Urls
- Deepscaler
- MATH500
- AMC2023
- AIME2024
- AIME2025
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only the controller LLM is trained; experts remain frozen. The effects of joint training are unexplored.
- Experiments limited to math reasoning datasets; cross-domain behavior unknown.
- Some evaluation sets are small (AMC/AIME with 30–40 items), increasing variance in results.
- Cost numbers depend on vendor prices cited; actual savings vary by provider and contract.
- Prompt design strongly affects exploration and learned policies (sensitivity to system prompts).
When Not To Use
- When latency must be minimal and contacting external experts adds unacceptable delay.
- When a single, well-tuned LLM already meets accuracy and cost needs.
- When you cannot call external models due to privacy, compliance, or availability constraints.
Failure Modes
- Reward mis-specification can lead to excessive expert calls or degenerate zero-reward behavior if budget thresholds are too strict.
- Prompt constraints can block exploration (hard constraints cause no expert calls even when beneficial).
- Small evaluation sets may hide overfitting to dataset idiosyncrasies; real-world tasks may behave differently.
Core Entities
Models
- Qwen2.5-7B-Instruct
- o3 (OpenAI)
- GPT-4.1
- GPT-4.1-nano
Metrics
- Pass@1
- per-query cost ($)
- expert-call ratio
Datasets
- Deepscaler
- MATH500
- AMC2023
- AIME2024
- AIME2025
Benchmarks
- math reasoning (MATH500/AMC/AIME)

