Train a cheap controller LLM to route queries to expert LLMs via RL so the system meets different cost budgets while keeping high accuracy.

Overview

Decision Snapshot: Needs Validation

The method is a practical system-level design with clear gains on math benchmarks. Evidence is solid for math tasks and the provided baselines. Limitations include frozen experts, narrow task domain, and vendor-dependent cost numbers.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 75%

Production readiness: 60%

Novelty: 40%

Authors

Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang

Why It Matters For Business

CORL enables predictable cost-vs-accuracy trade-offs from one deployed system. You can run a single LLM controller that adapts to customer budget tiers, saving inference spend at scale while keeping acceptable accuracy.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

The paper builds CORL, a centralized multi-LLM system where a controller LLM is trained by reinforcement learning to decide when to answer itself and when to call expensive expert LLMs. The RL objective combines task accuracy and a cost penalty so the same system can run in low/medium/high budget modes. On math benchmarks, CORL beats the controller alone in low-budget mode and can exceed the best single expert in high-budget mode, while allowing predictable per-query cost control.

Problem Statement

Multi-LLM setups can improve accuracy but decentralized designs call every expert for each query, creating uncontrolled and high inference costs. We need a coordinated system that: (1) selectively dispatches queries to cheaper or stronger experts, (2) adapts behavior to different budget modes at inference time, and (3) lets a cheap controller handle queries when possible to avoid needless external calls.

Main Contribution

Formalizes cost-controllable, centralized multi-LLM coordination: a single controller routes queries to frozen expert models under a budget-aware objective.

CORL: an RL training pipeline (PPO + masked token loss) that optimizes a joint reward of task performance and a budget-based cost penalty.
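The masked token loss mentioned above can be pictured with a small example. This is a minimal sketch in plain Python, assuming per-token loss values and a 0/1 mask that marks controller-generated tokens; the function name `masked_mean_loss`, the mask convention, and the example numbers are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def masked_mean_loss(token_losses: Sequence[float], controller_mask: Sequence[int]) -> float:
    """Average per-token loss over controller-generated tokens only.

    controller_mask[i] is 1 if token i was produced by the controller and
    0 if it was inserted from an expert response (assumed convention).
    """
    if len(token_losses) != len(controller_mask):
        raise ValueError("token_losses and controller_mask must have equal length")
    kept = sum(controller_mask)
    if kept == 0:
        return 0.0  # no controller tokens in this rollout, nothing to train on
    total = sum(loss * m for loss, m in zip(token_losses, controller_mask))
    return total / kept


# Tokens at positions 2 and 3 came from an expert, so they do not contribute.
losses = [0.5, 1.0, 9.0, 9.0, 1.5]
mask = [1, 1, 0, 0, 1]
loss = masked_mean_loss(losses, mask)
```

Masking this way keeps the gradient signal tied to tokens the controller actually chose, which is the stated purpose of ignoring expert tokens during the controller update.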

Key Findings

CORL lets one trained controller exceed the best single expert at high budget on evaluated math sets.

Numbers: MATH500: CORL High Pass@1 0.958 vs o3 0.938

Practical Use: If you can afford higher per-query cost, train a controller to route to top experts and get better accuracy than using that expert alone.

Evidence Ref: Table 2

CORL improves low-budget accuracy compared to the controller LLM acting alone.

Numbers: MATH500: CORL Low Pass@1 0.900 vs Qwen2.5-7B 0.708

Practical Use: When cost is tight, train a controller so it solves more queries itself and raises accuracy versus running the base controller alone.

Evidence Ref: Table 2; Figure 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MATH500 Pass@1	CORL High 0.958	o3 0.938	+0.020	MATH500	Table 2 (Pass@1 values for MATH500)	Table 2
MATH500 Pass@1 (low-budget)	CORL Low 0.900	Qwen2.5-7B 0.708	+0.192	MATH500	Table 2 (low-budget vs controller-alone)	Table 2

What To Try In 7 Days

Set up a cheap controller model and two expert models; implement a prompt token for budget levels (low/medium/high).

Implement a simple cost-aware reward (task accuracy multiplied by a cost penalty) and run a short PPO training job to learn routing.

Evaluate expert-call ratio and per-query cost on a small representative dataset; tune budget B to match business SLAs.
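The reward and evaluation steps above can be sketched as follows. This is a rough illustration under stated assumptions: the penalty is a hard threshold (1.0 within budget B, 0.0 over it), and the expert-call ratio is taken to be the fraction of queries with at least one expert call. Neither definition is confirmed by this summary, and the `Rollout` fields, function names, and dollar figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correct: bool      # did the final answer match the reference?
    cost_usd: float    # total spend on expert calls for this query
    expert_calls: int  # number of external expert calls made


def cost_penalty(cost_usd: float, budget_usd: float) -> float:
    """Assumed hard-threshold penalty: 1.0 within budget, 0.0 over it."""
    return 1.0 if cost_usd <= budget_usd else 0.0


def reward(rollout: Rollout, budget_usd: float) -> float:
    """Joint reward: task accuracy multiplied by the cost penalty."""
    return float(rollout.correct) * cost_penalty(rollout.cost_usd, budget_usd)


def expert_call_ratio(rollouts: List[Rollout]) -> float:
    """Assumed definition: share of queries with at least one expert call."""
    return sum(r.expert_calls > 0 for r in rollouts) / len(rollouts)


def mean_cost(rollouts: List[Rollout]) -> float:
    """Average per-query cost in dollars."""
    return sum(r.cost_usd for r in rollouts) / len(rollouts)


rollouts = [
    Rollout(correct=True, cost_usd=0.0, expert_calls=0),
    Rollout(correct=True, cost_usd=0.02, expert_calls=1),
    Rollout(correct=False, cost_usd=0.01, expert_calls=1),
    Rollout(correct=True, cost_usd=0.05, expert_calls=2),
]
budget = 0.03  # hypothetical budget B in dollars
rewards = [reward(r, budget) for r in rollouts]
```

With a hard threshold, a budget set too low zeroes out every rollout that calls an expert, which is one way the zero-reward failure mode described under Failure Modes could arise. A smoother penalty may be worth trying if that happens.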

Agent Features

Memory

short-form rollout history (controller keeps past interactions during a query)

Planning

reasoning and decomposition into subqueries

multi-step decision to call experts

Tool Use

selective expert LLM calls

prompt-based query decoration before external calls

Frameworks

PPO (policy optimization)

masked token loss to ignore expert tokens during controller update

Is Agentic

Yes

Architectures

centralized controller + frozen expert pool

iterative multi-round controller-expert interaction

Collaboration

controller orchestrates and aggregates expert responses
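The agent features listed above (budget-conditioned prompt, rollout history, multi-round expert calls, final aggregation by the controller) can be tied together in a toy loop. This is a minimal sketch with stub callables standing in for real models; the `[budget=...]` tag, the `<expert>` wrapper, the action tuples, and every function name are assumptions made for illustration rather than the paper's actual prompt format or interface.

```python
from typing import Callable, Dict, List, Tuple

# An action is ("answer", text) or ("call", expert_name, subquery).
Action = Tuple[str, ...]


def run_query(
    question: str,
    budget_mode: str,
    controller: Callable[[str], Action],
    experts: Dict[str, Callable[[str], str]],
    max_rounds: int = 4,
) -> Tuple[str, List[str]]:
    """Let the controller act until it answers or the round limit is hit."""
    history = f"[budget={budget_mode}] {question}"  # budget-conditioned prompt
    called: List[str] = []
    for _ in range(max_rounds):
        action = controller(history)
        if action[0] == "answer":
            return action[1], called
        _, name, subquery = action
        reply = experts[name](subquery)  # frozen expert, never updated
        called.append(name)
        history += f"\n<expert name={name}>{reply}</expert>"
    return "", called  # no answer produced within the round limit


def toy_controller(history: str) -> Action:
    """Stub policy: self-answer in low mode, otherwise consult one expert first."""
    if "[budget=low]" in history or "<expert" in history:
        return ("answer", "4")
    return ("call", "strong", "What is 2 + 2?")


experts = {"strong": lambda q: "4"}
low_answer, low_calls = run_query("What is 2 + 2?", "low", toy_controller, experts)
high_answer, high_calls = run_query("What is 2 + 2?", "high", toy_controller, experts)
```

In a trained system the stub policy would be replaced by the controller LLM, and the list of called experts would feed the per-query cost calculation.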

Optimization Features

Token Efficiency

budget token in prompt to control external calls

Infra Optimization

single-node 8x A100 training; FSDP CPU offloading

Model Optimization

controller-only fine-tuning (experts frozen)

System Optimization

use vLLM for efficient rollouts; FSDP and gradient checkpointing for memory savings

Training Optimization

PPO with KL regularization and token masking

GAE for advantage estimation
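For readers unfamiliar with GAE, the standard generalized advantage estimation recursion can be written in a few lines. This is a generic textbook sketch, not the paper's training code; the `gamma` and `lam` defaults and the example rewards and values are assumptions chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
    last_value: float = 0.0,
) -> List[float]:
    """Compute GAE advantages for one episode, working backwards in time.

    delta_t = r_t + gamma * V_{t+1} - V_t
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value  # value after the final step (0.0 if terminal)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Sparse terminal reward, constant value estimates, and lam = 1.0,
# so each advantage reduces to (return - value) = 1.0 - 0.5.
adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
```

Lower values of `lam` shorten the horizon over which later rewards influence earlier tokens, trading variance for bias.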

Inference Optimization

avoid expert calls when controller can self-solve

budget-conditioned prompts to control inference behavior

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Deepscaler, MATH500, AMC2023, AIME2024, AIME2025

Risks & Boundaries

Limitations

Only the controller LLM is trained; experts remain frozen. Joint training effects unexplored.

Experiments limited to math reasoning datasets; cross-domain behavior unknown.

When Not To Use

When latency must be minimal and contacting external experts adds unacceptable delay.

When a single, well-tuned LLM already meets accuracy and cost needs.

Failure Modes

Reward mis-specification can lead to excessive expert calls or degenerate zero-reward behavior if budget thresholds are too strict.

Prompt constraints can block exploration (hard constraints cause no expert calls even when beneficial).

Core Entities

Models

Qwen2.5-7B-Instruct, o3 (OpenAI), GPT-4.1, GPT-4.1-nano

Metrics

Pass@1, per-query cost ($), expert-call ratio

Datasets

Deepscaler, MATH500, AMC2023, AIME2024, AIME2025

Benchmarks

math reasoning (MATH500/AMC/AIME)
