Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
CoA shows that multi-agent workflows can be captured inside a single model, which cuts token and tool-call costs and improves task success on web search, coding, and math problems. This lowers the API/inference bill and simplifies engineering (fewer moving parts).
Summary TLDR
This paper introduces Chain-of-Agents (CoA), a way to train a single LLM to simulate multi-agent workflows end-to-end. The authors distill trajectories from multi-agent systems into supervised fine-tuning data, then improve the model with agentic reinforcement learning. The resulting Agent Foundation Models (AFMs) reach new state-of-the-art results on many web, code, and math benchmarks (examples: GAIA 55.3% Pass@1, LiveCodeBench v5 47.9% Pass@1, AIME25 59.8% avg@16) while consuming a reported 84.6% fewer tokens than traditional multi-agent frameworks. The paper reports that all code, weights, and data are open-sourced.
Problem Statement
Existing multi-agent systems work well but rely on manual workflow and prompt engineering, create heavy communication/token costs, and can’t be trained end-to-end. The paper asks: can one model be trained to natively emulate multi-agent collaboration (tools + roles) and be improved by data-driven training and RL?
Main Contribution
Chain-of-Agents (CoA): a modelling paradigm that lets a single LLM dynamically activate role-playing and tool agents to simulate multi-agent collaboration inside one decoding process.
Multi-agent distillation: a pipeline that records trajectories of strong multi-agent systems (e.g., OAgents) and converts them into CoA-format supervised fine-tuning data.
Agentic RL: a reinforcement learning stage (DAPO / VeRL) with task-level reward design to further optimize tool coordination and long-horizon success.
Agent Foundation Models (AFMs): trained models (SFT-only and SFT+RL) that set new state-of-the-art on many web, code, and math agent benchmarks. The paper open-sources code, weights and data.
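The distillation step above can be pictured as flattening one recorded multi-agent trajectory into a single training sequence. The sketch below is only an illustration under stated assumptions: the `Step` type, the role names, the angle-bracket tags, and the `prompt`/`completion` keys are all hypothetical and are not taken from the paper's actual CoA data format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str     # hypothetical role name, e.g. "plan", "search", "observation"
    content: str  # text produced by that agent, or a tool observation


def to_coa_sample(question: str, steps: list[Step], answer: str) -> dict:
    """Flatten one multi-agent trajectory into a single prompt/completion pair.

    Each step is wrapped in a tag named after its role, so a single model can
    learn to emit the whole role sequence in one decoding pass. The tag scheme
    is an assumption made for this sketch.
    """
    body = "".join(f"<{s.role}>{s.content}</{s.role}>\n" for s in steps)
    return {"prompt": question, "completion": body + f"<answer>{answer}</answer>"}


# Toy trajectory, written by hand for illustration.
trajectory = [
    Step("plan", "Find the release year, then compute the age."),
    Step("search", "query: release year of Python 1.0"),
    Step("observation", "Python 1.0 was released in 1994."),
]
sample = to_coa_sample("How old was Python 1.0 in 2024?", trajectory, "30")
print(sample["completion"])
```

In a real pipeline, tool observations would typically be masked out of the training loss so the model learns to produce agent actions rather than tool outputs; that detail is omitted here.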
Key Findings
AFM achieves new state-of-the-art on web agent benchmarks using a 32B backbone.
Agent foundation models improve code and math contest performance after RL.
AFM cuts inference token consumption vs. traditional multi-agent systems.
Results
GAIA Pass@1 (web agent): 55.3%
LiveCodeBench v5 Pass@1 (code agent): 47.9%
AIME25 avg@16 (math): 59.8%
Token consumption reduction: 84.6% (reported, vs. traditional multi-agent frameworks)
Who Should Care
What To Try In 7 Days
Run a quick distillation experiment: record trajectories from an existing multi-agent pipeline (10-100 tasks) and fine-tune your backbone on those trajectories.
Evaluate token consumption and tool-call count before and after distillation to measure cost savings.
If you have verifiable tasks (code/tests or math), add a small RL loop with binary success rewards to see short-term gains.
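A before/after comparison like the one suggested above might be scripted as follows. This is a minimal sketch, not the paper's evaluation code: the `Run` record, the "per success" definition (totals divided by the number of successful runs), and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tokens: int      # total tokens consumed by the run
    tool_calls: int  # number of tool invocations
    success: bool    # did the task pass its verifier (tests, exact answer)?


def binary_reward(passed: bool) -> float:
    """Binary success reward for verifiable tasks: 1.0 on pass, else 0.0."""
    return 1.0 if passed else 0.0


def per_success(runs: list[Run]) -> dict:
    """Success rate plus token and tool-call cost per successful run."""
    wins = sum(r.success for r in runs)
    if wins == 0:
        inf = float("inf")
        return {"success_rate": 0.0, "tokens_per_success": inf,
                "tool_calls_per_success": inf}
    return {
        "success_rate": wins / len(runs),
        "tokens_per_success": sum(r.tokens for r in runs) / wins,
        "tool_calls_per_success": sum(r.tool_calls for r in runs) / wins,
    }


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Made-up numbers, only to show the calculation.
baseline = [Run(12000, 9, True), Run(15000, 11, False)]
distilled = [Run(3000, 4, True), Run(2500, 3, True)]

b, d = per_success(baseline), per_success(distilled)
print("token reduction %:",
      round(reduction_pct(b["tokens_per_success"], d["tokens_per_success"]), 1))
```

Counting cost per success rather than per run penalizes a pipeline that is cheap but fails often, which matters when comparing a distilled model against its multi-agent teacher.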
Agent Features
Memory
- Persistent reasoning state S_t during decoding (keeps context across roles)
- Long context windows (16k–32k tokens) for extended reasoning
Planning
- Plan Agent for task decomposition
- Thinking Agent coordinates role activation
- Reflection and Verification agents for self-critique
Tool Use
- Search Agent (Serpapi)
- Crawl Page Agent (Jina + page summarization)
- Code Generate / Execute Agent (nsjail sandbox)
Frameworks
- Multi-agent distillation (teacher: OAgents)
- Agentic RL using DAPO and VeRL
Is Agentic
true
Architectures
- Chain-of-Agents (single-model multi-role decoding)
- Role-based activation inside one decoder
Collaboration
- Dynamic activation of role-playing agents inside single model
- Distilled multi-agent activation sequences (agent-level trajectories)
Optimization Features
Token Efficiency
- Reported 84.6% reduction in token consumption vs multi-agent systems
Model Optimization
- Sequence-level agent distillation (transfer of agent activation sequences)
System Optimization
- SFT
- Context length management (16k→32k schedule)
Training Optimization
- SFT
- DAPO policy optimization for RL stage
Inference Optimization
- Test-time scaling (best-of-N and Pass@K selection strategies)
- Fewer tool calls by modeling intra-agent communication inside model
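For the test-time scaling item above, the two selection ideas can be sketched in a few lines. The `pass_at_k` function uses the widely used unbiased estimator for pass@k; whether the paper computes Pass@K this way is an assumption. The scorer passed to `best_of_n` is a hypothetical stand-in for a verifier or reward model.

```python
from math import comb
from typing import Callable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct.

    It is the probability that a random size-k subset of the n samples
    contains at least one correct sample. Requires 1 <= k <= n.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


# 16 samples with 4 correct: pass@1 equals the plain accuracy, 4/16.
print(pass_at_k(16, 4, 1))
print(pass_at_k(16, 4, 3))

# Toy scorer (answer length), purely a placeholder for a real judge.
print(best_of_n(["a", "abc", "ab"], score=len))
```

Both strategies multiply inference cost by roughly the number of samples drawn, so any accuracy gain should be reported alongside the extra tokens spent.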
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Tool-format sensitivity: models trained with strict code-format constraints generalize poorly to different formatting requirements (Section 5.2).
- RL and distillation require substantial compute and curated high-quality trajectories; dataset curation is non-trivial.
- Reported token-efficiency numbers are based on small GAIA efficiency trials (10 samples cited) and may vary by domain.
- Pass@K / test-time scaling improves results but increases inference cost; trade-offs must be measured.
When Not To Use
- When you cannot collect high-quality multi-agent trajectories for distillation.
- When strict per-tool formatting is unknown or highly variable and you cannot retrain for that format.
- When you need ultra-low-latency single-shot inference with no room for model-side orchestration.
Failure Modes
- Format errors at tool invocation (missing backticks, bad JSON) cause parser errors and task abortion (Section 5.2).
- Overfitting to distilled agent behaviors that rely on specific external tool implementations.
- Reward-design brittleness in RL stage if the judge model is biased or miscalibrated.
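The format-error failure mode listed above is often mitigated by validating a tool call before dispatching it and returning a readable error instead of aborting the task. The sketch below assumes a hypothetical convention in which the model emits a fenced `json` block containing a `tool` field; the paper's real tool-call format may differ.

```python
import json
import re
from typing import Optional

# The fence marker is built programmatically so this example contains no literal fence.
FENCE = "`" * 3
TOOL_CALL = re.compile(
    re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_tool_call(text: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (call, None) on success or (None, error_message) on failure."""
    match = TOOL_CALL.search(text)
    if match is None:
        return None, "no fenced json block found"
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict) or "tool" not in call:
        return None, "missing 'tool' field"
    return call, None


good = "I will search.\n" + FENCE + 'json\n{"tool": "search", "query": "x"}\n' + FENCE
bad = FENCE + 'json\n{"tool": "search",}\n' + FENCE  # trailing comma is invalid JSON

print(parse_tool_call(good))
print(parse_tool_call(bad))
print(parse_tool_call("no fence at all"))
```

Feeding the error message back to the model as an observation gives it a chance to retry with corrected formatting, at the cost of extra tokens.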
Core Entities
Models
- Agent Foundation Model (AFM)
- AFM-SFT
- AFM-RL
- Qwen2.5-3B-Instruct
- Qwen2.5-7B-Instruct
- Qwen2.5-32B-Instruct
- Qwen2.5-Coder-7B-Instruct
- Qwen2.5-Coder-32B-Instruct
Metrics
- Pass@1
- avg@16
- Accuracy
- Token consumption per success
- Tool calls per success
Datasets
- GAIA
- BrowseComp
- HLE
- WebWalker
- NQ
- HotpotQA
- TriviaQA
- PopQA
- 2Wiki
- Musique
- LiveCodeBench v4-v5
- CodeContests
- AIME24
- AIME25
- MATH500
- AMC23
- OlympiadBench
Benchmarks
- GAIA
- BrowseComp
- HLE
- MHQA (multi-hop QA set)
- LiveCodeBench
- CodeContests
- AIME25
Context Entities
Models
- Deepseek-R1
- WebSailor
- WebShaper
- ReTool
- SimpleTIR
- Reveal
- ZeroSearch
Metrics
- Pass@1
- EM / F1 (not used directly for open-ended reward)
Datasets
- NQ
- HotpotQA
- TriviaQA
- PopQA
Benchmarks
- GAIA
- BrowseComp
- HLE

