Overview
The empirical evidence is strong across many benchmarks and model sizes. Results come from numerous SFT and RL experiments and ablations, and the open-sourced assets improve reproducibility. Real-world deployment still requires engineering work on tool-format precision and a budget for RL compute.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 65%
Why It Matters For Business
CoA shows that multi-agent workflows can be captured inside a single model, which reduces token and tool-call costs and improves task success on web search, coding, and math problems. This lowers the API/inference bill and simplifies engineering (fewer moving parts).
Who Should Care
Summary TLDR
This paper introduces Chain-of-Agents (CoA), a way to train a single LLM to simulate multi-agent workflows end-to-end. The authors distill trajectories from multi-agent systems into supervised fine-tuning data, then improve the model with agentic reinforcement learning. The resulting Agent Foundation Models (AFMs) reach new state-of-the-art results on many web, code, and math benchmarks (for example, GAIA 55.3% Pass@1, LiveCodeBench v5 47.9% Pass@1, AIME25 59.8% avg@16) while consuming a reported 84.6% fewer tokens than traditional multi-agent frameworks. The paper reports that all code, weights, and data are open-sourced.
Problem Statement
Existing multi-agent systems work well but rely on manual workflow and prompt engineering, create heavy communication/token costs, and can’t be trained end-to-end. The paper asks: can one model be trained to natively emulate multi-agent collaboration (tools + roles) and be improved by data-driven training and RL?
Main Contribution
Chain-of-Agents (CoA): a modelling paradigm that lets a single LLM dynamically activate role-playing and tool agents to simulate multi-agent collaboration inside one decoding process.
Multi-agent distillation: a pipeline that records trajectories of strong multi-agent systems (e.g., OAgents) and converts them into CoA-format supervised fine-tuning data.
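The distillation step above can be pictured as flattening a recorded multi-agent trajectory into one training string. The following is a minimal sketch under stated assumptions: the `Step` record, the tag names (`plan`, `code`, `observation`, `answer`), and the prompt/completion layout are hypothetical illustrations, not the paper's actual CoA format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    agent: str    # hypothetical role or tool name, e.g. "plan" or "code"
    content: str  # text produced by that agent, or the tool's observation


def to_coa_sample(task: str, steps: list[Step], answer: str) -> dict[str, str]:
    """Flatten one recorded trajectory into a single prompt/completion pair."""
    # Each agent turn becomes one tagged segment of the same completion,
    # so a single model learns to emit the whole collaboration in one pass.
    parts = [f"<{s.agent}>{s.content}</{s.agent}>" for s in steps]
    parts.append(f"<answer>{answer}</answer>")
    return {"prompt": task, "completion": "\n".join(parts)}


# Hypothetical trajectory recorded from a multi-agent pipeline.
sample = to_coa_sample(
    "What is 17 * 3?",
    [
        Step("plan", "Use the code tool to multiply."),
        Step("code", "print(17 * 3)"),
        Step("observation", "51"),
    ],
    "51",
)
```

In practice a filtering pass would likely sit in front of this conversion so that only successful trajectories become training data; that step is omitted here.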
Key Findings
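AFM with a 32B backbone sets a new state of the art on web agent benchmarks (for example, 55.3% Pass@1 on GAIA; Table 7).
Agent foundation models improve code and math contest performance after RL (for example, a 3.2-point Pass@1 gain over AFM-SFT on LiveCodeBench v5 at 32B; Table 12).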
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| GAIA Pass@1 (web agent) | 55.3% | WebSailor (same size) 53.2% / WebDancer 51.5% | +2.1 pp vs WebSailor (same backbone, Table 7) | GAIA (text-only subset) | AFM-RL with Qwen-2.5-32B-Instruct backbone achieved 55.3% Pass@1 (Table 7) | Table 7 |
| LiveCodeBench v5 Pass@1 (code agent) | 47.9% | ReTool / Reveal (same-size baselines, reported lower) | +3.2 pp vs AFM-SFT (32B, SFT to RL gain; Table 12) | LiveCodeBench v5 | AFM-RL (32B) reached 47.9% Pass@1 (Table 12) | Table 12 |
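
The Delta column reads most naturally as an absolute difference between two percentages (percentage points), which is not the same as a relative improvement. A small sketch using only the GAIA numbers from the table makes the distinction concrete; the helper names are illustrative.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 1)


# GAIA Pass@1 figures as listed in the table above.
vs_websailor = delta_pp(55.3, 53.2)
vs_webdancer = delta_pp(55.3, 51.5)
vs_websailor_relative = relative_gain(55.3, 53.2)
```

The 2.1-point gap over WebSailor matches the reported delta, while the relative gain over the same baseline works out to roughly 3.9%.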
What To Try In 7 Days
Run a quick distillation experiment: record trajectories from an existing multi-agent pipeline (10-100 tasks) and fine-tune your backbone on those trajectories.
Evaluate token consumption and tool-call count before and after distillation to measure cost savings.
If you have verifiable tasks (code/tests or math), add a small RL loop with binary success rewards to see short-term gains.
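The second and third steps above can be sketched in a few lines. This is a minimal illustration, assuming you already log per-task token and tool-call counts; the `RunStats` record, the sample measurements, and the exact-match reward are hypothetical and would need adapting to your own pipeline (for code tasks the reward would typically be "all unit tests pass").

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    tokens: int      # total tokens consumed for one task
    tool_calls: int  # number of tool invocations for one task


def mean_reduction(before: list[float], after: list[float]) -> float:
    """Percent reduction of the mean of `after` relative to the mean of `before`."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return 100.0 * (mean_before - mean_after) / mean_before


def binary_reward(predicted: str, expected: str) -> float:
    """1.0 if the final answer matches exactly after trimming whitespace, else 0.0."""
    return 1.0 if predicted.strip() == expected.strip() else 0.0


# Hypothetical measurements: multi-agent pipeline vs. distilled single model.
baseline = [RunStats(12000, 9), RunStats(8000, 7)]
distilled = [RunStats(3000, 4), RunStats(2000, 4)]

token_saving = mean_reduction(
    [r.tokens for r in baseline], [r.tokens for r in distilled]
)
call_saving = mean_reduction(
    [r.tool_calls for r in baseline], [r.tool_calls for r in distilled]
)
```

Comparing on the same task set before and after distillation keeps the two means comparable; the numbers above are made up and say nothing about the savings you should expect.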
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Tool-format sensitivity: models trained with strict code-format constraints generalize poorly to different formatting requirements (Section 5.2).
RL and distillation require substantial compute and curated high-quality trajectories; dataset curation is non-trivial.
When Not To Use
When you cannot collect high-quality multi-agent trajectories for distillation.
When strict per-tool formatting is unknown or highly variable and you cannot retrain for that format.
Failure Modes
Format errors at tool invocation (missing backticks, bad JSON) cause parser errors and task abortion (Section 5.2).
Overfitting to distilled agent behaviors that rely on specific external tool implementations.
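One way to soften the format-error failure mode is to validate tool calls before executing them and return a readable error instead of aborting the task. The sketch below is an assumption about how such a guard could look, not the paper's parser: the `name` field, the function names, and the error strings are hypothetical.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # triple backtick, built programmatically so this snippet nests cleanly
CODE_BLOCK = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)


def parse_json_call(raw: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (call, None) on success or (None, error_message) on a format error."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"bad JSON: {exc.msg}"
    if not isinstance(call, dict) or "name" not in call:
        return None, "missing 'name' field"
    return call, None


def extract_code(raw: str) -> tuple[Optional[str], Optional[str]]:
    """Return (code, None) if a fenced block is present, else (None, error_message)."""
    match = CODE_BLOCK.search(raw)
    if match is None:
        return None, "missing code fence"
    return match.group(1), None


ok_call, ok_err = parse_json_call('{"name": "search", "arguments": {"q": "x"}}')
bad_call, bad_err = parse_json_call('{"name": "search",')
no_code, no_code_err = extract_code("print(1)")
code, code_err = extract_code(FENCE + "python\nprint(1)\n" + FENCE)
```

Feeding the error message back to the model as an observation, rather than terminating the episode, gives it a chance to retry with a corrected call.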

