Train one model to act like many agents: Chain-of-Agents (CoA) and Agent Foundation Models (AFM)

August 6, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

Strong empirical evidence across many benchmarks and model sizes. Results come from many SFT and RL experiments and ablations; open-sourced assets increase reproducibility. Real-world deployment still needs engineering for tool-format precision and RL compute.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou


Why It Matters For Business

CoA shows that multi-agent workflows can be captured inside a single model, which reduces token and tool-call costs and improves task success on web search, coding, and math problems. That lowers the API/inference bill and simplifies engineering (fewer moving parts).

Summary TLDR

This paper introduces Chain-of-Agents (CoA): a way to train a single LLM to simulate multi-agent workflows end-to-end. The authors distill trajectories from multi-agent systems into supervised fine-tuning (SFT) data, then improve the model with agentic reinforcement learning (RL). The resulting Agent Foundation Models (AFMs) reach new state-of-the-art results on many web, code, and math benchmarks (examples: GAIA 55.3% Pass@1, LiveCodeBench v5 47.9% Pass@1, AIME25 59.8% avg@16) while reducing token consumption relative to traditional multi-agent frameworks (a reported 84.6% reduction). The paper reports that all code, weights, and data are open-sourced.

Problem Statement

Existing multi-agent systems work well but rely on manual workflow and prompt engineering, create heavy communication/token costs, and can’t be trained end-to-end. The paper asks: can one model be trained to natively emulate multi-agent collaboration (tools + roles) and be improved by data-driven training and RL?

Main Contribution

Chain-of-Agents (CoA): a modelling paradigm that lets a single LLM dynamically activate role-playing and tool agents to simulate multi-agent collaboration inside one decoding process.

Multi-agent distillation: a pipeline that records trajectories of strong multi-agent systems (e.g., OAgents) and converts them into CoA-format supervised fine-tuning data.
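As a rough illustration of the distillation step described above, the sketch below flattens one recorded multi-agent trajectory into a single prompt/completion pair for SFT. The tag layout (`<plan>`, `<search>`, `<answer>`), the `AgentStep` class, and the answer-match filter are all hypothetical; this summary does not specify the paper's actual CoA format or filtering rules.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    role: str     # hypothetical role name, e.g. "plan", "search", "reflection"
    content: str  # text produced by that agent, or a tool observation


def to_coa_sample(question: str, steps: list[AgentStep], answer: str) -> dict:
    """Flatten one multi-agent trajectory into a single SFT example.

    Each agent turn becomes a tagged span so that one decoder can learn to
    emit the whole role sequence itself (assumed format, not the paper's).
    """
    parts = [f"<{s.role}>{s.content.strip()}</{s.role}>" for s in steps]
    parts.append(f"<answer>{answer.strip()}</answer>")
    return {"prompt": question.strip(), "completion": "\n".join(parts)}


def keep_trajectory(predicted: str, reference: str) -> bool:
    """Assumed quality filter: keep only trajectories with a correct final answer."""
    return predicted.strip().lower() == reference.strip().lower()
```

A teacher run that ends in a wrong answer would be dropped by `keep_trajectory` before `to_coa_sample` is called, which is one simple way to approximate the "high-quality trajectories" requirement mentioned under Limitations.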

Key Findings

AFM achieves new state-of-the-art on web agent benchmarks using a 32B backbone.

Numbers: GAIA Pass@1 = 55.3% (Qwen-2.5-32B-Instruct, Table 7)

Practical Use: If you run tool-enabled web assistants, distilling multi-agent traces into a single model can raise question-answering success rates; try CoA-style SFT on your backbone to improve web search tasks.

Evidence Ref: Table 7

Agent foundation models improve code and math contest performance after RL.

Numbers: LiveCodeBench v5 Pass@1 = 47.9%; AIME25 avg@16 = 59.8% (AFM-RL, 32B; Tables 12 & 11)

Practical Use: For coding or contest-math tasks, follow the paper's SFT-from-distillation then agentic RL pipeline to boost pass rates and generalization to hard, verifiable problems.

Evidence Ref: Tables 12 and 11

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| GAIA Pass@1 (web agent) | 55.3% | WebSailor (same size) 53.2% / WebDancer 51.5% | +2.1% vs WebSailor (same backbone, Table 7) | GAIA (text-only subset) | AFM-RL with Qwen-2.5-32B-Instruct backbone achieved 55.3% Pass@1 (Table 7) | Table 7 |
| LiveCodeBench v5 Pass@1 (code agent) | 47.9% | ReTool / Reveal reported lower for same-size baselines | +3.2% vs AFM-SFT (32B SFT->RL gain; Table 12) | LiveCodeBench v5 | AFM-RL (32B) reached 47.9% Pass@1 (Table 12) | Table 12 |
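To make the metric names in these results concrete, here is a minimal sketch of how Pass@1, avg@k, and a percentage-point delta could be computed from raw correctness flags. The function names are hypothetical, and reading avg@16 as "mean accuracy over 16 samples per task" is an assumption; the paper's exact definition is not given in this summary.

```python
def pass_at_1(first_attempt_correct: list[bool]) -> float:
    """Percentage of tasks solved by the single first attempt."""
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)


def avg_at_k(samples_correct: list[list[bool]]) -> float:
    """Mean per-task accuracy over k samples, averaged over tasks (assumed definition)."""
    per_task = [sum(s) / len(s) for s in samples_correct]
    return 100.0 * sum(per_task) / len(per_task)


def delta_points(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


# The GAIA row above: 55.3% vs the 53.2% WebSailor baseline.
gaia_delta = delta_points(55.3, 53.2)
```

Note that the delta here is an absolute difference between two percentages, not a relative improvement.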

What To Try In 7 Days

Run a quick distillation experiment: record trajectories from an existing multi-agent pipeline (10-100 tasks) and fine-tune your backbone on those trajectories.

Evaluate token consumption and tool-call count before and after distillation to measure cost savings.

If you have verifiable tasks (code/tests or math), add a small RL loop with binary success rewards to see short-term gains.
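For the last suggestion, a small RL loop with binary success rewards, the core bookkeeping might look like the sketch below. It scores each sampled answer as 0 or 1, then turns a group of rewards for the same prompt into group-normalized advantages, skipping groups where every sample got the same reward because they carry no learning signal (similar in spirit to the dynamic sampling used by DAPO). The function names, the exact-match check, and the normalization details are assumptions for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev
from typing import Optional


def binary_reward(predicted: str, reference: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0 (assumed check)."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> Optional[list[float]]:
    """Normalize rewards within one prompt's sample group.

    Returns None when all rewards are identical (all correct or all wrong),
    so the caller can drop that prompt from the batch.
    """
    if len(set(rewards)) < 2:
        return None
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem whose reference answer is "42".
rewards = [binary_reward(a, "42") for a in ["42", "41", "7", " 42 "]]
advantages = group_advantages(rewards)
```

For code tasks, `binary_reward` would typically be replaced by "all unit tests pass", executed in a sandbox.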

Agent Features

Memory: Persistent reasoning state S_t during decoding (keeps context across roles); Long context windows (16k–32k tokens) for extended reasoning
Planning: Plan Agent for task decomposition; Thinking Agent coordinates role activation; Reflection and Verification agents for self-critique
Tool Use: Search Agent (Serpapi); Crawl Page Agent (Jina + page summarization); Code Generate / Execute Agent (nsjail sandbox)
Frameworks: Multi-agent distillation (teacher: OAgents); Agentic RL using DAPO and VeRL
Is Agentic: Yes
Architectures: Chain-of-Agents (single-model multi-role decoding); Role-based activation inside one decoder
Collaboration: Dynamic activation of role-playing agents inside single model; Distilled multi-agent activation sequences (agent-level trajectories)
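One way to picture single-model multi-role decoding is a small runtime loop: the model emits tagged role spans, the runtime executes any span whose tag names a tool, appends the observation, and lets the model continue until it emits an answer. The sketch below assumes a flat (non-nested) tag format and a `generate` callable standing in for the model; both are hypothetical, as are the tag names.

```python
import re
from typing import Callable

TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)  # assumes flat, non-nested tags


def run_chain(
    generate: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_turns: int = 8,
) -> str:
    """Drive one model through role and tool spans until it answers."""
    context = question
    for _ in range(max_turns):
        chunk = generate(context)
        context += "\n" + chunk
        for m in TAG.finditer(chunk):
            name, body = m.group(1), m.group(2).strip()
            if name == "answer":
                return body
            if name in tools:
                # Tool agents run outside the model; role agents (plan,
                # reflection, ...) are just text and need no runtime action.
                context += f"\n<observation>{tools[name](body)}</observation>"
                break
    return ""  # no answer within the turn budget
```

Because role-playing agents are only text inside the same context, the only round trips left are the tool calls, which is where the reported token and tool-call savings would come from.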

Optimization Features

Token Efficiency: Reported 84.6% reduction in token consumption vs multi-agent systems
Model Optimization: Sequence-level agent distillation (transfer of agent activation sequences)
System Optimization: SFT; Context length management (16k→32k schedule)
Training Optimization: SFT; DAPO policy optimization for RL stage
Inference Optimization: Test-time scaling (best-of-N and Pass@K selection strategies); Fewer tool calls by modeling intra-agent communication inside model
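To check a token-efficiency claim like the one above on your own workload, a simple before/after comparison is usually enough. The sketch below computes tokens and tool calls per successful task and the percentage reduction between two systems. `RunLog` and the "total cost divided by number of successes" definition are assumptions; the paper's exact accounting is not stated in this summary.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    tokens: int       # prompt + completion tokens for one task
    tool_calls: int   # number of external tool invocations
    success: bool     # whether the task was solved


def per_success(runs: list[RunLog], field: str) -> float:
    """Total cost in `field` divided by the number of solved tasks."""
    solved = sum(r.success for r in runs)
    if solved == 0:
        return float("inf")
    return sum(getattr(r, field) for r in runs) / solved


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


# Made-up logs for illustration only.
multi_agent = [RunLog(1000, 6, True), RunLog(3000, 10, True)]
single_model = [RunLog(300, 2, True), RunLog(500, 2, True)]
token_saving = reduction_pct(
    per_success(multi_agent, "tokens"), per_success(single_model, "tokens")
)
```

Dividing by successes rather than by tasks penalizes a cheaper system that also fails more often, which keeps the comparison honest.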

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Tool-format sensitivity: models trained with strict code-format constraints generalize poorly to different formatting requirements (Section 5.2).

RL and distillation require substantial compute and curated high-quality trajectories; dataset curation is non-trivial.

When Not To Use

When you cannot collect high-quality multi-agent trajectories for distillation.

When strict per-tool formatting is unknown or highly variable and you cannot retrain for that format.

Failure Modes

Format errors at tool invocation (missing backticks, bad JSON) cause parser errors and task abortion (Section 5.2).

Overfitting to distilled agent behaviors that rely on specific external tool implementations.
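The format-error failure mode can often be softened on the runtime side by validating each tool call and returning a structured error instead of aborting the task. The sketch below assumes a hypothetical tool-call format, a JSON object with a `name` field inside a backtick-fenced block; the paper's actual format is not described in this summary.

```python
import json
import re

TICKS = "`" * 3  # built at runtime to avoid a literal fence inside this example
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_tool_call(text: str):
    """Return (call, error); exactly one of the two is None."""
    m = FENCE.search(text)
    if m is None:
        return None, "missing code fence (backticks)"
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError as exc:
        return None, f"bad JSON: {exc.msg}"
    if not isinstance(call, dict) or "name" not in call:
        return None, "tool call must be a JSON object with a 'name' field"
    return call, None
```

The error string can be fed back to the model as an observation so it gets a chance to retry, and the same checks can double as a filter when curating distillation data.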

Core Entities

Models

Agent Foundation Model (AFM), SFT, AFM-RL, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-Coder-7B-Instruct, Qwen2.5-Coder-32B-Instruct

Metrics

Pass@1, avg@16, Accuracy, Token consumption per success, Tool calls per success

Datasets

GAIA, BrowseComp, HLE, WebWalker, NQ, HotpotQA, TriviaQA, PopQA, 2Wiki, Musique, LiveCodeBench v4-v5, CodeContests, AIME24, AIME25, MATH500, AMC23, OlympiadBench

Benchmarks

GAIA, BrowseComp, HLE, MHQA (multi-hop QA set), LiveCodeBench, CodeContests, AIME25

Context Entities

Models

Deepseek-R1, WebSailor, WebShaper, ReTool, SimpleTIR, Reveal, ZeroSearch

Metrics

Pass@1, EM / F1 (not used directly for open-ended reward)

Datasets

NQ, HotpotQA, TriviaQA, PopQA

Benchmarks

GAIA, BrowseComp, HLE