Train one model to act like many agents: Chain-of-Agents (CoA) and Agent Foundation Models (AFM)

Overview

Decision Snapshot: Ready For Pilot

Strong empirical evidence across many benchmarks and model sizes. Results come from many SFT and RL experiments and ablations; open-sourced assets increase reproducibility. Real-world deployment still needs engineering for tool-format precision and RL compute.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou

Links

Abstract / PDF

Why It Matters For Business

CoA shows that multi-agent workflows can be captured inside a single model, which cuts token and tool-call costs and improves task success on web search, coding, and math problems. This lowers API and inference bills and simplifies engineering, since there are fewer moving parts to maintain.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces Chain-of-Agents (CoA): a way to train a single LLM to simulate multi-agent workflows end-to-end. The authors distill trajectories from multi-agent systems into supervised fine-tuning (SFT) data, then improve the model with agentic reinforcement learning (RL). The resulting Agent Foundation Models (AFMs) reach new state-of-the-art results on many web, code, and math benchmarks (examples: GAIA 55.3% Pass@1, LiveCodeBench v5 47.9% Pass@1, AIME25 59.8% avg@16) while using fewer tokens than traditional multi-agent frameworks (a reported 84.6% reduction). The paper reports that all code, weights, and data are open-sourced.

Problem Statement

Existing multi-agent systems work well but rely on manual workflow and prompt engineering, create heavy communication/token costs, and can’t be trained end-to-end. The paper asks: can one model be trained to natively emulate multi-agent collaboration (tools + roles) and be improved by data-driven training and RL?

Main Contribution

Chain-of-Agents (CoA): a modelling paradigm that lets a single LLM dynamically activate role-playing and tool agents to simulate multi-agent collaboration inside one decoding process.

Multi-agent distillation: a pipeline that records trajectories of strong multi-agent systems (e.g., OAgents) and converts them into CoA-format supervised fine-tuning data.
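The distillation step can be pictured as flattening a recorded multi-agent run into one training sequence. The sketch below is a minimal illustration under stated assumptions, not the paper's pipeline: the `Step` record, the `<role>...</role>` tag format, and the helpers `to_sft_example` and `keep_trajectory` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One recorded action from a teacher multi-agent run (hypothetical schema)."""
    agent: str                    # e.g. "plan", "search", "reflect"
    content: str                  # text the agent produced or the tool returned
    is_observation: bool = False  # True for tool outputs the model did not write


def to_sft_example(task: str, steps: List[Step], answer: str) -> Dict[str, str]:
    """Flatten a trajectory into a prompt/completion pair with role tags."""
    parts = []
    for step in steps:
        # Tool outputs get a generic tag; agent turns are tagged by role name.
        tag = "observation" if step.is_observation else step.agent
        parts.append(f"<{tag}>{step.content}</{tag}>")
    parts.append(f"<answer>{answer}</answer>")
    return {"prompt": task, "completion": "\n".join(parts)}


def keep_trajectory(predicted: str, gold: str) -> bool:
    """Assumed quality filter: keep only runs whose final answer matches the label."""
    return predicted.strip().lower() == gold.strip().lower()


if __name__ == "__main__":
    steps = [
        Step("plan", "Find the founding year, then subtract from 2020."),
        Step("search", "founding year of X"),
        Step("search", "Founded in 1998.", is_observation=True),
    ]
    print(to_sft_example("How old was X in 2020?", steps, "22")["completion"])
```

Keeping tool outputs in a separately tagged span makes it straightforward to exclude them from the training loss, which is a common choice when the model did not generate those tokens.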

Key Findings

AFM achieves new state-of-the-art on web agent benchmarks using a 32B backbone.

Numbers: GAIA Pass@1 = 55.3% (Qwen-2.5-32B-Instruct, Table 7)

Practical Use: If you run tool-enabled web assistants, distilling multi-agent traces into a single model can raise question-answering success rates; try CoA-style SFT on your backbone to improve web search tasks.

Evidence Ref: Table 7

Agent foundation models improve code and math contest performance after RL.

Numbers: LiveCodeBench v5 Pass@1 = 47.9%; AIME25 avg@16 = 59.8% (AFM-RL, 32B; Tables 12 & 11)

Practical Use: For coding or contest-math tasks, follow the paper's SFT-from-distillation then agentic RL pipeline to boost pass rates and generalization to hard, verifiable problems.

Evidence Ref: Tables 12 and 11

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
GAIA Pass@1 (web agent)	55.3%	WebSailor (same size) 53.2% / WebDancer 51.5%	+2.1 percentage points vs WebSailor (same backbone, Table 7)	GAIA (text-only subset)	AFM-RL with Qwen-2.5-32B-Instruct backbone achieved 55.3% Pass@1 (Table 7)	Table 7
LiveCodeBench v5 Pass@1 (code agent)	47.9%	ReTool / Reveal reported lower for same-size baselines	+3.2% vs AFM-SFT (32B SFT->RL gain; Table 12)	LiveCodeBench v5	AFM-RL (32B) reached 47.9% Pass@1 (Table 12)	Table 12

What To Try In 7 Days

Run a quick distillation experiment: record trajectories from an existing multi-agent pipeline (10-100 tasks) and fine-tune your backbone on those trajectories.

Evaluate token consumption and tool-call count before and after distillation to measure cost savings.

If you have verifiable tasks (code/tests or math), add a small RL loop with binary success rewards to see short-term gains.
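A binary success reward for verifiable tasks can be as simple as "did the answer match" or "did the tests pass". The following is a minimal sketch under stated assumptions: `math_reward` and `code_reward` are hypothetical helpers, the answer normalization is deliberately crude, and the plain subprocess used here is not a sandbox (the paper's code agent is listed as using nsjail).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def _normalize(answer: str) -> str:
    """Crude normalization: trim, drop a trailing period, remove spaces, lowercase."""
    return answer.strip().rstrip(".").replace(" ", "").lower()


def math_reward(predicted: str, gold: str) -> float:
    """Return 1.0 when the normalized answers match exactly, else 0.0."""
    return 1.0 if _normalize(predicted) == _normalize(gold) else 0.0


def code_reward(solution: str, test_code: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 when the solution plus its tests exit with status 0, else 0.0.

    WARNING: this runs untrusted code in a plain subprocess, which is not a
    sandbox. Use an isolated environment for real model outputs.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # treat a hang as a failure
    return 1.0 if result.returncode == 0 else 0.0
```

A failing assertion in `test_code` raises, the interpreter exits non-zero, and the reward is 0.0, so the reward signal needs no partial-credit logic.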

Agent Features

Memory

Persistent reasoning state S_t during decoding (keeps context across roles)

Long context windows (16k–32k tokens) for extended reasoning

Planning

Plan Agent for task decomposition

Thinking Agent coordinates role activation

Reflection and Verification agents for self-critique

Tool Use

Search Agent (SerpApi)

Crawl Page Agent (Jina + page summarization)

Code Generate / Execute Agent (nsjail sandbox)

Frameworks

Multi-agent distillation (teacher: OAgents)

Agentic RL using DAPO and VeRL

Is Agentic

Yes

Architectures

Chain-of-Agents (single-model multi-role decoding)

Role-based activation inside one decoder
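One way to picture single-model multi-role decoding is a loop in which the model keeps extending one shared context, and the runtime steps in only to execute tool calls. The sketch below is an illustration under stated assumptions: `generate` stands in for any text-completion callable, and the `<search>`, `<code>`, `<observation>`, and `<answer>` tags are hypothetical, not the paper's actual format.

```python
import re
from typing import Callable, Dict, Optional

TOOL_CALL = re.compile(r"<(search|code)>(.*?)</\1>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_chain(
    task: str,
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 8,
) -> Optional[str]:
    """Drive one model through role and tool turns until it emits an answer."""
    context = task
    for _ in range(max_turns):
        chunk = generate(context)  # the model continues the shared context
        context += "\n" + chunk
        answer = ANSWER.search(chunk)
        if answer:
            return answer.group(1).strip()
        call = TOOL_CALL.search(chunk)
        if call:
            name, argument = call.group(1), call.group(2).strip()
            tool = tools.get(name)
            # Feed the tool result (or an error) back as an observation.
            observation = tool(argument) if tool else f"error: unknown tool {name}"
            context += f"\n<observation>{observation}</observation>"
    return None  # turn budget exhausted without an answer
```

Role turns such as planning or reflection need no runtime support in this picture: they are ordinary generated text, which is why the communication between roles costs no extra model calls.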

Collaboration

Dynamic activation of role-playing agents inside single model

Distilled multi-agent activation sequences (agent-level trajectories)

Optimization Features

Token Efficiency

Reported 84.6% reduction in token consumption vs multi-agent systems
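To check a reduction figure like this on your own workload, token and tool-call costs can be normalized per successful task and compared before and after. The sketch below assumes a hypothetical `RunLog` record per task; the field names and helpers are illustrative, not part of the paper's tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RunLog:
    """Cost record for one task attempt (hypothetical schema)."""
    tokens: int       # prompt + completion tokens consumed
    tool_calls: int   # number of tool invocations
    success: bool     # whether the task was solved


def per_success(runs: List[RunLog]) -> Tuple[float, float]:
    """Return (tokens per success, tool calls per success) over all runs."""
    successes = sum(r.success for r in runs)
    if successes == 0:
        raise ValueError("no successful runs; per-success cost is undefined")
    total_tokens = sum(r.tokens for r in runs)
    total_calls = sum(r.tool_calls for r in runs)
    return total_tokens / successes, total_calls / successes


def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before
```

Dividing by successes rather than by attempts charges failed runs to the successful ones, so a system that is cheap but rarely succeeds does not look efficient. As a purely illustrative calculation, a drop from 1,000 to 154 tokens per success would be an 84.6% reduction.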

Model Optimization

Sequence-level agent distillation (transfer of agent activation sequences)

System Optimization

SFT

Context length management (16k→32k schedule)

Training Optimization

SFT

DAPO policy optimization for RL stage
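Two ideas from the public description of DAPO are easy to show in isolation: group-normalized advantages and a dynamic-sampling filter that drops prompts whose sampled rewards are all identical. The sketch below is based on that general description, not on this paper's training code; the function names and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards: List[float]) -> bool:
    """Dynamic-sampling filter: a group with identical rewards carries no signal."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    group = [1.0, 0.0, 1.0, 0.0]  # binary success rewards for four samples
    if keep_group(group):
        print(group_advantages(group))
```

With binary rewards, a group where every sample succeeds or every sample fails yields zero advantage for all samples, which is why such groups are filtered out rather than spent on a gradient step.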

Inference Optimization

Test-time scaling (best-of-N and Pass@K selection strategies)

Fewer tool calls by modeling intra-agent communication inside model
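The sampling-based metrics and selection strategies named here can be sketched in a few lines. This summary does not state how the paper computes Pass@K, so the block below swaps in the widely used unbiased estimator from code-generation evaluation; `avg_at_n` assumes avg@16 means mean accuracy over 16 samples, and `best_of_n` takes a hypothetical `score` callable.

```python
from math import comb
from typing import Callable, List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 minus the probability that a random size-k subset has no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def avg_at_n(correct_flags: List[bool]) -> float:
    """Mean accuracy over n independent samples (avg@16 uses 16 flags)."""
    return sum(correct_flags) / len(correct_flags)


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score under a verifier or reward model."""
    return max(candidates, key=score)
```

Pass@K credits a problem if any of K samples is correct, while best-of-N has to commit to one answer, so best-of-N accuracy is bounded above by Pass@N and depends on the quality of the scorer.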

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Risks & Boundaries

Limitations

Tool-format sensitivity: models trained with strict code-format constraints generalize poorly to different formatting requirements (Section 5.2).

RL and distillation require substantial compute and curated high-quality trajectories; dataset curation is non-trivial.

When Not To Use

When you cannot collect high-quality multi-agent trajectories for distillation.

When strict per-tool formatting is unknown or highly variable and you cannot retrain for that format.

Failure Modes

Format errors at tool invocation (missing backticks, bad JSON) cause parser errors and task abortion (Section 5.2).

Overfitting to distilled agent behaviors that rely on specific external tool implementations.
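Since a malformed tool call can abort a whole task, one mitigation is a defensive parser that turns format errors into a message the model can react to. The sketch below is an assumption-laden illustration: the JSON shape with `tool` and `arguments` keys and the `parse_tool_call` helper are hypothetical, not the format used in the paper.

```python
import json
from typing import Any, Dict, Optional, Tuple

REQUIRED_KEYS = {"tool", "arguments"}
FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON


def parse_tool_call(raw: str) -> Tuple[Optional[Dict[str, Any]], Optional[str]]:
    """Return (call, None) on success or (None, error message) on failure."""
    text = raw.strip()
    # Tolerate an optional fence, including one whose closing half is missing.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):].strip()
    try:
        call = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    missing = REQUIRED_KEYS - call.keys()
    if missing:
        return None, f"missing keys: {sorted(missing)}"
    return call, None
```

Returning the error as data lets the calling loop append it to the context as an observation and give the model another turn, instead of raising and ending the episode.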

Core Entities

Models

Agent Foundation Model (AFM), SFT, AFM-RL, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-Coder-7B-Instruct, Qwen2.5-Coder-32B-Instruct

Metrics

Pass@1, avg@16, Accuracy, Token consumption per success, Tool calls per success

Datasets

GAIA, BrowseComp, HLE, WebWalker, NQ, HotpotQA, TriviaQA, PopQA, 2Wiki, Musique, LiveCodeBench v4-v5, CodeContests, AIME24, AIME25, MATH500, AMC23, OlympiadBench

Benchmarks

GAIA, BrowseComp, HLE, MHQA (multi-hop QA set), LiveCodeBench, CodeContests, AIME25

Context Entities

Models

DeepSeek-R1, WebSailor, WebShaper, ReTool, SimpleTIR, Reveal, ZeroSearch

Metrics

Pass@1, EM / F1 (not used directly for open-ended reward)

Datasets

NQ, HotpotQA, TriviaQA, PopQA

Benchmarks

GAIA, BrowseComp, HLE
