Train one model to act like many agents: Chain-of-Agents (CoA) and Agent Foundation Models (AFM)

August 6, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.8

Citation Count

0

Authors

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou

Why It Matters For Business

CoA shows that multi-agent workflows can be captured inside a single model, which reduces token and tool-call costs and improves task success on web search, coding, and math problems. This lowers API and inference bills and simplifies engineering, since there are fewer moving parts.

Summary TLDR

This paper introduces Chain-of-Agents (CoA), a way to train a single LLM to simulate multi-agent workflows end-to-end. The authors distill trajectories from multi-agent systems into supervised fine-tuning data, then improve the model with agentic reinforcement learning. The resulting Agent Foundation Models (AFMs) reach new state-of-the-art results on many web, code, and math benchmarks (for example, GAIA 55.3% Pass@1, LiveCodeBench v5 47.9% Pass@1, AIME25 59.8% avg@16) while using a reported 84.6% fewer tokens than traditional multi-agent frameworks. The paper reports that all code, weights, and data are open-sourced.

Problem Statement

Existing multi-agent systems work well but rely on manual workflow and prompt engineering, create heavy communication/token costs, and can’t be trained end-to-end. The paper asks: can one model be trained to natively emulate multi-agent collaboration (tools + roles) and be improved by data-driven training and RL?

Main Contribution

Chain-of-Agents (CoA): a modelling paradigm that lets a single LLM dynamically activate role-playing and tool agents to simulate multi-agent collaboration inside one decoding process.

Multi-agent distillation: a pipeline that records trajectories of strong multi-agent systems (e.g., OAgents) and converts them into CoA-format supervised fine-tuning data.

Agentic RL: a reinforcement learning stage (DAPO / VeRL) with task-level reward design to further optimize tool coordination and long-horizon success.

Agent Foundation Models (AFMs): trained models (SFT-only and SFT+RL) that set new state-of-the-art on many web, code, and math agent benchmarks. The paper open-sources code, weights and data.
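The multi-agent distillation step listed above can be pictured as flattening each recorded trajectory into one training sequence. The following is a minimal sketch, not the paper's pipeline: the `AgentStep` record, the XML-like role tags, the `<answer>` tag, and the success filter are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentStep:
    role: str     # hypothetical role name, e.g. "plan", "search", "code"
    content: str  # text the agent produced, or the tool observation


def to_coa_example(task: str, steps: list[AgentStep], answer: str) -> dict:
    """Flatten one multi-agent trajectory into a single prompt/completion pair."""
    # Assumed format: each step is wrapped in a tag named after its role.
    parts = [f"<{s.role}>{s.content}</{s.role}>" for s in steps]
    parts.append(f"<answer>{answer}</answer>")
    return {"prompt": task, "completion": "\n".join(parts)}


def distill(trajectories: list[dict]) -> list[dict]:
    """Convert recorded trajectories to SFT examples, keeping only successful ones."""
    return [
        to_coa_example(t["task"], t["steps"], t["answer"])
        for t in trajectories
        if t["success"]  # assumption: failed teacher runs are dropped
    ]
```

Each resulting dictionary is one supervised example, so the student model learns to emit the whole role sequence in a single decoding pass.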

Key Findings

AFM achieves new state-of-the-art on web agent benchmarks using a 32B backbone.

Numbers: GAIA Pass@1 = 55.3% (Qwen2.5-32B-Instruct, Table 7)

Agent foundation models improve code and math contest performance after RL.

Numbers: LiveCodeBench v5 Pass@1 = 47.9%; AIME25 avg@16 = 59.8% (AFM-RL, 32B; Tables 12 & 11)

AFM cuts inference token consumption vs. traditional multi-agent systems.

Numbers: Token consumption reduced by 84.6% (reported in text)

Results

GAIA Pass@1 (web agent)

Value: 55.3%

Baseline: WebSailor (same size) 53.2% / WebDancer 51.5%

LiveCodeBench v5 Pass@1 (code agent)

Value: 47.9%

Baseline: ReTool / Reveal reported lower for same-size baselines

AIME25 avg@16 (math)

Value: 59.8%

Baseline: Previous best TIR methods (ReTool/SimpleTIR) ~49.3-50.0%

Token consumption reduction

Value: 84.6% lower tokens

Baseline: Traditional multi-agent systems (measured on GAIA sample)
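To put the token figure in more familiar terms, the short calculation below converts the reported 84.6% reduction into a remaining share and a multiple. Only the 84.6% value comes from the summary; the variable names are illustrative.

```python
reduction = 0.846  # reported reduction in token consumption

remaining_fraction = 1.0 - reduction          # share of baseline tokens still used
baseline_multiple = 1.0 / remaining_fraction  # how many times more tokens the baseline uses

print(f"AFM uses {remaining_fraction:.1%} of the baseline tokens")
print(f"The baseline uses about {baseline_multiple:.1f}x as many tokens")
```

In other words, the reported figure corresponds to roughly 15% of the baseline's tokens, or a baseline that spends about 6.5 times as many.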

Who Should Care

What To Try In 7 Days

Run a quick distillation experiment: record trajectories from an existing multi-agent pipeline (10-100 tasks) and fine-tune your backbone on those trajectories.

Evaluate token consumption and tool-call count before and after distillation to measure cost savings.

If you have verifiable tasks (code/tests or math), add a small RL loop with binary success rewards to see short-term gains.
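For the RL experiment suggested above, a binary success reward can be very small. The sketch below is one possible shape under stated assumptions: the function names, the answer normalization, and the test-case format are hypothetical, and the in-process `exec` is for illustration only (real runs should execute generated code in a sandbox).

```python
from __future__ import annotations


def math_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answers match after light normalization, else 0.0."""
    def normalize(s: str) -> str:
        return s.strip().rstrip(".").replace(" ", "").lower()

    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def code_reward(solution_src: str, tests: list[tuple[tuple, object]], func_name: str) -> float:
    """Return 1.0 only if the solution passes every (args, expected) test case."""
    namespace: dict = {}
    try:
        # WARNING: exec runs untrusted code in this process; use a sandbox in practice.
        exec(solution_src, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
    except Exception:
        return 0.0  # any crash or missing function counts as failure
    return 1.0 if passed else 0.0
```

A reward of this kind is all-or-nothing per task, which keeps it easy to verify but gives no partial credit for near misses.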

Agent Features

Memory

  • Persistent reasoning state S_t during decoding (keeps context across roles)
  • Long context windows (16k–32k tokens) for extended reasoning

Planning

  • Plan Agent for task decomposition
  • Thinking Agent coordinates role activation
  • Reflection and Verification agents for self-critique

Tool Use

  • Search Agent (SerpApi)
  • Crawl Page Agent (Jina + page summarization)
  • Code Generate / Execute Agent (nsjail sandbox)
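One way to picture how tool agents run inside a single decoding process is a loop that pauses generation when the model emits a tool tag, runs the tool, and appends the observation to the context. This is a sketch under assumptions, not the paper's implementation: the tag names, the `<observation>` wrapper, and the `generate` and `tools` callables are all hypothetical stand-ins.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical tag format: the model writes <search>query</search> to call a tool.
TOOL_CALL = re.compile(r"<(search|crawl|code)>(.*?)</\1>", re.DOTALL)


def run_chain(
    generate: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    task: str,
    max_turns: int = 8,
) -> str:
    """Alternate between model segments and tool observations in one growing context."""
    context = task
    for _ in range(max_turns):
        segment = generate(context)
        context += "\n" + segment
        match = TOOL_CALL.search(segment)
        if match is None:
            return context  # no tool requested: treat this segment as final
        name, argument = match.groups()
        observation = tools[name](argument)
        context += f"\n<observation>{observation}</observation>"
    return context  # turn budget exhausted
```

Because every role and tool result lives in the same context, no messages need to be passed between separate agent processes.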

Frameworks

  • Multi-agent distillation (teacher: OAgents)
  • Agentic RL using DAPO and VeRL

Is Agentic

true

Architectures

  • Chain-of-Agents (single-model multi-role decoding)
  • Role-based activation inside one decoder

Collaboration

  • Dynamic activation of role-playing agents inside single model
  • Distilled multi-agent activation sequences (agent-level trajectories)

Optimization Features

Token Efficiency

  • Reported 84.6% reduction in token consumption vs multi-agent systems

Model Optimization

  • Sequence-level agent distillation (transfer of agent activation sequences)

System Optimization

  • SFT
  • Context length management (16k→32k schedule)

Training Optimization

  • SFT
  • DAPO policy optimization for RL stage

Inference Optimization

  • Test-time scaling (best-of-N and Pass@K selection strategies)
  • Fewer tool calls by modeling inter-agent communication inside the model
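The test-time scaling metrics mentioned above can be computed from per-sample correctness flags. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); whether the paper computes Pass@K exactly this way is an assumption, as is reading avg@16 as mean accuracy over 16 samples. The `score` callable in `best_of_n` is a hypothetical verifier or reward model.

```python
from __future__ import annotations

from math import comb
from typing import Callable


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def avg_at_n(correct_flags: list[bool]) -> float:
    """Mean accuracy over repeated samples of the same problem."""
    return sum(correct_flags) / len(correct_flags)


def best_of_n(candidates: list[str], score: Callable[[str], float]) -> str:
    """Pick the candidate with the highest score from a verifier or reward model."""
    return max(candidates, key=score)
```

Note that all three require generating several samples per task, so any accuracy gain comes with a proportional increase in inference cost.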

Reproducibility

Code Available

  • yes (reported as open-sourced in the paper)

Data Available

  • yes (reported as open-sourced in the paper)

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Tool-format sensitivity: models trained with strict code-format constraints generalize poorly to different formatting requirements (Section 5.2).
  • RL and distillation require substantial compute and curated high-quality trajectories; dataset curation is non-trivial.
  • Reported token-efficiency numbers are based on small GAIA efficiency trials (10 samples cited) and may vary by domain.
  • Pass@K / test-time scaling improves results but increases inference cost; trade-offs must be measured.

When Not To Use

  • When you cannot collect high-quality multi-agent trajectories for distillation.
  • When strict per-tool formatting is unknown or highly variable and you cannot retrain for that format.
  • When you need ultra-low-latency single-shot inference with no room for model-side orchestration.

Failure Modes

  • Format errors at tool invocation (missing backticks, bad JSON) cause parser errors and task abortion (Section 5.2).
  • Overfitting to distilled agent behaviors that rely on specific external tool implementations.
  • Reward-design brittleness in RL stage if the judge model is biased or miscalibrated.
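Format errors at tool invocation are cheap to catch before a call is dispatched. The following is a minimal sketch of such a check, assuming tool calls are JSON objects with `tool` and `arguments` keys; that schema is hypothetical and not taken from the paper.

```python
from __future__ import annotations

import json

REQUIRED_KEYS = {"tool", "arguments"}  # hypothetical schema for a tool call


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Check that a model-emitted tool call parses and has the required keys."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as err:
        return False, f"invalid JSON: {err.msg}"
    if not isinstance(payload, dict):
        return False, "tool call must be a JSON object"
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

Returning a reason string rather than raising lets the caller feed the error back to the model for a retry instead of aborting the task.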

Core Entities

Models

  • Agent Foundation Model (AFM)
  • AFM-SFT
  • AFM-RL
  • Qwen2.5-3B-Instruct
  • Qwen2.5-7B-Instruct
  • Qwen2.5-32B-Instruct
  • Qwen2.5-Coder-7B-Instruct
  • Qwen2.5-Coder-32B-Instruct

Metrics

  • Pass@1
  • avg@16
  • Accuracy
  • Token consumption per success
  • Tool calls per success
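The two per-success metrics above can be computed from simple run logs. This sketch assumes one particular definition, total cost across all runs divided by the number of successful runs, which may differ from how the paper measures them; the `RunLog` record is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunLog:
    success: bool    # did the run solve the task?
    tokens: int      # tokens consumed by the run
    tool_calls: int  # tool invocations made by the run


def cost_per_success(runs: list[RunLog]) -> dict:
    """Total tokens and tool calls over all runs, divided by the number of successes."""
    successes = sum(r.success for r in runs)
    if successes == 0:
        # No successes: cost per success is unbounded.
        return {"tokens_per_success": float("inf"), "tool_calls_per_success": float("inf")}
    return {
        "tokens_per_success": sum(r.tokens for r in runs) / successes,
        "tool_calls_per_success": sum(r.tool_calls for r in runs) / successes,
    }
```

Counting failed runs in the numerator penalizes systems that burn tokens on tasks they never solve, which is usually what a cost comparison should capture.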

Datasets

  • GAIA
  • BrowseComp
  • HLE
  • WebWalker
  • NQ
  • HotpotQA
  • TriviaQA
  • PopQA
  • 2Wiki
  • MuSiQue
  • LiveCodeBench v4-v5
  • CodeContests
  • AIME24
  • AIME25
  • MATH500
  • AMC23
  • OlympiadBench

Benchmarks

  • GAIA
  • BrowseComp
  • HLE
  • MHQA (multi-hop QA set)
  • LiveCodeBench
  • CodeContests
  • AIME25

Context Entities

Models

  • DeepSeek-R1
  • WebSailor
  • WebShaper
  • ReTool
  • SimpleTIR
  • Reveal
  • ZeroSearch

Metrics

  • Pass@1
  • EM / F1 (not used directly for open-ended reward)

Datasets

  • NQ
  • HotpotQA
  • TriviaQA
  • PopQA

Benchmarks

  • GAIA
  • BrowseComp
  • HLE