Overview
Production Readiness
0.6
Novelty Score
0.55
Cost Impact Score
0.4
Citation Count
3
Why It Matters For Business
IoA lets you combine existing specialized agents into coordinated teams to raise task success without re-training models; expect better QA and tool use at the cost of coordination tokens and some extra infra.
Summary TLDR
IoA is a software framework that treats autonomous agents like users in an instant-messaging system: agents register, discover peers, form nested teams, follow a finite-state conversation flow, and assign tasks. Across four domains (tool use, heterogeneous architectures, embodied agents, and retrieval-augmented QA) IoA often beats single-agent baselines and some multi-agent systems. Key trade-offs: improved task success and flexibility at the cost of message overhead and extra coordination tokens. Code is public.
Problem Statement
Existing multi-agent frameworks are limited by ecosystem isolation (hard to plug in third‑party agents), single-device simulation, and rigid, hard-coded communication. The paper asks: can we build a scalable, Internet-like platform that lets diverse agents discover each other, form dynamic teams, and coordinate via flexible conversation states?
Main Contribution
An agent-integration protocol and client/server design that lets third-party agents register and communicate over the network.
An instant-messaging-style architecture with group chats, nested subgroups, and team-formation tooling.
A finite-state conversation flow (discussion, sync/async assignment, pause & trigger, conclusion) driven by LLM decisions.
Demonstrations across GAIA (tools), an open-ended instruction set (heterogeneous architectures), RoCoBench (embodied tasks), and RAG QA, with wins over several baselines.
Public code release: https://github.com/OpenBMB/IoA
Key Findings
IoA substantially improves open-ended instruction wins when it orchestrates third-party agents.
IoA matches or exceeds single-model RAG baselines even when built on GPT-3.5.
On GAIA (tool-heavy benchmark) IoA gives the top overall validation score using four ReAct agents.
IoA achieves strong embodied-agent performance and often outperforms a domain-specific baseline.
Autonomous team formation has measurable precision but is imperfect.
Communication increases costs; removing repeated messages roughly halves token costs in the reported experiments.
Results
Open-ended instruction win rate vs AutoGPT
Open-ended instruction win rate vs Open Interpreter
Accuracy
Accuracy
RoCoBench success rates
GAIA validation overall
Team formation precision (regular)
Team formation precision (nested)
Cost per task (open-ended instruction benchmark)
Who Should Care
What To Try In 7 Days
Wrap two complementary agents with IoA's client API and run a few tasks to compare combined output vs running them separately.
Enable message deduplication and limit group-chat turns to cut token bills; measure cost delta.
Use IoA for a retrieval-augmented QA pipeline: assign separate retrievers to agents and compare combined accuracy to a single stronger model.
Agent Features
Memory
- local Group Info and Task Management modules (SQLite)
- session state via server registry
Planning
- nested team planning (hierarchical subgroups)
- task decomposition via LLM prompts
Tool Use
- browser, code interpreter, Wikidata search, YouTube transcript tool
- retrieval tools (Pyserini, Google Search API)
Frameworks
- client/server architecture with WebSocket
- Agent Registry + Milvus similarity search
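The registry-plus-similarity-search idea can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not IoA's implementation: the `AgentRegistry` class, the `embed` function, and the bag-of-words "embedding" are hypothetical, and a plain Python dictionary replaces Milvus so the example stays self-contained.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption); a real registry would use a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class AgentRegistry:
    """Hypothetical in-memory registry: agents register a description, peers search by query."""

    def __init__(self) -> None:
        self._agents: Dict[str, Counter] = {}

    def register(self, name: str, description: str) -> None:
        self._agents[name] = embed(description)

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        q = embed(query)
        scored = [(name, cosine(q, vec)) for name, vec in self._agents.items()]
        # Highest similarity first; keep only the top_k candidates.
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


registry = AgentRegistry()
registry.register("coder", "writes and runs python code")
registry.register("searcher", "searches the web for facts")
best_match = registry.search("run python code", top_k=1)[0][0]
```

Because matching relies only on description similarity, a sketch like this also shows how a semantically close but functionally inadequate agent could rank first, which is the imperfect-matching failure mode listed under Risks & Boundaries.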
Is Agentic
true
Architectures
- LLM-based agents (GPT-3.5/GPT-4 wrappers)
- third-party agents (AutoGPT, Open Interpreter)
- tool-augmented ReAct agents
Collaboration
- agent discovery & search
- group chats with sequential speaking
- finite-state conversation control (discussion/sync/async/pause/conclusion)
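The finite-state conversation control can be sketched as a small state machine. The five state names follow the summary above; the `ALLOWED` transition table, the fallback-to-discussion rule, and the `scripted_policy` function (a stand-in for the LLM's decision) are assumptions made for illustration and may differ from the paper's actual flow.

```python
from enum import Enum
from typing import Callable, Dict, List, Set


class State(Enum):
    DISCUSSION = "discussion"
    SYNC_ASSIGNMENT = "sync_assignment"
    ASYNC_ASSIGNMENT = "async_assignment"
    PAUSE_AND_TRIGGER = "pause_and_trigger"
    CONCLUSION = "conclusion"


# Assumed transition table (illustrative only, not taken from the paper).
ALLOWED: Dict[State, Set[State]] = {
    State.DISCUSSION: set(State),
    State.SYNC_ASSIGNMENT: {State.DISCUSSION, State.CONCLUSION},
    State.ASYNC_ASSIGNMENT: {State.DISCUSSION, State.PAUSE_AND_TRIGGER, State.CONCLUSION},
    State.PAUSE_AND_TRIGGER: {State.DISCUSSION, State.CONCLUSION},
    State.CONCLUSION: set(),
}


def run_conversation(decide: Callable[[State, int], State], max_turns: int = 10) -> List[State]:
    """Drive the conversation until it concludes or the turn budget runs out."""
    state = State.DISCUSSION
    trace = [state]
    for turn in range(max_turns):
        if state is State.CONCLUSION:
            break
        proposed = decide(state, turn)
        # Reject transitions the table does not allow and fall back to discussion.
        state = proposed if proposed in ALLOWED[state] else State.DISCUSSION
        trace.append(state)
    return trace


def scripted_policy(state: State, turn: int) -> State:
    """Stand-in for the LLM decision: assign asynchronously, wait, then conclude."""
    script = [State.ASYNC_ASSIGNMENT, State.PAUSE_AND_TRIGGER, State.CONCLUSION]
    return script[turn] if turn < len(script) else State.CONCLUSION


trace = run_conversation(scripted_policy)
```

Capping `max_turns` is one simple way to bound the coordination-token cost of a group chat, since every extra turn adds messages.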
Optimization Features
Token Efficiency
- manual deduplication of messages reduces communication cost ~50% (reported)
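The paper reports manual deduplication; an automated filter along the same lines might look like the sketch below. The `deduplicate` helper, the whitespace and case normalization, and the 0.9 similarity threshold are all assumptions for illustration, and the reported ~50% saving should not be expected from this code as-is.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def deduplicate(messages: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a message only if it is not a near-duplicate of an earlier kept message."""
    kept: List[str] = []
    kept_normalized: List[str] = []
    for message in messages:
        candidate = normalize(message)
        is_repeat = any(
            SequenceMatcher(None, candidate, earlier).ratio() >= threshold
            for earlier in kept_normalized
        )
        if not is_repeat:
            kept.append(message)
            kept_normalized.append(candidate)
    return kept


chat = [
    "I will search the web.",
    "I will  search the WEB.",
    "Done: found 3 sources.",
]
pruned = deduplicate(chat)
```

Character-level similarity catches repeats and light rewording but not paraphrases, so the threshold would need tuning against real chat logs before relying on it for cost estimates.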
System Optimization
- nested team formation reduces full-group communication complexity
Inference Optimization
- task decomposition to reduce per-agent work (reduces some agent costs)
Reproducibility
Code Urls
- https://github.com/OpenBMB/IoA
Data Urls
- GAIA
- RoCoBench
- TriviaQA
- Natural Questions
- HotpotQA
- 2WikiMultiHopQA
- open-ended instruction benchmark (self-instruct seeds)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Communication overhead: IoA adds token/message cost (reported $0.53 per task) and can produce redundant chat content.
- Agent matching is imperfect: Top@1 recall is 41.4% in regular settings, so exact partner selection can fail.
- Some security and production concerns are not fully implemented; the Security Module is acknowledged but not enforced.
- Experiments sometimes use validation subsets or simulated setups (budget and simulation constraints noted).
When Not To Use
- If minimal latency and minimal message traffic are critical (e.g., hard real-time control).
- When you cannot adapt third-party agents to the required run(task_desc: str) interface.
- If the cost of coordination tokens exceeds the value gained and you cannot prune or reduce chat verbosity.
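Adapting a third-party agent to the `run(task_desc: str)` interface usually amounts to a thin wrapper. The sketch below is illustrative only: `LegacyAgent`, its `execute` method, and the `str` return type of `run` are hypothetical, since the summary specifies only the method name and its argument.

```python
from typing import Protocol


class IoAAgent(Protocol):
    """The interface named in the summary; the str return type is an assumption."""

    def run(self, task_desc: str) -> str: ...


class LegacyAgent:
    """Hypothetical third-party agent with a different entry point and result shape."""

    def execute(self, prompt: str) -> dict:
        return {"output": f"handled: {prompt}", "ok": True}


class LegacyAgentAdapter:
    """Wraps LegacyAgent so it can be called through run(task_desc)."""

    def __init__(self, agent: LegacyAgent) -> None:
        self._agent = agent

    def run(self, task_desc: str) -> str:
        result = self._agent.execute(task_desc)
        # Flatten the agent's structured result into the plain text a peer would read.
        return str(result["output"])


def dispatch(agent: IoAAgent, task_desc: str) -> str:
    """Anything exposing run(task_desc) can be dispatched the same way."""
    return agent.run(task_desc)


answer = dispatch(LegacyAgentAdapter(LegacyAgent()), "summarize the report")
```

If an agent cannot be reduced to a single text-in, text-out call like this (for example, it needs an interactive session), the wrapper approach may not be enough.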
Failure Modes
- LLMs repeat or rephrase prior messages, causing stalled progress and higher token costs.
- Clients fail to switch to pause & trigger state, leading to missed synchronization points.
- Agent discovery returns semantically similar but functionally inadequate agents (imperfect matching).
- Security risks if untrusted third‑party agents join without stronger authentication.
Core Entities
Models
- GPT-4 (GPT-4-1106-preview used as judge)
- GPT-3.5-turbo-0125 (used as core LLM in some IoA configs)
- AutoGPT
- Open Interpreter
- ReAct agents
Metrics
- win rate (pairwise judged by GPT-4)
- success rate (RoCoBench)
- Accuracy
- Top@1/Top@10/MRR/MR (team formation)
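The team-formation metrics can be computed from the rank at which the correct agent appears in each search result. This is a minimal sketch that assumes 1-based ranks and reads MR as mean rank; the `retrieval_metrics` helper and the sample ranks are made up for illustration and are not data from the paper.

```python
from typing import Dict, Sequence


def retrieval_metrics(ranks: Sequence[int], ks: Sequence[int] = (1, 10)) -> Dict[str, float]:
    """Compute Top@k, MRR, and mean rank from 1-based ranks of the correct agent."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    n = len(ranks)
    # Top@k: fraction of queries whose correct agent appears within the first k results.
    metrics = {f"Top@{k}": sum(rank <= k for rank in ranks) / n for k in ks}
    # MRR: average of reciprocal ranks; higher is better.
    metrics["MRR"] = sum(1.0 / rank for rank in ranks) / n
    # MR: average rank (assumed meaning); lower is better.
    metrics["MR"] = sum(ranks) / n
    return metrics


# Illustrative ranks only.
scores = retrieval_metrics([1, 2, 4, 1])
```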
Datasets
- GAIA
- RoCoBench
- TriviaQA
- Natural Questions (NQ)
- HotpotQA
- 2WikiMultiHopQA
- open-ended instruction benchmark (self-instruct, 153 tasks)
Benchmarks
- GAIA
- RoCoBench
- Open-ended instruction benchmark (153 tasks)
- RAG QA (4 datasets)

