An Internet-like platform that links diverse LLM agents into dynamic teams and chat groups

Overview

Decision Snapshot: Needs Validation

The system is a complete prototype with public code and multiple benchmark results. It works well in experiments, but it adds coordination costs and needs prompt and protocol tuning before it can be deployed efficiently.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 55%

Authors

Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun

Why It Matters For Business

IoA lets you combine existing specialized agents into coordinated teams to raise task success without re-training models; expect better QA and tool use at the cost of coordination tokens and some extra infra.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

IoA is a software framework that treats autonomous agents like users in an instant-messaging system: agents register, discover peers, form nested teams, follow a finite-state conversation flow, and assign tasks. Across four domains (tool use, heterogeneous architectures, embodied agents, and retrieval-augmented QA) IoA often beats single-agent baselines and some multi-agent systems. Key trade-offs: improved task success and flexibility at the cost of message overhead and extra coordination tokens. Code is public.

Problem Statement

Existing multi-agent frameworks are limited by ecosystem isolation (hard to plug in third‑party agents), single-device simulation, and rigid, hard-coded communication. The paper asks: can we build a scalable, Internet-like platform that lets diverse agents discover each other, form dynamic teams, and coordinate via flexible conversation states?

Main Contribution

An agent-integration protocol and client/server design that lets third-party agents register and communicate over the network.

An instant-messaging-style architecture with group chats, nested subgroups, and team-formation tooling.

Key Findings

IoA substantially improves open-ended instruction wins when it orchestrates third-party agents.

Numbers: Win rate vs AutoGPT: 76.5%; vs Open Interpreter: 63.4%

Practical Use: If you need a conductor that combines multiple specialized agents, IoA tends to produce better final answers than running each agent alone; integrate existing agents via the protocol and let IoA handle teaming.

Evidence Ref: Section 3.2 and Fig. 5

IoA matches or exceeds single-model RAG baselines even when built on GPT-3.5.

Numbers: IoA + 3 agents (homogeneous) overall: 0.671 vs GPT-4 overall: 0.611 (on four QA datasets)

Practical Use: For retrieval-augmented QA, assembling multiple retriever/agent pipelines in IoA can give accuracy similar to or better than a stronger single model, which is useful when you scale agents rather than upgrade base LLMs.

Evidence Ref: Section 3.4, Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Open-ended instruction win rate vs AutoGPT	76.5%	AutoGPT	—	153 open-ended instructions (Section 3.2)	IoA wins 76.5% when judged by GPT-4	Section 3.2
Open-ended instruction win rate vs Open Interpreter	63.4%	Open Interpreter	—	153 open-ended instructions (Section 3.2)	IoA wins 63.4% when judged by GPT-4	Section 3.2

What To Try In 7 Days

Wrap two complementary agents with IoA's client API and run a few tasks to compare combined output vs running them separately.

Enable message deduplication and limit group-chat turns to cut token bills; measure cost delta.

Use IoA for a retrieval-augmented QA pipeline: assign separate retrievers to agents and compare combined accuracy to a single stronger model.
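The second suggestion above (deduplicate messages and cap group-chat turns) can be prototyped outside IoA before touching its internals. The following is a minimal sketch under stated assumptions: `GroupChat`, `normalize`, and `max_turns` are hypothetical names that are not part of IoA's API, and the duplicate check is a simple normalized exact match rather than the manual deduplication the paper reports.

```python
from dataclasses import dataclass, field


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially reformatted repeats match."""
    return " ".join(text.lower().split())


@dataclass
class GroupChat:
    """Hypothetical chat buffer for experimentation; not IoA's actual class."""
    max_turns: int = 10
    messages: list[tuple[str, str]] = field(default_factory=list)
    _seen: set[str] = field(default_factory=set)

    def post(self, sender: str, text: str) -> bool:
        """Store a message and return True, or return False if it was dropped."""
        if len(self.messages) >= self.max_turns:
            return False  # turn budget exhausted
        key = normalize(text)
        if key in self._seen:
            return False  # normalized duplicate of an earlier message
        self._seen.add(key)
        self.messages.append((sender, text))
        return True
```

Counting the `False` returns over a run gives a rough estimate of how many messages, and therefore tokens, the two limits would have saved.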

Agent Features

Memory

local Group Info and Task Management modules (SQLite)

session state via server registry

Planning

nested team planning (hierarchical subgroups)

task decomposition via LLM prompts

Tool Use

browser, code interpreter, Wikidata search, YouTube transcript tool

retrieval tools (Pyserini, Google Search API)

Frameworks

client/server architecture with WebSocket

Agent Registry + Milvus similarity search
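To illustrate how a registry with similarity search can support agent discovery, here is a small sketch. It is an assumption-heavy stand-in: it uses a toy bag-of-words vector and in-memory cosine similarity instead of Milvus and a learned embedding model, and `AgentRegistry`, `register`, and `search` are hypothetical names rather than IoA's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AgentRegistry:
    """Hypothetical in-memory registry keyed by agent name."""

    def __init__(self) -> None:
        self._agents: dict[str, Counter] = {}

    def register(self, name: str, description: str) -> None:
        self._agents[name] = embed(description)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Return the names of the top_k agents most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._agents,
            key=lambda name: cosine(q, self._agents[name]),
            reverse=True,
        )
        return ranked[:top_k]
```

Swapping `embed` for a sentence encoder and the dictionary for a vector database would bring the sketch closer to the setup listed above, without changing the calling code.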

Is Agentic

Yes

Architectures

LLM-based agents (GPT-3.5/GPT-4 wrappers)

third-party agents (AutoGPT, Open Interpreter)

tool-augmented ReAct agents

Collaboration

agent discovery & search

group chats with sequential speaking

finite-state conversation control (discussion/sync/async/pause/conclusion)
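The finite-state conversation control listed above can be pictured as a small state machine. The sketch below uses the five state names from this summary; the transition table is an assumption made for illustration (conclusion is terminal, every other state may move to any state), and `Conversation` is a hypothetical class rather than IoA's implementation.

```python
from enum import Enum


class State(Enum):
    DISCUSSION = "discussion"
    SYNC = "sync"
    ASYNC = "async"
    PAUSE = "pause"
    CONCLUSION = "conclusion"


# Assumed transition table: conclusion is terminal, other states may go anywhere.
ALLOWED = {s: set(State) for s in State if s is not State.CONCLUSION}
ALLOWED[State.CONCLUSION] = set()


class Conversation:
    """Tracks the current state of one group chat and rejects illegal moves."""

    def __init__(self) -> None:
        self.state = State.DISCUSSION
        self.history = [self.state]

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Validating each LLM-proposed state against an explicit table like this is one way to catch the missed pause transitions described under Failure Modes.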

Optimization Features

Token Efficiency

manual deduplication of messages reduces communication cost by about 50% (reported)

System Optimization

nested team formation reduces full-group communication complexity
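A rough way to see why nesting helps is to count communication channels. The model below is an illustrative assumption, not the paper's analysis: a flat group is treated as all-pairs communication, and a nested layout as all-pairs within each subgroup plus all-pairs among one leader per subgroup.

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct pairs among n participants."""
    return n * (n - 1) // 2


def nested_channels(subgroup_sizes: list[int]) -> int:
    """Pairs inside each subgroup plus pairs among one leader per subgroup."""
    within = sum(pairwise_channels(size) for size in subgroup_sizes)
    between = pairwise_channels(len(subgroup_sizes))
    return within + between


flat = pairwise_channels(12)
nested = nested_channels([4, 4, 4])
print(flat, nested)  # prints "66 21"
```

Under this simplified model, twelve agents in one group have 66 possible pairs, while three subgroups of four have 21.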

Inference Optimization

task decomposition to reduce per-agent work (reduces some agent costs)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenBMB/IoA

Data URLs

GAIA

RoCoBench

TriviaQA

Natural Questions

HotpotQA

2WikiMultiHopQA

open-ended instruction benchmark (self-instruct seeds)

Risks & Boundaries

Limitations

Communication overhead: IoA adds token/message cost (reported $0.53 per task) and can produce redundant chat content.

Agent matching is imperfect: Top@1 recall is 41.4% in regular settings, so exact partner selection can fail.
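For readers who want to measure agent matching on their own agent pool, the ranking metrics mentioned here can be computed as follows. This is a sketch using the common definitions of Top@k recall and mean reciprocal rank; whether the paper computes them in exactly this way is an assumption, and the function names are illustrative.

```python
def top_k_recall(rankings: list[list[str]], targets: list[str], k: int) -> float:
    """Fraction of queries whose target agent appears in the first k results."""
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return hits / len(targets)


def mean_reciprocal_rank(rankings: list[list[str]], targets: list[str]) -> float:
    """Average of 1/rank of the target agent; a missing target contributes 0."""
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)
```

Each entry in `rankings` is the ordered list of agent names returned for one query, and the matching entry in `targets` is the agent that should have been selected.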

When Not To Use

If minimum latency and minimal message traffic are critical (for example, hard real-time control).

When you cannot adapt third-party agents to the required run(task_desc: str) interface.
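Adapting an existing agent to a `run(task_desc: str)` interface can be as light as a thin wrapper. The sketch below is hypothetical: only the method name and its `task_desc: str` parameter come from this summary, while the `str` return type, `RunnableAgent`, `CallableAdapter`, and `execute` are assumptions introduced for illustration.

```python
from typing import Callable, Protocol


class RunnableAgent(Protocol):
    """Structural type for anything exposing run(task_desc)."""

    def run(self, task_desc: str) -> str: ...


class CallableAdapter:
    """Wraps any function that maps a task description to a reply."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn

    def run(self, task_desc: str) -> str:
        return self._fn(task_desc)


def execute(agent: RunnableAgent, task_desc: str) -> str:
    """Dispatch a task to any object that satisfies the protocol."""
    return agent.run(task_desc)
```

If an agent cannot be reduced to a single call like this, for instance because it requires an interactive session, the caveat above applies.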

Failure Modes

LLMs repeat or rephrase prior messages, causing stalled progress and higher token costs.

Clients fail to switch to pause & trigger state, leading to missed synchronization points.

Core Entities

Models

GPT-4 (GPT-4-1106-preview used as judge)

GPT-3.5-turbo-0125 (used as core LLM in some IoA configs)

AutoGPT

Open Interpreter

ReAct agents

Metrics

win rate (pairwise judged by GPT-4)

success rate (RoCoBench)

Accuracy

Top@1/Top@10/MRR/MR (team formation)

Datasets

GAIA

RoCoBench

TriviaQA

Natural Questions (NQ)

HotpotQA

2WikiMultiHopQA

open-ended instruction benchmark (self-instruct, 153 tasks)

Benchmarks

GAIA

RoCoBench

Open-ended instruction benchmark (153 tasks)

RAG QA (4 datasets)
