An Internet-like platform that links diverse LLM agents into dynamic teams and chat groups

July 9, 2024 · 10 min

Overview

Decision Snapshot: Needs Validation

The system is a complete prototype with public code and multiple benchmark results. It works well in experiments but adds coordination costs and needs prompt and protocol tuning for efficient deployment.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/9

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 55%

Authors

Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

IoA lets you combine existing specialized agents into coordinated teams to raise task success without retraining models. Expect better QA and tool use at the cost of coordination tokens and some extra infrastructure.

Who Should Care

Summary TLDR

IoA is a software framework that treats autonomous agents like users in an instant-messaging system: agents register, discover peers, form nested teams, follow a finite-state conversation flow, and assign tasks. Across four domains (tool use, heterogeneous architectures, embodied agents, and retrieval-augmented QA) IoA often beats single-agent baselines and some multi-agent systems. Key trade-offs: improved task success and flexibility at the cost of message overhead and extra coordination tokens. Code is public.

Problem Statement

Existing multi-agent frameworks are limited by ecosystem isolation (hard to plug in third‑party agents), single-device simulation, and rigid, hard-coded communication. The paper asks: can we build a scalable, Internet-like platform that lets diverse agents discover each other, form dynamic teams, and coordinate via flexible conversation states?

Main Contribution

An agent-integration protocol and client/server design that lets third-party agents register and communicate over the network.

An instant-messaging-style architecture with group chats, nested subgroups, and team-formation tooling.
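The register-then-discover flow behind these two contributions can be sketched as follows. This is a toy illustration, not IoA's actual API: the `AgentRegistry` class, its method names, and the bag-of-words cosine similarity are all assumptions made for the example. The Frameworks entry further down reports that the real system pairs an Agent Registry with Milvus similarity search.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def _bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector (a toy stand-in for an embedding)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class AgentRegistry:
    """Hypothetical in-memory registry mapping agent name -> capability text."""
    _agents: dict[str, str] = field(default_factory=dict)

    def register(self, name: str, description: str) -> None:
        self._agents[name] = description

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Return the names of the top_k agents most similar to the query."""
        q = _bow(query)
        ranked = sorted(
            self._agents.items(),
            key=lambda item: _cosine(q, _bow(item[1])),
            reverse=True,
        )
        return [name for name, _ in ranked[:top_k]]


registry = AgentRegistry()
registry.register("coder", "writes and executes python code")
registry.register("browser", "searches the web and reads pages")
best = registry.search("execute python code", top_k=1)
```

Swapping `_bow` and `_cosine` for a real embedding model and a vector store would leave the `register`/`search` surface unchanged, which is the point of separating discovery from the agents themselves.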

Key Findings

On open-ended instructions, IoA orchestrating third-party agents beats those agents run individually in most head-to-head comparisons.

Numbers: Win rate vs AutoGPT: 76.5%; vs Open Interpreter: 63.4%

Practical Use: If you need a conductor that combines multiple specialized agents, IoA tends to produce better final answers than running each agent alone; integrate existing agents via the protocol and let IoA handle teaming.

Evidence Ref: Section 3.2 and Fig. 5

IoA matches or exceeds single-model RAG baselines even when built on GPT-3.5.

Numbers: IoA with 3 agents (homogeneous) overall: 0.671 vs GPT-4 overall: 0.611 (on four QA datasets)

Practical Use: For retrieval-augmented QA, assembling multiple retriever/agent pipelines in IoA can give accuracy similar to or better than a stronger single model. This is useful when you scale agents rather than upgrade base LLMs.

Evidence Ref: Section 3.4, Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Open-ended instruction win rate vs AutoGPT | 76.5% | AutoGPT | - | 153 open-ended instructions (Section 3.2) | IoA wins 76.5% when judged by GPT-4 | Section 3.2
Open-ended instruction win rate vs Open Interpreter | 63.4% | Open Interpreter | - | 153 open-ended instructions (Section 3.2) | IoA wins 63.4% when judged by GPT-4 | Section 3.2

What To Try In 7 Days

Wrap two complementary agents with IoA's client API and run a few tasks to compare combined output vs running them separately.

Enable message deduplication and limit group-chat turns to cut token bills; measure cost delta.

Use IoA for a retrieval-augmented QA pipeline: assign separate retrievers to agents and compare combined accuracy to a single stronger model.
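For the second item above, a rough way to prototype deduplication and a turn cap before wiring them into any framework is shown below. The function names and the normalization rule are assumptions for illustration and are independent of IoA's own code. The sketch only catches repeats that are identical after normalization, so rephrased messages would need a similarity check instead.

```python
import re


def normalize(message: str) -> str:
    """Collapse case, punctuation, and whitespace so trivial variants compare equal."""
    return " ".join(re.sub(r"[^\w\s]", "", message.lower()).split())


def prune_chat(messages: list[str], max_turns: int) -> list[str]:
    """Drop repeated messages, keeping at most `max_turns` distinct ones in order."""
    seen: set[str] = set()
    kept: list[str] = []
    for msg in messages:
        if len(kept) >= max_turns:
            break  # turn cap reached
        key = normalize(msg)
        if key in seen:
            continue  # repeat of an earlier message
        seen.add(key)
        kept.append(msg)
    return kept


chat = [
    "I will search Wikidata.",
    "i will search wikidata",
    "Found the entity ID.",
    "Found the entity ID.",
    "Writing the final answer.",
]
pruned = prune_chat(chat, max_turns=2)
```

Comparing token counts for `chat` and `pruned` on real transcripts gives a first estimate of the cost delta before changing any prompts.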

Agent Features

Memory
local Group Info and Task Management modules (SQLite); session state via server registry
Planning
nested team planning (hierarchical subgroups); task decomposition via LLM prompts
Tool Use
browser; code interpreter; Wikidata search; YouTube transcript tool; retrieval tools (Pyserini, Google Search API)
Frameworks
client/server architecture with WebSocket; Agent Registry + Milvus similarity search
Is Agentic

Yes

Architectures
LLM-based agents (GPT-3.5/GPT-4 wrappers); third-party agents (AutoGPT, Open Interpreter); tool-augmented ReAct agents
Collaboration
agent discovery & search; group chats with sequential speaking; finite-state conversation control (discussion/sync/async/pause/conclusion)
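
The finite-state conversation control listed under Collaboration could be modeled roughly as below. The five state names come from the list above; the transition table is an assumption made for illustration, since this summary does not say which moves the paper allows.

```python
from enum import Enum


class State(Enum):
    DISCUSSION = "discussion"
    SYNC = "sync"
    ASYNC = "async"
    PAUSE = "pause"
    CONCLUSION = "conclusion"


# Assumed transition table: the summary names the states but not the legal moves.
ALLOWED: dict[State, set[State]] = {
    State.DISCUSSION: {State.DISCUSSION, State.SYNC, State.ASYNC,
                       State.PAUSE, State.CONCLUSION},
    State.SYNC: {State.DISCUSSION},
    State.ASYNC: {State.DISCUSSION, State.PAUSE},
    State.PAUSE: {State.DISCUSSION},
    State.CONCLUSION: set(),  # terminal state
}


class GroupChat:
    """Tracks the current conversation state and rejects illegal moves."""

    def __init__(self) -> None:
        self.state = State.DISCUSSION
        self.history = [State.DISCUSSION]

    def transition(self, nxt: State) -> None:
        if nxt not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {nxt.value}"
            )
        self.state = nxt
        self.history.append(nxt)


group = GroupChat()
for step in (State.ASYNC, State.PAUSE, State.DISCUSSION, State.CONCLUSION):
    group.transition(step)
```

Making illegal moves raise an error surfaces the failure mode where a client never enters the pause state, rather than letting it pass silently.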

Optimization Features

Token Efficiency
manual deduplication of messages reduces communication cost by ~50% (reported)
System Optimization
nested team formation reduces full-group communication complexity
Inference Optimization
task decomposition to reduce per-agent work (reduces some agent costs)
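
One way to see why nested teams can cut communication is to count pairwise channels. This is a simplified model, not a figure from the paper: it assumes every pair of agents in a group may need to exchange messages, and that each subgroup sends one leader to a top-level group.

```python
def pairs(n: int) -> int:
    """Number of distinct agent pairs in a group of n agents."""
    return n * (n - 1) // 2


def flat_channels(n_agents: int) -> int:
    """All agents share one group chat."""
    return pairs(n_agents)


def nested_channels(team_sizes: list[int]) -> int:
    """Agents talk within their team; one leader per team joins a top-level group."""
    within_teams = sum(pairs(size) for size in team_sizes)
    between_leaders = pairs(len(team_sizes))
    return within_teams + between_leaders


flat = flat_channels(12)               # 12 agents in one chat
nested = nested_channels([4, 4, 4])    # three teams of four
```

Under these assumptions, 12 agents in one chat have 66 possible pairs, while three teams of four plus a three-leader group have 21.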

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

GAIA; RoCoBench; TriviaQA; Natural Questions; HotpotQA; 2WikiMultiHopQA; open-ended instruction benchmark (self-instruct seeds)

Risks & Boundaries

Limitations

Communication overhead: IoA adds token/message cost (reported $0.53 per task) and can produce redundant chat content.

Agent matching is imperfect: Top@1 recall is 41.4% in regular settings, so exact partner selection can fail.
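
For readers unfamiliar with the team-formation metrics, the sketch below shows how Top@k and MRR are conventionally computed from the rank at which the correct partner agent appears. The `ranks` list is made-up illustrative data, not results from the paper, and reading "MR" as mean rank is an assumption.

```python
def top_at_k(ranks: list[int], k: int) -> float:
    """Fraction of queries whose correct agent is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank over all queries (1.0 means always ranked first)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_rank(ranks: list[int]) -> float:
    """Average 1-based rank of the correct agent (lower is better)."""
    return sum(ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct agent for five queries.
ranks = [1, 3, 1, 12, 2]
top1 = top_at_k(ranks, 1)
top10 = top_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
mr = mean_rank(ranks)
```

A Top@1 of 41.4% means the correct agent is ranked first for roughly two in five queries, so retrieving several candidates and letting the LLM choose among them is the safer pattern.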

When Not To Use

When minimal latency and minimal message traffic are critical (e.g., hard real-time control).

When you cannot adapt third-party agents to the required run(task_desc: str) interface.
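
Adapting an agent to that interface can be as small as the wrapper below. Only the `run(task_desc: str)` signature comes from the text above; the `str` return type, the `RunnableAgent` protocol, the adapter class, and `legacy_summarizer` are hypothetical names introduced for illustration.

```python
from typing import Callable, Protocol


class RunnableAgent(Protocol):
    """Structural type for anything exposing run(task_desc: str)."""

    def run(self, task_desc: str) -> str: ...


class FunctionAgentAdapter:
    """Wraps any str -> str callable behind a run(task_desc: str) method."""

    def __init__(self, name: str, fn: Callable[[str], str]) -> None:
        self.name = name
        self._fn = fn

    def run(self, task_desc: str) -> str:
        if not task_desc.strip():
            raise ValueError("task_desc must be non-empty")
        return self._fn(task_desc)


def legacy_summarizer(prompt: str) -> str:
    """Stand-in for a third-party agent with its own entry point."""
    return f"summary of: {prompt}"


agent: RunnableAgent = FunctionAgentAdapter("summarizer", legacy_summarizer)
result = agent.run("Summarize the IoA paper")
```

If an agent cannot be reduced to a single text-in, text-out call like this (for example, because it needs an interactive session), that is a sign the integration will take more than a thin adapter.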

Failure Modes

LLMs repeat or rephrase prior messages, causing stalled progress and higher token costs.

Clients fail to switch to pause & trigger state, leading to missed synchronization points.

Core Entities

Models

GPT-4 (GPT-4-1106-preview used as judge); GPT-3.5-turbo-0125 (used as core LLM in some IoA configs); AutoGPT; Open Interpreter; ReAct agents

Metrics

win rate (pairwise, judged by GPT-4); success rate (RoCoBench); accuracy; Top@1/Top@10/MRR/MR (team formation)

Datasets

GAIA; RoCoBench; TriviaQA; Natural Questions (NQ); HotpotQA; 2WikiMultiHopQA; open-ended instruction benchmark (self-instruct, 153 tasks)

Benchmarks

GAIA; RoCoBench; open-ended instruction benchmark (153 tasks); RAG QA (4 datasets)