An Internet-like platform that links diverse LLM agents into dynamic teams and chat groups

July 9, 2024 · 10 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.4

Citation Count

3

Authors

Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

IoA lets you combine existing specialized agents into coordinated teams to raise task success without re-training models; expect better QA and tool use at the cost of coordination tokens and some extra infra.

Summary TLDR

IoA is a software framework that treats autonomous agents like users in an instant-messaging system: agents register, discover peers, form nested teams, follow a finite-state conversation flow, and assign tasks. Across four domains (tool use, heterogeneous architectures, embodied agents, and retrieval-augmented QA) IoA often beats single-agent baselines and some multi-agent systems. Key trade-offs: improved task success and flexibility at the cost of message overhead and extra coordination tokens. Code is public.

Problem Statement

Existing multi-agent frameworks are limited by ecosystem isolation (hard to plug in third‑party agents), single-device simulation, and rigid, hard-coded communication. The paper asks: can we build a scalable, Internet-like platform that lets diverse agents discover each other, form dynamic teams, and coordinate via flexible conversation states?

Main Contribution

An agent-integration protocol and client/server design that lets third-party agents register and communicate over the network.

An instant-messaging-style architecture with group chats, nested subgroups, and team-formation tooling.

A finite-state conversation flow (discussion, sync/async assignment, pause & trigger, conclusion) driven by LLM decisions.

Demonstrations across GAIA (tools), an open-ended instruction set (heterogeneous architectures), RoCoBench (embodied tasks), and RAG QA; shows wins over several baselines.

Public code release: https://github.com/OpenBMB/IoA
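The finite-state conversation flow listed above can be pictured as a small state machine. The following is a minimal sketch, not the IoA implementation: the state names follow the summary, but the transition table, the `decide` callback (a stand-in for the LLM's decision), and `run_conversation` are hypothetical names chosen for illustration.

```python
from enum import Enum


class State(Enum):
    DISCUSSION = "discussion"
    SYNC_ASSIGNMENT = "sync_assignment"
    ASYNC_ASSIGNMENT = "async_assignment"
    PAUSE_AND_TRIGGER = "pause_and_trigger"
    CONCLUSION = "conclusion"


# Assumed transition table (illustrative only, not taken from the paper).
ALLOWED = {
    State.DISCUSSION: set(State),
    State.SYNC_ASSIGNMENT: {State.DISCUSSION},
    State.ASYNC_ASSIGNMENT: {State.DISCUSSION, State.PAUSE_AND_TRIGGER},
    State.PAUSE_AND_TRIGGER: {State.DISCUSSION},
    State.CONCLUSION: set(),
}


def run_conversation(decide, max_turns=10):
    """Drive the chat until CONCLUSION or until max_turns is reached.

    `decide(state, history)` stands in for the LLM call that picks the
    next state; invalid proposals fall back to DISCUSSION.
    """
    state = State.DISCUSSION
    history = [state]
    for _ in range(max_turns):
        if state is State.CONCLUSION:
            break
        proposed = decide(state, history)
        if proposed not in ALLOWED[state]:
            proposed = State.DISCUSSION
        state = proposed
        history.append(state)
    return history


def scripted_decide(state, history):
    """Deterministic stand-in for the LLM so the sketch can be run offline."""
    script = [
        State.ASYNC_ASSIGNMENT,
        State.PAUSE_AND_TRIGGER,
        State.DISCUSSION,
        State.CONCLUSION,
    ]
    return script[min(len(history) - 1, len(script) - 1)]


if __name__ == "__main__":
    for step in run_conversation(scripted_decide):
        print(step.value)
```

Keeping the allowed transitions in a table, separate from the decision function, means a malformed LLM decision degrades to more discussion rather than an undefined state.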

Key Findings

IoA wins most pairwise comparisons on open-ended instructions when it orchestrates third-party agents.

Numbers: Win rate vs AutoGPT: 76.5%; vs Open Interpreter: 63.4%

IoA matches or exceeds single-model RAG baselines even when built on GPT-3.5.

Numbers: IoA + 3 agents (homogeneous) overall: 0.671 vs GPT-4 overall: 0.611 (on four QA datasets)

On GAIA (tool-heavy benchmark) IoA gives the top overall validation score using four ReAct agents.

Numbers: GAIA overall (validation) reported ~39.39–40.00 for IoA (highest in table)

IoA achieves strong embodied-agent performance and often outperforms a domain-specific baseline.

Numbers: RoCoBench: Cabinet=1.00 (4.6 steps), Sandwich=1.00 (8.9 steps), Sort=1.00 (5.8 steps); outperforms Roco Dialog on 4/5 tasks

Autonomous team formation has measurable precision but is imperfect.

Numbers: Team formation: Regular Top@1=41.4%, Top@10=64.9%; Nested Top@1=59.7%, Top@10=81.8%

Communication increases costs; removing repeated messages roughly halves the communication cost and cuts the overall per-task cost by about a quarter in experiments.

Numbers: IoA communication cost $0.53 per task (deduplicated $0.28); overall IoA cost $0.99 (deduplicated $0.74)
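The savings implied by these figures can be checked with a few lines of arithmetic. This sketch uses only the dollar amounts reported above; the function and variable names are illustrative.

```python
def relative_saving(before, after):
    """Fraction of cost removed when going from `before` to `after`."""
    return (before - after) / before


# Per-task costs in USD, as reported in the summary above.
comm_before, comm_after = 0.53, 0.28
total_before, total_after = 0.99, 0.74

comm_saving = relative_saving(comm_before, comm_after)
overall_saving = relative_saving(total_before, total_after)

# Cost that is not communication should be unchanged by deduplication.
other_before = round(total_before - comm_before, 2)
other_after = round(total_after - comm_after, 2)

print(f"communication saving: {comm_saving:.1%}")  # 47.2%
print(f"overall saving: {overall_saving:.1%}")     # 25.3%
print(f"non-communication cost: {other_before} vs {other_after}")
```

The non-communication cost comes out at $0.46 in both cases, so the two pairs of figures are consistent with each other: deduplication removes $0.25 of chat cost and nothing else.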

Results

Open-ended instruction win rate vs AutoGPT

Value: 76.5%

Baseline: AutoGPT

Open-ended instruction win rate vs Open Interpreter

Value: 63.4%

Baseline: Open Interpreter

Accuracy

Value: 0.671

Baseline: GPT-4 overall 0.611

Accuracy

Value: 0.61

Baseline: Apollo's Oracle (homogeneous) 0.597

RoCoBench success rates

Value: Cabinet 1.00, Sweep 0.80, Sandwich 1.00, Sort 1.00, Rope 0.70

Baseline: Roco Dialog / Central Plan (varies per task)

GAIA validation overall

Value: 39.39 / 40.00 (reported entries)

Baseline: AutoGen and other baselines

Team formation precision (regular)

Value: Top@1 41.4%, Top@10 64.9%, MR 27.4, MRR 50.1%

Baseline: simulated GPT-4 labels

Team formation precision (nested)

Value: Top@1 59.7%, Top@10 81.8%, MR 10.6, MRR 66.5%

Baseline: simulated GPT-4 labels
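For readers unfamiliar with the team-formation metrics, here is a minimal sketch of how Top@k, mean rank (MR), and mean reciprocal rank (MRR) are conventionally computed from the 1-based rank at which the correct agent appears in each search result. The ranks below are made-up toy data, not the paper's, and the paper's exact evaluation protocol may differ.

```python
def ranking_metrics(ranks, k=10):
    """Standard ranking metrics from 1-based ranks of the correct item."""
    n = len(ranks)
    return {
        "top@1": sum(r == 1 for r in ranks) / n,
        f"top@{k}": sum(r <= k for r in ranks) / n,
        "mr": sum(ranks) / n,                   # mean rank, lower is better
        "mrr": sum(1 / r for r in ranks) / n,   # mean reciprocal rank
    }


# Toy data: rank of the labelled agent in five hypothetical queries.
toy_ranks = [1, 2, 1, 20, 4]
metrics = ranking_metrics(toy_ranks)
print(metrics)
```

A single badly ranked query (rank 20 here) pulls MR up sharply while barely moving MRR, which is why a system can show a large MR alongside a moderate MRR.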

Cost per task (open-ended instruction benchmark)

Value: IoA overall $0.99; dedup $0.74

Baseline: AutoGPT standalone $0.39; Open Interpreter standalone $0.16

Who Should Care

What To Try In 7 Days

Wrap two complementary agents with IoA's client API and run a few tasks to compare combined output vs running them separately.

Enable message deduplication and limit group-chat turns to cut token bills; measure cost delta.

Use IoA for a retrieval-augmented QA pipeline: assign separate retrievers to agents and compare combined accuracy to a single stronger model.
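As a starting point for the deduplication experiment, the sketch below drops repeated chat messages and caps the history length before it is sent to a model. It is an assumption-laden illustration, not IoA code: the message shape (`{"sender", "content"}`), `normalize`, and `dedupe_messages` are hypothetical, and exact-match deduplication will not catch messages that merely rephrase earlier ones.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe_messages(messages, max_turns=None):
    """Keep the first occurrence of each message, then cap the history.

    `messages` is a list of dicts with a "content" key (assumed shape).
    `max_turns`, if given, keeps only the most recent N surviving messages.
    """
    seen = set()
    kept = []
    for msg in messages:
        key = normalize(msg["content"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(msg)
    if max_turns is not None:
        kept = kept[-max_turns:] if max_turns > 0 else []
    return kept


if __name__ == "__main__":
    chat = [
        {"sender": "a", "content": "I will search the web."},
        {"sender": "b", "content": "I will  search the web. "},
        {"sender": "a", "content": "Found three sources."},
    ]
    print(dedupe_messages(chat))
```

Comparing token counts of the history before and after this filter gives a first estimate of the cost delta.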

Agent Features

Memory

  • local Group Info and Task Management modules (SQLite)
  • session state via server registry

Planning

  • nested team planning (hierarchical subgroups)
  • task decomposition via LLM prompts

Tool Use

  • browser, code interpreter, Wikidata search, YouTube transcript tool
  • retrieval tools (Pyserini, Google Search API)

Frameworks

  • client/server architecture with WebSocket
  • Agent Registry + Milvus similarity search
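To make the registry-plus-similarity-search idea concrete, here is a small in-memory sketch. It deliberately does not use Milvus or a learned embedding model: `embed` is a toy bag-of-words stand-in, and `AgentRegistry`, `register`, and `search` are hypothetical names, so treat it as an illustration of the lookup pattern only.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a vector model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AgentRegistry:
    """In-memory stand-in for a registry backed by a vector database."""

    def __init__(self):
        self._agents = {}

    def register(self, name, description):
        self._agents[name] = embed(description)

    def search(self, query, top_k=3):
        """Return up to top_k agent names, most similar first."""
        q = embed(query)
        scored = sorted(
            ((cosine(q, vec), name) for name, vec in self._agents.items()),
            reverse=True,
        )
        return [name for score, name in scored[:top_k] if score > 0]


if __name__ == "__main__":
    registry = AgentRegistry()
    registry.register("coder", "writes and runs python code")
    registry.register("searcher", "searches the web for documents")
    print(registry.search("run python code"))
```

Because matching is purely on description similarity, an agent can rank highly without actually being able to do the task, which is the imperfect-matching risk the summary notes.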

Is Agentic

true

Architectures

  • LLM-based agents (GPT-3.5/GPT-4 wrappers)
  • third-party agents (AutoGPT, Open Interpreter)
  • tool-augmented ReAct agents

Collaboration

  • agent discovery & search
  • group chats with sequential speaking
  • finite-state conversation control (discussion/sync/async/pause/conclusion)

Optimization Features

Token Efficiency

  • manual deduplication of messages reduces communication cost ~50% (reported)

System Optimization

  • nested team formation reduces full-group communication complexity
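One way to see why nesting helps is a toy counting model, which is an assumption of this sketch and not a result from the paper: suppose every message in a group chat is delivered to every other member, and each member speaks once per round. A flat group of n agents then costs n(n-1) deliveries per round, while nested subgroups pay that cost only within each subgroup plus once among one representative per subgroup.

```python
def flat_deliveries(n):
    """Deliveries per round when all n agents share one group chat."""
    return n * (n - 1)


def nested_deliveries(group_sizes):
    """Deliveries per round with subgroups plus one leaders' group.

    Assumes one representative per subgroup joins a coordinating chat.
    """
    inside = sum(size * (size - 1) for size in group_sizes)
    leaders = len(group_sizes)
    return inside + leaders * (leaders - 1)


if __name__ == "__main__":
    print(flat_deliveries(12))           # 132
    print(nested_deliveries([4, 4, 4]))  # 42
```

Under this model, twelve agents in three subgroups of four need 42 deliveries per round instead of 132; the real saving depends on how much cross-group coordination a task requires.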

Inference Optimization

  • task decomposition to reduce per-agent work (reduces some agent costs)

Reproducibility

Data URLs

  • GAIA
  • RoCoBench
  • TriviaQA
  • Natural Questions
  • HotpotQA
  • 2WikiMultiHopQA
  • open-ended instruction benchmark (self-instruct seeds)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Communication overhead: IoA adds token/message cost (reported $0.53 per task) and can produce redundant chat content.
  • Agent matching is imperfect: Top@1 precision is 41.4% in the regular setting, so the best-ranked candidate is the wrong partner more often than not.
  • Security and production hardening are incomplete; the Security Module is acknowledged but not enforced.
  • Experiments sometimes use validation subsets or simulated setups (budget and simulation constraints noted).

When Not To Use

  • If minimum latency and minimal message traffic are critical (e.g., hard real-time control).
  • When you cannot adapt third-party agents to the required run(task_desc: str) interface.
  • If the cost of coordination tokens exceeds the value gained and you cannot prune or reduce chat verbosity.
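Adapting an existing agent to the run(task_desc: str) interface mentioned above can be as small as a wrapper class. This is a hedged sketch: only the method name and its `task_desc: str` parameter come from the summary; the `str` return type, `RunnableAgent`, `CallableAgentAdapter`, and `dispatch` are assumptions for illustration, and the actual IoA client API may require more.

```python
from typing import Callable, Protocol


class RunnableAgent(Protocol):
    """Structural type for anything exposing run(task_desc)."""

    def run(self, task_desc: str) -> str: ...


class CallableAgentAdapter:
    """Wrap a plain function (or a third-party agent's entry point)."""

    def __init__(self, name: str, fn: Callable[[str], str]):
        self.name = name
        self._fn = fn

    def run(self, task_desc: str) -> str:
        # Forward the task description unchanged to the wrapped agent.
        return self._fn(task_desc)


def dispatch(agent: RunnableAgent, task_desc: str) -> str:
    """Hypothetical caller that only relies on the run() interface."""
    return agent.run(task_desc)


if __name__ == "__main__":
    echo = CallableAgentAdapter("echo", lambda task: f"done: {task}")
    print(dispatch(echo, "summarize the report"))
```

If an agent cannot be reduced to a single blocking call like this (for example, it needs interactive input mid-task), that is a sign it falls under the "cannot adapt" case above.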

Failure Modes

  • LLMs repeat or rephrase prior messages, causing stalled progress and higher token costs.
  • Clients fail to switch to pause & trigger state, leading to missed synchronization points.
  • Agent discovery returns semantically similar but functionally inadequate agents (imperfect matching).
  • Security risks if untrusted third‑party agents join without stronger authentication.

Core Entities

Models

  • GPT-4 (GPT-4-1106-preview used as judge)
  • GPT-3.5-turbo-0125 (used as core LLM in some IoA configs)
  • AutoGPT
  • Open Interpreter
  • ReAct agents

Metrics

  • win rate (pairwise judged by GPT-4)
  • success rate (RoCoBench)
  • Accuracy
  • Top@1/Top@10/MRR/MR (team formation)

Datasets

  • GAIA
  • RoCoBench
  • TriviaQA
  • Natural Questions (NQ)
  • HotpotQA
  • 2WikiMultiHopQA
  • open-ended instruction benchmark (self-instruct, 153 tasks)

Benchmarks

  • GAIA
  • RoCoBench
  • Open-ended instruction benchmark (153 tasks)
  • RAG QA (4 datasets)