MACNET: use directed acyclic graphs to scale LLM agents and show a logistic ‘collaborative scaling law’

June 11, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.8

Cost Impact Score

0.6

Citation Count

6

Authors

Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

You can improve quality on mixed tasks by running many cooperating LLM agents in a DAG, without expensive retraining; randomized wiring often gives a good speed-quality trade-off.

Summary TLDR

This paper introduces MACNET, a system that arranges LLM-driven agents into directed acyclic graphs (DAGs). Nodes run 'actors' that produce artifacts and edges run 'critics' that give refinement instructions. By propagating only refined artifacts (not full dialogues) and traversing in topological order, MACNET reduces context growth, supports collaboration at scale, and yields a logistic performance-vs.-size curve: improvements accelerate then saturate. Evaluations on MMLU, HumanEval, SRDD and CommonGen-Hard show MACNET variants beat several baselines; irregular (random) topologies balance quality and time best. Code: github.com/OpenBMB/ChatDev/tree/macnet.

Problem Statement

Existing multi-agent LLM systems rarely test large agent counts and often rely on simple voting or chain structures. The paper asks: how does continually adding collaborating agents affect performance, and can a scalable network design avoid context explosion while harnessing many agents?

Main Contribution

MACNET: a practical framework that maps agents to a DAG with actors on nodes and critics on edges to orchestrate iterative refinement.

A memory-control rule that propagates only final artifacts (not full dialogue), cutting worst-case token growth from quadratic to linear.

Empirical study across benchmarks (MMLU, HumanEval, SRDD, CommonGen-Hard) showing MACNET variants improve average quality and reveal a logistic 'collaborative scaling law'.

Design findings on topology: irregular/random topologies often give the best trade-off between quality and time; dense mesh helps quality but costs more tokens.
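The actor-on-node, critic-on-edge layout and the artifact-only rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `actor`, `critic`, and `run_dag` are hypothetical names, the two role functions are string-building stand-ins for LLM calls, and each edge performs a single critique step rather than a multi-round dialogue.

```python
from graphlib import TopologicalSorter

# Hypothetical stand-ins for LLM calls; a real system would prompt a model here.
def actor(node, instructions):
    """Produce an artifact for `node` from the instructions on its incoming edges."""
    if not instructions:
        return f"draft@{node}"
    return f"artifact@{node}(" + "; ".join(instructions) + ")"

def critic(src, dst, artifact):
    """Review the artifact produced at `src` and return an instruction for `dst`."""
    return f"refine {artifact}"

def run_dag(edges):
    """Visit nodes in topological order, passing only final artifacts along edges."""
    # Map each node to its set of predecessors, the format TopologicalSorter expects.
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    artifacts = {}  # node -> final artifact only; no dialogue history is stored
    for node in TopologicalSorter(preds).static_order():
        # Every predecessor is already finished, so its artifact is available.
        instructions = [critic(p, node, artifacts[p]) for p in sorted(preds[node])]
        artifacts[node] = actor(node, instructions)
    return artifacts

# A small diamond: "a" feeds "b" and "c", which both converge on "d".
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
result = run_dag(edges)
print(result["d"])
```

Node "d" is a convergent node: it receives one instruction per incoming edge and merges them into a single artifact, which is the only thing a downstream node would see.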

Key Findings

MACNET variants outperform multi-agent and single-agent baselines on average across diverse tasks.

Numbers: Quality of MACNET-RANDOM is 0.6522 vs. 0.5805 for AGENTVERSE (Table 1).

Irregular/random topologies can beat regular dense designs while running faster.

Numbers: Random topologies took ~51.92% less time than mesh while matching or exceeding quality (text & Fig.5).

Performance vs. agent scale follows a logistic (sigmoid) curve with early emergence and later saturation.

Numbers: Scaling nodes from 2^0 to 2^6 shows slow→fast→saturating growth; practical saturation at roughly 100 agents (Fig.7 & Sec.3.
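To make the slow→fast→saturating shape concrete, here is a small sketch of a generic logistic curve evaluated over the same 2^0 to 2^6 range. The parameter values (`UPPER`, `RATE`, `MIDPOINT`) are hypothetical and chosen only for illustration; they are not the fitted values from the paper, and the paper's exact parameterization may differ.

```python
import math

def logistic(x, upper, rate, midpoint):
    """Standard logistic curve: slow start, fast middle, saturating end."""
    return upper / (1.0 + math.exp(-rate * (x - midpoint)))

# Hypothetical parameters for illustration only; not fitted to the paper's data.
UPPER, RATE, MIDPOINT = 1.0, 1.5, 3.0

# x is log2 of the node count, covering 2^0 .. 2^6.
scores = [logistic(x, UPPER, RATE, MIDPOINT) for x in range(7)]
# Gain from each doubling of the node count.
gains = [b - a for a, b in zip(scores, scores[1:])]

for x, s in enumerate(scores):
    print(f"nodes=2^{x}: quality={s:.3f}")
```

With these assumed parameters, the per-doubling gains are small at both ends and largest around the midpoint, which is the pattern the finding describes.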

Artifact-only propagation plus topological traversal reduces context/token growth from quadratic to linear in theory.

Numbers: Token complexity without control grows ∝ n^2; with MACNET memory control it becomes linear in n (Section 2.3).
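The quadratic-versus-linear claim can be checked with a back-of-the-envelope count. The sketch below assumes a simple chain of agents, a fixed number of tokens per dialogue, and a fixed artifact size; these are simplifying assumptions made here for illustration, not the paper's derivation, and the function names are hypothetical.

```python
def full_dialogue_tokens(n_agents, dialogue_tokens):
    """Chain where agent i carries every earlier dialogue plus its own."""
    # Agent i handles i dialogues' worth of tokens, so the total is d * n(n+1)/2.
    return sum(i * dialogue_tokens for i in range(1, n_agents + 1))

def artifact_only_tokens(n_agents, dialogue_tokens, artifact_tokens):
    """Chain where each agent reads at most one upstream artifact (upper bound)."""
    # Every agent pays a constant cost, so the total is n * (a + d).
    return n_agents * (artifact_tokens + dialogue_tokens)

for n in (4, 8, 16):
    full = full_dialogue_tokens(n, dialogue_tokens=100)
    lean = artifact_only_tokens(n, dialogue_tokens=100, artifact_tokens=20)
    print(f"n={n}: full-dialogue={full}, artifact-only={lean}")
```

Doubling the agent count doubles the artifact-only total, while the full-dialogue total grows by a factor that approaches four.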

Actors implement the refinements that critics suggest most of the time.

Numbers: When a critic suggests an aspect, actors implement it with 93.10% probability (Section 3.4).

Results

Quality (average across tasks)

Value: MACNET-RANDOM 0.6522

Baseline: AGENTVERSE 0.5805

Accuracy (MMLU)

Value: MACNET-CHAIN 0.6632

Baseline: AGENTVERSE 0.2977

HumanEval (pass@k proxy)

Value: AGENTVERSE 0.7256 (best listed)

Baseline: MACNET-CHAIN 0.3720

SRDD comprehensive

Value: MACNET-CHAIN 0.8056

Baseline: COT 0.7222

Topology timing trade-off

Value: Random ~51.92% less wall time than mesh

Baseline: Mesh

Who Should Care

What To Try In 7 Days

Prototype a small MACNET: assign actor roles at nodes and critic roles on edges using GPT-3.5 or your model.

Enable artifact-only propagation (store only final artifacts), then measure tokens and latency versus full-dialogue passing.

Compare chain, star, and a randomized graph with 10–50 agents to find the best trade-off for your task.
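For the topology comparison, the candidate graphs can be generated as plain edge lists before any agents are attached. This is a minimal sketch with hypothetical helper names; the random generator below only allows edges from a lower-numbered node to a higher-numbered one, which is one simple way to guarantee acyclicity and is an assumption made here, not necessarily how the paper's code builds its random topologies.

```python
import random

def chain(n):
    """0 -> 1 -> ... -> n-1."""
    return [(i, i + 1) for i in range(n - 1)]

def star(n):
    """Node 0 points at every other node."""
    return [(0, i) for i in range(1, n)]

def mesh(n):
    """Every forward pair (i, j) with i < j: the densest DAG on n nodes."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def random_dag(n, edge_prob, seed=0):
    """Keep each forward pair with probability edge_prob; i < j prevents cycles."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    return [(i, j) for i, j in mesh(n) if rng.random() < edge_prob]

for name, edges in [("chain", chain(10)), ("star", star(10)),
                    ("mesh", mesh(10)), ("random", random_dag(10, 0.3))]:
    print(f"{name}: {len(edges)} edges")
```

Note that a sparse random draw can leave some nodes without any edges; a real experiment would probably want to check connectivity before running agents on the graph.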

Agent Features

Memory

  • Short-term memory for interaction context
  • Long-term memory stores only final artifacts (artifact-only propagation)

Planning

  • Topological ordering traversal
  • Iterative local refinement between critic and actor

Tool Use

  • Uses LLMs for reasoning (GPT-3.5 in experiments)
  • Supports agent profiles and external tools (profiles referenced)

Frameworks

  • MACNET (this paper)
  • ChatDev/macnet (code)

Is Agentic

true

Architectures

  • Directed Acyclic Graph (DAG)
  • Functional bipartition: actors (nodes) and critics (edges)

Collaboration

  • Dual-agent iterative refinement per edge (critic→actor→refine)
  • Aggregation at convergent nodes (hierarchical aggregation)

Optimization Features

Token Efficiency

  • Memory control changes worst-case token growth from O(n^2) to O(n)

Infra Optimization

  • Design supports scaling to hundreds/thousands of agent instances by limiting context per agent

System Optimization

  • Assign critics to edges and actors to nodes to split duties and reduce backflow
  • Randomized wiring to reduce average path length and time

Inference Optimization

  • Artifact-only propagation reduces tokens sent between agents
  • Topological traversal avoids global broadcasting

Reproducibility

Data Urls

  • MMLU (public)
  • HumanEval (public)
  • SRDD (Qian et al.)
  • CommonGen-Hard (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on the underlying LLM quality (experiments use GPT-3.5); gains may shrink with weaker models.
  • Dense meshes improve quality but dramatically increase token and time costs.
  • Topology choice matters: no single topology works best for all task types.
  • Experimental saturation and logistic fit are empirical and may shift with different profiles, tools, or models.

When Not To Use

  • When you have a single simple closed-domain task easily solved by a tuned single-model pipeline.
  • If API cost or latency is extremely tight and you cannot afford dozens of LLM calls.
  • When you cannot design or validate critic/actor roles for your domain.

Failure Modes

  • Context explosion if artifact-only propagation is not enforced.
  • Aggregation errors at convergent nodes leading to degraded artifacts.
  • Node/edge roles and prompts may need heavy manual tuning in task-specific domains.
  • Diminishing returns or saturation beyond a practical agent count for a given task.

Core Entities

Models

  • GPT-3.5

Metrics

  • Accuracy
  • pass@k
  • comprehensive SRDD metric
  • composite CommonGen metric
  • Quality (average across tasks)

Datasets

  • MMLU
  • HumanEval
  • SRDD
  • CommonGen-Hard

Benchmarks

  • MMLU
  • HumanEval
  • SRDD
  • CommonGen-Hard