MACNET: use directed acyclic graphs to scale LLM agents and show a logistic ‘collaborative scaling law’

Overview

Decision Snapshot: Needs Validation

MACNET presents a clear, reproducible system-level design and empirical gains, but results depend on base LLM quality and topology choices; expect engineering work to tune scale and wiring for your tasks.

Citations: 6

Evidence Strength: 0.75

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 80%

Authors

Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can improve quality on mixed tasks by running many cooperating LLM agents in a DAG and avoid expensive retraining; randomized wiring often gives a good speed-quality trade-off.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper introduces MACNET, a system that arranges LLM-driven agents into directed acyclic graphs (DAGs). Nodes run 'actors' that produce artifacts and edges run 'critics' that give refinement instructions. By propagating only refined artifacts (not full dialogues) and traversing in topological order, MACNET reduces context growth, supports collaboration at scale, and yields a logistic performance-vs.-size curve: improvements accelerate then saturate. Evaluations on MMLU, HumanEval, SRDD and CommonGen-Hard show MACNET variants beat several baselines; irregular (random) topologies balance quality and time best. Code: github.com/OpenBMB/ChatDev/tree/macnet.
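The "accelerate then saturate" shape described above is what a logistic function produces. A minimal sketch, assuming a generic sigmoid over the log of the agent count; the log2 scale and the parameters `floor`, `ceiling`, `midpoint`, and `steepness` are hypothetical illustrations, not values fitted in the paper.

```python
import math

def logistic_quality(n_agents, floor=0.2, ceiling=0.7, midpoint=4.0, steepness=1.0):
    """Generic logistic curve over log2(number of agents).

    All parameter values are illustrative assumptions, not fitted results.
    """
    x = math.log2(n_agents)
    return floor + (ceiling - floor) / (1.0 + math.exp(-steepness * (x - midpoint)))

# Quality rises slowly, then quickly around the midpoint, then flattens.
for n in (1, 4, 16, 64, 256):
    print(f"{n:>4} agents -> quality {logistic_quality(n):.3f}")
```

With a curve of this shape, the practical question is where the midpoint sits for your task: adding agents past the point of saturation costs tokens without a matching quality gain.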

Problem Statement

Existing multi-agent LLM systems rarely test large agent counts and often rely on simple voting or chain structures. The paper asks how continuously adding collaborating agents affects performance, and whether a scalable network design can avoid context explosion while harnessing many agents.

Main Contribution

MACNET: a practical framework that maps agents to a DAG with actors on nodes and critics on edges to orchestrate iterative refinement.

A memory-control rule that propagates only final artifacts (not full dialogue), cutting worst-case token growth from quadratic to linear.
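The quadratic-versus-linear claim can be checked with simple counting. A minimal sketch, assuming a chain of agents in which every turn produces the same number of tokens; the function names and the fixed `tokens_per_turn` are hypothetical simplifications, not the paper's accounting.

```python
def full_dialogue_tokens(n_agents, tokens_per_turn):
    """Total context received when every agent is handed all earlier turns."""
    # Agent i (0-based) receives i earlier turns, so the sum grows as n^2.
    return sum(i * tokens_per_turn for i in range(n_agents))

def artifact_only_tokens(n_agents, tokens_per_turn):
    """Total context received when each agent is handed only the latest artifact."""
    # Every agent after the first receives exactly one artifact, so growth is linear.
    return (n_agents - 1) * tokens_per_turn

for n in (10, 100):
    full = full_dialogue_tokens(n, 500)
    slim = artifact_only_tokens(n, 500)
    print(f"n={n}: full dialogue={full}, artifact only={slim}")
```

Under these assumptions, 10 agents at 500 tokens per turn receive 22,500 tokens in total with full-dialogue passing and 4,500 with artifact-only passing; at 100 agents the totals are 2,475,000 and 49,500.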

Key Findings

MACNET variants outperform multi-agent and single-agent baselines on average across diverse tasks.

Numbers: Quality: MACNET-RANDOM 0.6522 vs AGENTVERSE 0.5805 (Table 1).

Practical Use: For mixed tasks (knowledge, coding, software dev, generation), try MACNET-style DAG networks instead of single-agent prompts or simple majority voting to improve end-to-end quality.

Evidence Ref: Table 1

Irregular/random topologies can beat regular dense designs while running faster.

Numbers: Random topologies took ~51.92% less time than mesh while matching or exceeding quality (text & Fig. 5).

Practical Use: Use randomized or small-world edge wiring to balance performance and latency rather than defaulting to fully connected meshes.

Evidence Ref: Figure 5 and Section 3.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Quality (average across tasks)	MACNET-RANDOM 0.6522	AGENTVERSE 0.5805	+0.0717	—	Table 1 reports MACNET-RANDOM Quality 0.6522 versus baseline AGENTVERSE 0.5805.	Table 1
Accuracy	MACNET-CHAIN 0.6632	AGENTVERSE 0.2977	+0.3655	MMLU	Table 1 shows MACNET-CHAIN MMLU 0.6632 vs AGENTVERSE 0.2977.	Table 1

What To Try In 7 Days

Prototype a small MACNET: assign actor roles at nodes and critic roles on edges using GPT-3.5 or your model.

Enable artifact-only propagation (store only final artifacts), then measure tokens and latency versus full-dialogue passing.

Compare chain, star, and a randomized graph with 10–50 agents to find the best trade-off for your task.
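The topology comparison in the last step can start from a few edge-list builders. A minimal sketch: the helper names, the `edge_prob` parameter, and the use of edge count and longest-path depth as rough cost proxies are assumptions for illustration, not the paper's generators. Edges always point from a lower to a higher node index, which guarantees the graph is acyclic.

```python
import random

def chain(n):
    return [(i, i + 1) for i in range(n - 1)]

def star(n):
    return [(0, i) for i in range(1, n)]

def mesh(n):
    # Fully connected DAG: every node points to every higher-indexed node.
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def random_dag(n, edge_prob=0.2, seed=0):
    rng = random.Random(seed)
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < edge_prob}
    # Give any node without a predecessor one random incoming edge,
    # so every node is reachable from node 0.
    for j in range(1, n):
        if not any(v == j for _, v in edges):
            edges.add((rng.randrange(j), j))
    return sorted(edges)

def depth(n, edges):
    """Length of the longest path, counted in edges."""
    longest = [0] * n
    # Sorted order visits all edges into a node before any edge out of it.
    for u, v in sorted(edges):
        longest[v] = max(longest[v], longest[u] + 1)
    return max(longest)

n = 20
for name, edges in [("chain", chain(n)), ("star", star(n)),
                    ("mesh", mesh(n)), ("random", random_dag(n))]:
    print(f"{name:<6} edges={len(edges):>3} depth={depth(n, edges)}")
```

Because critics sit on edges, the edge count is a rough stand-in for the number of critic-actor exchanges, and depth is a rough stand-in for how many sequential steps cannot be parallelized. Measure real tokens and latency on your own task before drawing conclusions.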

Agent Features

Memory

Short-term memory for interaction context

Long-term memory stores only final artifacts (artifact-only propagation)

Planning

Topological ordering traversal

Iterative local refinement between critic and actor
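Topological ordering means each agent runs only after all of its upstream agents have finished. A minimal sketch using Python's standard-library `graphlib` (Python 3.9+); the node names and the `topological_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def topological_order(edges):
    """Return nodes so that every edge (src, dst) has src before dst."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    # TopologicalSorter expects a mapping of node -> its predecessors.
    return list(TopologicalSorter(predecessors).static_order())

# Diamond-shaped graph: "a" fans out to "b" and "c", which converge on "d".
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(topological_order(edges))
```

In this example "a" always comes first and "d" last; the relative order of "b" and "c" is not fixed, which is also what allows them to run in parallel.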

Tool Use

Uses LLMs for reasoning (GPT-3.5 in experiments)

Supports agent profiles and external tools (profiles referenced)

Frameworks

MACNET (this paper)

ChatDev/macnet (code)

Is Agentic

Yes

Architectures

Directed Acyclic Graph (DAG)

Functional bipartition: actors (nodes) and critics (edges)

Collaboration

Dual-agent iterative refinement per edge (critic→actor→refine)

Aggregation at convergent nodes (hierarchical aggregation)
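The per-edge refinement and the aggregation at convergent nodes can be sketched as a single pass over the graph. This is a minimal illustration under stated assumptions, not the released implementation: `run_dag`, the stub functions, and their signatures are hypothetical, the stubs stand in for LLM calls, and each edge does one critic-actor exchange rather than several rounds.

```python
from graphlib import TopologicalSorter

def run_dag(task, edges, actor, critic, aggregate):
    """Traverse the DAG once, storing only the final artifact per node."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, [])
        preds.setdefault(dst, []).append(src)
    artifacts = {}  # artifact-only memory: no dialogue history is kept
    for node in TopologicalSorter(preds).static_order():
        if not preds[node]:
            artifacts[node] = actor(node, task, None)  # source node drafts from the task
            continue
        refined = []
        for src in preds[node]:
            instruction = critic(src, node, artifacts[src])  # critic on edge src -> node
            refined.append(actor(node, artifacts[src], instruction))
        # A convergent node merges the candidates from all incoming edges.
        artifacts[node] = refined[0] if len(refined) == 1 else aggregate(node, refined)
    return artifacts

# Deterministic stubs standing in for LLM calls (assumptions for illustration).
def stub_actor(node, material, instruction):
    return f"{material}>{node}"

def stub_critic(src, dst, artifact):
    return f"review {artifact}"

def stub_aggregate(node, candidates):
    return " + ".join(sorted(candidates))

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
result = run_dag("T", edges, stub_actor, stub_critic, stub_aggregate)
print(result["d"])
```

Replacing the stubs with real model calls keeps the control flow unchanged; the aggregation function is where the failure mode of degraded artifacts at convergent nodes would need to be handled.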

Optimization Features

Token Efficiency

Memory control changes worst-case token growth from O(n^2) to O(n)

Infra Optimization

Design supports scaling to hundreds/thousands of agent instances by limiting context per agent

System Optimization

Assign critics to edges and actors to nodes to split duties and reduce backflow

Randomized wiring to reduce average path length and time

Inference Optimization

Artifact-only propagation reduces tokens sent between agents

Topological traversal avoids global broadcasting

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenBMB/ChatDev/tree/macnet

Data URLs

MMLU (public)

HumanEval (public)

SRDD (Qian et al.)

CommonGen-Hard (public)

Risks & Boundaries

Limitations

Relies on the underlying LLM quality (experiments use GPT-3.5); gains may shrink with weaker models.

Dense meshes improve quality but dramatically increase token and time costs.

When Not To Use

When you have a single simple closed-domain task easily solved by a tuned single-model pipeline.

If API cost or latency is extremely tight and you cannot afford dozens of LLM calls.

Failure Modes

Context explosion if artifact-only propagation is not enforced.

Aggregation errors at convergent nodes leading to degraded artifacts.

Core Entities

Models

GPT-3.5

Metrics

Accuracy

pass@k

Comprehensive SRDD metric

Composite CommonGen metric

Quality (average across tasks)

Datasets

MMLU

HumanEval

SRDD

CommonGen-Hard

Benchmarks

MMLU

HumanEval

SRDD

CommonGen-Hard

