Overview
Production Readiness
0.6
Novelty Score
0.8
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
Running many cooperating LLM agents in a DAG can improve quality on mixed tasks without expensive retraining; randomized wiring often gives a good speed-quality trade-off.
Summary TLDR
This paper introduces MACNET, a system that arranges LLM-driven agents into directed acyclic graphs (DAGs). Nodes run 'actors' that produce artifacts and edges run 'critics' that give refinement instructions. By propagating only refined artifacts (not full dialogues) and traversing in topological order, MACNET reduces context growth, supports collaboration at scale, and yields a logistic performance-vs.-size curve: improvements accelerate then saturate. Evaluations on MMLU, HumanEval, SRDD and CommonGen-Hard show MACNET variants beat several baselines; irregular (random) topologies balance quality and time best. Code: github.com/OpenBMB/ChatDev/tree/macnet.
Problem Statement
Existing multi-agent LLM systems rarely test large agent counts and often rely on simple voting or chain structures. We ask: how does continuous addition of collaborating agents affect performance, and can a scalable network design avoid context explosion while harnessing many agents?
Main Contribution
MACNET: a practical framework that maps agents to a DAG with actors on nodes and critics on edges to orchestrate iterative refinement.
A memory-control rule that propagates only final artifacts (not full dialogue), cutting worst-case token growth from quadratic to linear.
Empirical study across benchmarks (MMLU, HumanEval, SRDD, CommonGen-Hard) showing MACNET variants improve average quality and reveal a logistic 'collaborative scaling law'.
Design findings on topology: irregular/random topologies often give the best trade-off between quality and time; dense mesh helps quality but costs more tokens.
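As a minimal sketch of the actor-on-node, critic-on-edge layout described above, the following Python walks a small DAG in topological order and passes only artifacts downstream. The names `actor`, `critic`, and `run_dag`, the string stubs standing in for LLM calls, and the concatenation used at convergent nodes are all assumptions for illustration; none of them are taken from the MACNET codebase.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def actor(node, task, artifact=None, instruction=None):
    """Hypothetical stand-in for an LLM actor call at a node."""
    if artifact is None:
        return f"{task}:draft@{node}"       # source node: produce a first draft
    return f"{artifact}|refined@{node}"     # otherwise: revise the incoming artifact


def critic(edge, artifact):
    """Hypothetical stand-in for an LLM critic call on an edge."""
    src, dst = edge
    return f"instruction for {dst} on artifact from {src}"


def run_dag(edges, task):
    # Map every node to the set of its predecessors.
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    artifacts = {}  # only artifacts are kept, never the critic/actor dialogue
    for node in TopologicalSorter(preds).static_order():
        incoming = sorted(preds[node])
        if not incoming:
            artifacts[node] = actor(node, task)
            continue
        refined = []
        for src in incoming:
            instruction = critic((src, node), artifacts[src])
            refined.append(actor(node, task, artifacts[src], instruction))
        # Placeholder aggregation (assumed): join the refined candidates.
        artifacts[node] = " + ".join(refined)
    return artifacts


# Diamond-shaped DAG: 0 -> {1, 2} -> 3
result = run_dag([(0, 1), (0, 2), (1, 3), (2, 3)], "task")
```

In a real prototype the two stubs would wrap model calls, and the join at convergent nodes would be replaced by whatever aggregation prompt suits the task.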
Key Findings
MACNET variants outperform multi-agent and single-agent baselines on average across diverse tasks.
Irregular/random topologies can beat regular dense designs while running faster.
Performance vs. agent scale follows a logistic (sigmoid) curve with early emergence and later saturation.
Artifact-only propagation plus topological traversal reduces context/token growth from quadratic to linear in theory.
Actors implement the refinements requested by critics most of the time.
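The logistic (sigmoid) finding above can be illustrated with the generic four-parameter logistic function. The parameter names (`floor`, `height`, `rate`, `midpoint`) and every numeric value below are illustrative assumptions, not fitted values from the paper.

```python
import math


def logistic_quality(n_agents, floor, height, rate, midpoint):
    """Generic logistic curve: floor + height / (1 + exp(-rate * (n - midpoint)))."""
    return floor + height / (1.0 + math.exp(-rate * (n_agents - midpoint)))


# Assumed, illustrative parameters: quality rises from about 0.5 toward 0.8.
scales = [1, 2, 4, 8, 16, 32, 64]
curve = [logistic_quality(n, floor=0.5, height=0.3, rate=0.25, midpoint=8)
         for n in scales]

# Gain from each doubling of the agent count: small, then large, then small again.
gains = [later - earlier for earlier, later in zip(curve, curve[1:])]
```

With these made-up parameters the per-doubling gain grows up to the midpoint and then shrinks toward zero, which is the accelerate-then-saturate pattern the finding describes.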
Results
Quality (average across tasks)
Accuracy
HumanEval (pass@k proxy)
SRDD comprehensive
Topology timing trade-off
Who Should Care
What To Try In 7 Days
Prototype a small MACNET: assign actor roles to nodes and critic roles to edges, using GPT-3.5 or your own model.
Enable artifact-only propagation (store only final artifacts), then measure tokens and latency versus full-dialogue passing.
Compare chain, star, and a randomized graph with 10–50 agents to find the best trade-off for your task.
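To make the topology comparison in the last step concrete, here is a small sketch that builds chain, star, and randomized DAG edge lists and computes graph depth as a rough proxy for the number of sequential steps. The function names, the `extra_edge_prob` parameter, and the rule of giving every node one random earlier parent are assumptions for illustration; the paper's own graph generator may differ.

```python
import random


def chain(n):
    """0 -> 1 -> ... -> n-1"""
    return [(i, i + 1) for i in range(n - 1)]


def star(n):
    """Node 0 feeds every other node."""
    return [(0, i) for i in range(1, n)]


def random_dag(n, extra_edge_prob=0.1, seed=0):
    """Random DAG: edges always point from a lower to a higher index, so no cycles."""
    rng = random.Random(seed)
    edges = set()
    for dst in range(1, n):
        edges.add((rng.randrange(dst), dst))   # one earlier parent keeps it connected
        for src in range(dst):
            if rng.random() < extra_edge_prob:
                edges.add((src, dst))
    return sorted(edges)


def depth(n, edges):
    """Longest path length in edges. Assumes every edge goes low index -> high index."""
    longest = [0] * n
    for src, dst in sorted(edges, key=lambda e: e[1]):
        longest[dst] = max(longest[dst], longest[src] + 1)
    return max(longest) if n else 0


graph = random_dag(20, seed=1)
summary = {
    "chain": depth(20, chain(20)),
    "star": depth(20, star(20)),
    "random": depth(20, graph),
}
```

Depth only approximates wall-clock time; token cost also depends on edge count, so it is worth logging `len(edges)` alongside measured latency for each topology.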
Agent Features
Memory
- Short-term memory for interaction context
- Long-term memory stores only final artifacts (artifact-only propagation)
Planning
- Topological ordering traversal
- Iterative local refinement between critic and actor
Tool Use
- Uses LLMs for reasoning (GPT-3.5 in experiments)
- Supports agent profiles and external tools (profiles referenced)
Frameworks
- MACNET (this paper)
- ChatDev/macnet (code)
Is Agentic
true
Architectures
- Directed Acyclic Graph (DAG)
- Functional bipartition: actors (nodes) and critics (edges)
Collaboration
- Dual-agent iterative refinement per edge (critic→actor→refine)
- Aggregation at convergent nodes (hierarchical aggregation)
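The two collaboration patterns above might be sketched as follows. The fixed number of rounds, the callable signatures, and the pluggable `merge` function are assumptions made for illustration; the source only states that critic and actor iterate on each edge and that convergent nodes aggregate.

```python
from typing import Callable, List


def refine_on_edge(artifact: str,
                   critic: Callable[[str], str],
                   actor: Callable[[str, str], str],
                   rounds: int = 2) -> str:
    """Alternate critic instruction and actor revision for a fixed number of rounds."""
    for _ in range(rounds):
        instruction = critic(artifact)           # critic reviews the current artifact
        artifact = actor(artifact, instruction)  # actor revises it accordingly
    return artifact  # only the final artifact leaves the edge; instructions are dropped


def aggregate(candidates: List[str], merge: Callable[[List[str]], str]) -> str:
    """Combine artifacts arriving at a convergent node into one artifact."""
    if not candidates:
        raise ValueError("a convergent node needs at least one incoming artifact")
    if len(candidates) == 1:
        return candidates[0]
    return merge(candidates)


# Toy stubs in place of LLM calls (hypothetical behaviour).
final = refine_on_edge("x", critic=lambda a: "add detail",
                       actor=lambda a, instr: a + "+", rounds=3)
chosen = aggregate(["a", "bb"], merge=lambda c: max(c, key=len))
```

A fixed round count keeps per-edge cost predictable; an early-stop condition driven by the critic is another option, at the price of less predictable cost.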
Optimization Features
Token Efficiency
- Memory control changes worst-case token growth from O(n^2) to O(n)
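A back-of-the-envelope model shows where the quadratic-versus-linear difference comes from. It assumes a simple chain of agents in which every step emits the same number of tokens, which is a simplification chosen for illustration rather than the paper's own analysis.

```python
def full_dialogue_tokens(n_agents, tokens_per_step):
    """Agent i (0-based) reads the output of all i earlier steps.

    Total context read = t * (0 + 1 + ... + (n-1)) = t * n * (n - 1) / 2.
    """
    return sum(i * tokens_per_step for i in range(n_agents))


def artifact_only_tokens(n_agents, tokens_per_step):
    """Each agent after the first reads just one upstream artifact: t * (n - 1)."""
    return max(n_agents - 1, 0) * tokens_per_step


# Compare total context read at a few scales (200 tokens per step is an assumed value).
comparison = {
    n: (full_dialogue_tokens(n, 200), artifact_only_tokens(n, 200))
    for n in (4, 16, 64)
}
```

Under this model the ratio between the two totals is n/2, so the saving grows in proportion to the number of agents.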
Infra Optimization
- Design supports scaling to hundreds/thousands of agent instances by limiting context per agent
System Optimization
- Assign critics to edges and actors to nodes to split duties and reduce backflow
- Randomized wiring to reduce average path length and time
Inference Optimization
- Artifact-only propagation reduces tokens sent between agents
- Topological traversal avoids global broadcasting
Reproducibility
Data Urls
- MMLU (public)
- HumanEval (public)
- SRDD (Qian et al.)
- CommonGen-Hard (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on the underlying LLM quality (experiments use GPT-3.5); gains may shrink with weaker models.
- Dense meshes improve quality but dramatically increase token and time costs.
- Topology choice matters: no single topology works best for all task types.
- Experimental saturation and logistic fit are empirical and may shift with different profiles, tools, or models.
When Not To Use
- When you have a single simple closed-domain task easily solved by a tuned single-model pipeline.
- When API cost or latency budgets are tight and you cannot afford dozens of LLM calls per task.
- When you cannot design or validate critic/actor roles for your domain.
Failure Modes
- Context explosion if artifact-only propagation is not enforced.
- Aggregation errors at convergent nodes leading to degraded artifacts.
- Node/edge roles and prompts need heavy manual tuning in task-specific domains.
- Diminishing returns or saturation beyond a practical agent count for a given task.
Core Entities
Models
- GPT-3.5
Metrics
- Accuracy
- pass@k
- comprehensive SRDD metric
- composite CommonGen metric
- Quality (average across tasks)
Datasets
- MMLU
- HumanEval
- SRDD
- CommonGen-Hard
Benchmarks
- MMLU
- HumanEval
- SRDD
- CommonGen-Hard

