A modular graph framework that lets multiple LLM agents collaborate, create agents, and supervise each other

June 5, 2023 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

50

Authors

Yashar Talebirad, Amirhossein Nadiri

Links

Abstract / PDF

Why It Matters For Business

Modular LLM agents let teams split complex workflows, add verifiers to reduce costly errors, and plug in APIs safely, but they add orchestration costs and governance requirements.

Summary TLDR

This paper proposes a formal, graph-based framework for multi-agent systems built from large language model (LLM) instances. Agents are defined as tuples (model, role, state, can-create, can-halt). The framework adds plugins (APIs, vector DBs, retrievers), feedback loops (inter-agent and self-feedback), stateless oracle agents, supervisor/halting controls, and dynamic agent creation. The authors map the design onto Auto-GPT, BabyAGI, and Gorilla as case studies and discuss practical limits: looping, security, scalability, evaluation, and ethics. The paper is conceptual and contains no empirical benchmarks.

Problem Statement

LLMs are powerful but usually act alone. Real tasks need modular, coordinated behavior, safer API use, and ways to detect looping or hallucination. The paper offers a formal multi-agent layout to make LLMs collaborate, delegate, verify, and scale, while highlighting governance and evaluation gaps.

Main Contribution

A graph-based formalization where nodes are agents or plugins and edges are communication channels.

A compact agent tuple A = (L, R, S, C, H): model, role, state, create-permission, halt-list.

Design patterns: plugins for tools/APIs, oracle (stateless) agents, supervisor agents, and feedback/self-feedback loops.

Support for dynamic agent creation and halting to manage workloads and correct bad behavior.

Applied mappings to Auto-GPT, BabyAGI, and Gorilla plus two case studies (court simulation, software development).
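The tuple and graph ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Agent`, `Plugin`, and `AgentGraph`, the use of directed edges, and the choice to store H as a set of agent names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple, Union


@dataclass
class Agent:
    """Hypothetical encoding of the tuple A = (L, R, S, C, H)."""
    model: str                                         # L: underlying language model
    role: str                                          # R: role of the agent
    state: dict = field(default_factory=dict)          # S: current state
    can_create: bool = False                           # C: create-permission
    halt_list: Set[str] = field(default_factory=set)   # H: agents it may halt


@dataclass
class Plugin:
    """A tool node, e.g. a vector DB or an API wrapper."""
    name: str


class AgentGraph:
    """Nodes are agents or plugins; edges are communication channels."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Union[Agent, Plugin]] = {}
        self.edges: Set[Tuple[str, str]] = set()   # assumed directed: (src, dst)

    def add_node(self, name: str, node: Union[Agent, Plugin]) -> None:
        self.nodes[name] = node

    def connect(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be nodes")
        self.edges.add((src, dst))

    def can_send(self, src: str, dst: str) -> bool:
        return (src, dst) in self.edges

    def can_halt(self, src: str, dst: str) -> bool:
        node = self.nodes.get(src)
        return isinstance(node, Agent) and dst in node.halt_list


graph = AgentGraph()
graph.add_node("supervisor", Agent("gpt-4", "supervisor", halt_list={"worker"}))
graph.add_node("worker", Agent("gpt-3.5-turbo", "worker"))
graph.add_node("memory", Plugin("vector-db"))
graph.connect("worker", "memory")
```

Keeping permissions (C and H) on the agent and channels on the graph separates "who may talk to whom" from "who may create or stop whom", which is one reasonable reading of the formalization.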

Key Findings

Agents can be modeled as tuples (L, R, S, C, H) to standardize behavior and permissions.

The framework covers three real systems as case studies (Auto-GPT, BabyAGI, Gorilla).

Numbers: 3 case studies

Supervisor and oracle agents are proposed to catch loops, hallucinations, and unsafe API actions.

Dynamic agent creation increases flexibility but risks resource exhaustion and conflicts.
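One way to contain the resource-exhaustion risk of dynamic agent creation is a hard quota enforced at creation time, combined with permission checks for creating and halting. The sketch below is an assumption-laden illustration: `SpawnRegistry`, `max_agents`, and the dictionary layout are hypothetical names, and the paper does not prescribe a specific quota mechanism.

```python
class SpawnRegistry:
    """Tracks live agents and enforces a global quota (hypothetical design)."""

    def __init__(self, max_agents: int) -> None:
        self.max_agents = max_agents
        self.active = {}   # name -> {"can_create": bool, "halt_list": set}

    def register(self, name, can_create=False, halt_list=()):
        if name in self.active:
            raise ValueError(f"agent {name!r} already exists")
        if len(self.active) >= self.max_agents:
            raise RuntimeError("agent quota exceeded")
        self.active[name] = {"can_create": can_create, "halt_list": set(halt_list)}

    def spawn(self, creator, name, can_create=False, halt_list=()):
        # Only agents holding the create-permission (C) may add new agents.
        if not self.active[creator]["can_create"]:
            raise PermissionError(f"{creator!r} may not create agents")
        self.register(name, can_create, halt_list)

    def halt(self, supervisor, target):
        # Only agents listed in the supervisor's halt-list (H) can be stopped.
        if target not in self.active[supervisor]["halt_list"]:
            raise PermissionError(f"{supervisor!r} may not halt {target!r}")
        self.active.pop(target, None)
```

Halting an agent frees a slot under the quota, so a supervisor that stops a misbehaving worker also makes room for a replacement.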

Who Should Care

What To Try In 7 Days

Build a two-agent prototype: one agent with a memory plugin and one with a web-access plugin, then test the pair on a simple task.

For high-risk steps, add a stateless oracle that verifies outputs before any action is taken.

Model an existing LLM pipeline (e.g., BabyAGI) as separate agents plus a vector DB plugin and run a few tasks end-to-end.
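The oracle step in the list above can be prototyped before any LLM is wired in. In this sketch the oracle is a plain function whose verdict depends only on its input, which is what makes it stateless. The names `allowlist_oracle` and `guarded_execute`, and the `search:` / `read:` prefixes, are invented for illustration; in a real prototype the oracle body would be replaced by a model call.

```python
from typing import Callable


def allowlist_oracle(proposed_call: str) -> bool:
    """Stateless check: no memory is kept between calls.

    Stand-in for an LLM-backed verifier; the prefixes are assumptions.
    """
    allowed_prefixes = ("search:", "read:")
    return proposed_call.startswith(allowed_prefixes)


def guarded_execute(
    proposed_call: str,
    oracle: Callable[[str], bool],
    execute: Callable[[str], str],
) -> str:
    """Run `execute` only if the oracle approves the proposed call."""
    if not oracle(proposed_call):
        return "blocked: " + proposed_call
    return execute(proposed_call)
```

Passing the oracle and the executor as parameters keeps the gate independent of any particular model or plugin, so the same wrapper can sit in front of each high-risk action.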

Agent Features

Memory

  • short-term and long-term via plugins
  • stateless oracle (no memory)

Planning

  • task decomposition
  • dynamic agent creation
  • role assignment
  • supervisor-driven halting

Tool Use

  • plugins for APIs
  • vector DB for context
  • document retriever
  • code execution plugins

Frameworks

  • graph-based black box
  • tuple agent representation (L,R,S,C,H)

Is Agentic

true

Architectures

  • GPT-4
  • GPT-3.5-turbo
  • LLaMA
  • Auto-GPT
  • BabyAGI
  • Gorilla

Collaboration

  • inter-agent messaging
  • feedback / self-feedback loops
  • shared boards / shared storage

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • No empirical evaluation or benchmarks provided.
  • Dynamic agent creation can lead to resource exhaustion without quotas.
  • Scalability and coordination become harder as agent count grows.
  • Security risks from code execution and API access need engineering safeguards.
  • Ethical and evaluation frameworks are not fully specified.

When Not To Use

  • When you need proven, benchmarked performance backed by experiments.
  • For ultra-low-latency single-query services where orchestration adds delay.
  • In tightly regulated settings without strict governance for API access and audit trails.

Failure Modes

  • Agents get stuck in loops and fail to progress.
  • Uncontrolled spawning of agents exhausts compute resources.
  • Conflicting or overlapping roles cause redundant or contradictory actions.
  • Hallucinations lead to unsafe API calls if not checked by an oracle.
  • Evaluation blind spots arise because traditional metrics don't capture multi-agent dynamics.
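The looping failure mode listed above lends itself to a simple supervisor-side heuristic: flag an agent when it keeps producing the same output within a short window. This is only one possible detector, sketched under assumptions; `LoopDetector`, `window`, and `max_repeats` are hypothetical names, and exact-match comparison after whitespace and case normalization will miss paraphrased loops.

```python
from collections import deque


class LoopDetector:
    """Flags an agent whose recent outputs repeat too often (heuristic)."""

    def __init__(self, window: int = 4, max_repeats: int = 2) -> None:
        self.recent = deque(maxlen=window)   # sliding window of recent outputs
        self.max_repeats = max_repeats

    def observe(self, output: str) -> bool:
        """Record one output; return True if the agent should be halted."""
        normalized = " ".join(output.lower().split())
        self.recent.append(normalized)
        return self.recent.count(normalized) > self.max_repeats
```

A supervisor could call `observe` on every message from a worker and use its halt permission when the method returns True.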

Core Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • LLaMA
  • Auto-GPT
  • BabyAGI
  • Gorilla

Context Entities

Models

  • GPT-4
  • GPT-3.5-turbo
  • LLaMA
  • Gorilla
  • Auto-GPT
  • BabyAGI