A modular blueprint for running reliable multi-agent workflows with planning, tool refinement, and episodic memory

June 28, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

4

Authors

Noel Crawford, Edward B. Duffy, Iman Evazzade, Torsten Foehr, Gregory Robbins, Debbrata Kumar Saha, Jiya Varma, Marcin Ziolkowski

Why It Matters For Business

Gives a reusable engineering blueprint to run reliable, auditable multi-agent automation across existing enterprise systems without retraining models.

Summary TLDR

This paper presents a practical engineering blueprint for building multi-agent systems driven by large language models (LLMs). It specifies modular components (Planner, Executor, Verifier, Agent Units, Matchers), prompt strategies (ReAct variants, Programmable Prompt, ConvPlanReAct), tool handling (a tool schema plus a Toolbox Refiner), and two memory levels (Short Memory per task and Episodic Memory in a vector DB). The design highlights five multi-agent patterns (Independent, Sequential, Joint, Hierarchical, Broadcast), human-in-the-loop options, and resume/restart behavior for production-grade automation. The paper provides no new model weights and no benchmark experiments.
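The separation of Planner, Executor, and Verifier can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `RunResult` type, and the toy agents are hypothetical stand-ins for LLM-backed components, not the paper's implementation. The one detail taken from the summary is that the Verifier returns a boolean and is not shown the plan.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures; a real system would back each role with an LLM call.
Planner = Callable[[str], List[str]]      # goal -> list of tasks
Executor = Callable[[str], str]           # task -> result
Verifier = Callable[[str, str], bool]     # (goal, final result) -> pass/fail


@dataclass
class RunResult:
    results: List[str]
    verified: bool


def run_pipeline(goal: str, plan: Planner, execute: Executor, verify: Verifier) -> RunResult:
    tasks = plan(goal)                        # Planning: decompose the goal
    results = [execute(t) for t in tasks]     # Execution: one result per task
    final = results[-1] if results else ""
    # Verification: only the goal and the final result are passed, never the plan.
    return RunResult(results=results, verified=verify(goal, final))


def toy_planner(goal: str) -> List[str]:
    return [f"research: {goal}", f"summarize: {goal}"]


def toy_executor(task: str) -> str:
    return f"done({task})"


def toy_verifier(goal: str, final: str) -> bool:
    return goal in final


outcome = run_pipeline("quarterly report", toy_planner, toy_executor, toy_verifier)
print(outcome.verified)
```

Keeping the three roles as plain callables makes each one replaceable on its own, which is the property the blueprint's modularity is meant to provide.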

Problem Statement

Current LLMs are powerful but lack direct access to proprietary systems and reliable multi-step execution. Organizations need a reusable engineering pattern to compose narrow expert agents, orchestrate tools and memory, verify results, and scale multi-agent workflows in enterprise IT environments.

Main Contribution

A modular agent engineering framework that separates Planning, Execution, and Verification and fits IT environments that mix modern and legacy systems.

ConvPlanReAct: a conversational extension of ReAct/PlanReAct that adds dialog-aware steps and explicit next-agent selection (@AgentName).

Tool management via Tool Abstraction and a Toolbox Refiner that narrows relevant tools per task to improve tool-selection accuracy.

Two-level memory: Short Memory for per-task iterative prompts and Episodic Memory (vector DB) for cross-task retrieval and experiential learning.

Definition of Agent Unit and Matcher abstractions plus five multi-agent execution patterns: Independent, Sequential, Joint, Hierarchical, Broadcast.

Guidance for human-in-the-loop interactions (intentional and incidental) and workflow restart/resume behavior using Episodic Memory.
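The restart/resume behavior mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for Episodic Memory (which the paper describes as a vector DB), and `run_with_resume` and `flaky_execute` are hypothetical names. The idea shown is only that tasks with a stored result are skipped when a workflow is restarted.

```python
from typing import Callable, Dict, List


def run_with_resume(
    tasks: List[str],
    execute: Callable[[str], str],
    completed: Dict[str, str],
) -> Dict[str, str]:
    """Run tasks in order, skipping any whose result is already stored."""
    for task in tasks:
        if task in completed:
            continue                      # finished in an earlier run
        completed[task] = execute(task)   # stored only if execute() succeeds
    return completed


calls: List[str] = []


def flaky_execute(task: str) -> str:
    # Hypothetical executor: the first "deploy" attempt fails, as if an
    # external service were unavailable.
    calls.append(task)
    if task == "deploy" and calls.count("deploy") == 1:
        raise RuntimeError("external service unavailable")
    return f"ok:{task}"


store: Dict[str, str] = {}   # stand-in for Episodic Memory (assumption)
plan = ["build", "test", "deploy"]

try:
    run_with_resume(plan, flaky_execute, store)
except RuntimeError:
    pass                     # the workflow stalls here

run_with_resume(plan, flaky_execute, store)   # resume: only "deploy" is re-run
print(calls)
```

On the second call, "build" and "test" are not executed again because their results are already in the store.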

Key Findings

Narrow, persona-like agents perform more reliably than broad agents.

Multi-agent workflows can be realized as five practical patterns: Independent, Sequential, Joint, Hierarchical, Broadcast.

Tool overload reduces LLM selection accuracy; a Toolbox Refiner improves scaling by returning a task-relevant subset.

Episodic Memory (vector DB of completed task episodes) enables retrieval of indirect dependencies and experiential learning across plans.

Verifier agents provide an independent boolean check on final results. They do not see the plan, which reduces verification bias.
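To make the Toolbox Refiner finding concrete, here is a minimal sketch of a refiner that returns a task-relevant subset of tools. It is an assumption-heavy illustration: token overlap stands in for whatever relevance scoring a real Semantic refiner would use (most likely embeddings), and `Tool`, `refine_toolbox`, `STOPWORDS`, and the example tools are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# Small illustrative stopword list so that words like "the" do not count as matches.
STOPWORDS = {"a", "an", "the", "to", "in", "for", "from", "of"}


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def refine_toolbox(task: str, tools: List[Tool], top_k: int = 2) -> List[Tool]:
    """Return at most top_k tools whose name or description overlaps the task."""
    task_tokens = _tokens(task)
    scored = [(len(task_tokens & _tokens(f"{t.name} {t.description}")), t) for t in tools]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)   # stable for ties
    return [tool for _, tool in relevant[:top_k]]


toolbox = [
    Tool("sql_query", "run a sql query against the sales database"),
    Tool("send_email", "send an email to a recipient"),
    Tool("create_ticket", "create a ticket in the issue tracker"),
    Tool("plot_chart", "plot a chart from tabular sales data"),
]

subset = refine_toolbox("query the sales database for march revenue", toolbox)
print([t.name for t in subset])
```

Only the refined subset would then be listed in the agent's prompt, which keeps the tool list short as the full toolbox grows.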

Who Should Care

What To Try In 7 Days

Prototype a Planner + Task Queue to decompose one recurring business process.

Wrap three narrow agents (e.g., Coder, Architect, Tester) and test a Joint workflow on a small coding task.

Add a Toolbox Refiner to limit the tool list passed to each task, then measure how stable tool selection is across repeated runs.

Agent Features

Memory

  • Short Memory: per-task prompt history
  • Episodic Memory: vector DB of completed task episodes
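The two memory levels can be sketched as follows. This is a toy illustration, not the paper's design: word-count vectors with cosine similarity stand in for a real embedding model and vector DB, and the class and method names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy embedding (assumption): bag-of-words counts instead of a learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ShortMemory:
    """Per-task prompt history, discarded when the task ends."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def add(self, message: str) -> None:
        self.messages.append(message)

    def render(self) -> str:
        return "\n".join(self.messages)


class EpisodicMemory:
    """Store of completed task episodes, searchable by similarity."""

    def __init__(self) -> None:
        self.episodes: List[Tuple[Counter, str, str]] = []

    def add(self, task: str, outcome: str) -> None:
        self.episodes.append((embed(task), task, outcome))

    def search(self, query: str, k: int = 1) -> List[Tuple[float, str, str]]:
        q = embed(query)
        scored = [(cosine(q, vec), task, outcome) for vec, task, outcome in self.episodes]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


episodic = EpisodicMemory()
episodic.add("reset user password in LDAP", "used the ldap_reset tool")
episodic.add("generate monthly sales report", "queried the sales database")
episodic.add("rotate TLS certificates", "ran the certificate script")

hits = episodic.search("create sales report for april")
short = ShortMemory()
short.add(f"Relevant past episode: {hits[0][1]} -> {hits[0][2]}")
print(short.render())
```

Here a retrieved episode is copied into the Short Memory of a new task, which is one plausible way the two levels could interact.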

Planning

  • Planner agent produces DAG of tasks
  • PlanReAct and ConvPlanReAct for iterative planning
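A task DAG such as the one the Planner produces can be scheduled with Python's standard `graphlib` module (Python 3.9+). The sketch below is illustrative: the task names are invented and the DAG is hand-written rather than produced by an LLM. It groups tasks into batches whose dependencies are already satisfied, so tasks in the same batch could run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical plan: each task maps to the set of tasks it depends on.
dag: Dict[str, Set[str]] = {
    "fetch_sales": set(),
    "fetch_costs": set(),
    "merge": {"fetch_sales", "fetch_costs"},
    "report": {"merge"},
}


def execution_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Return tasks grouped into batches that respect the dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                     # raises graphlib.CycleError on a cycle
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)                  # unlocks the dependent tasks
    return batches


print(execution_batches(dag))
```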

Tool Use

  • Tool abstraction via explicit input/output schema in prompts
  • Toolbox Refiner (Identity, Hierarchical, Semantic)
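One way to describe a tool with an explicit input/output schema and render it into a prompt is sketched below. The schema layout is an assumption: the paper does not standardize a tool schema, so `ToolSpec`, its fields, and the JSON format are hypothetical choices for this example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolSpec:
    name: str
    description: str
    input_schema: Dict[str, str]     # parameter name -> type name
    output_schema: Dict[str, str]    # field name -> type name
    func: Callable[..., Any]

    def to_prompt(self) -> str:
        """Render the tool description as JSON text for inclusion in a prompt."""
        return json.dumps(
            {
                "name": self.name,
                "description": self.description,
                "input": self.input_schema,
                "output": self.output_schema,
            },
            sort_keys=True,
        )

    def call(self, **kwargs: Any) -> Any:
        """Reject calls whose argument names do not match the input schema."""
        missing = set(self.input_schema) - set(kwargs)
        unexpected = set(kwargs) - set(self.input_schema)
        if missing or unexpected:
            raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
        return self.func(**kwargs)


def get_weather(city: str) -> Dict[str, str]:
    # Hypothetical tool body; a real tool would call an external API.
    return {"city": city, "forecast": "sunny"}


weather = ToolSpec(
    name="get_weather",
    description="Return a short forecast for a city",
    input_schema={"city": "string"},
    output_schema={"city": "string", "forecast": "string"},
    func=get_weather,
)

print(weather.to_prompt())
print(weather.call(city="Munich"))
```

Validating argument names before the call catches one of the listed failure modes, a mismatched input schema, before the tool is invoked.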

Frameworks

  • ConvPlanReAct
  • Programmable Prompt
  • ReAct
  • PlanReAct

Is Agentic

true

Architectures

  • Modular Plan-Execute-Verify pipeline
  • Agent Unit (one or more agents per unit)

Collaboration

  • Independent
  • Sequential
  • Joint
  • Hierarchical
  • Broadcast
  • Agent matching (semantic, mention, sequence)
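Two of the listed matching strategies, mention and sequence, can be sketched in a few lines. The `@AgentName` convention comes from ConvPlanReAct as summarized above. Combining the two matchers with a fallback is a choice made for this example, and the function names are hypothetical. A semantic matcher is omitted because it would need an embedding model.

```python
import re
from typing import List, Optional

MENTION = re.compile(r"@(\w+)")


def mention_matcher(message: str, agents: List[str]) -> Optional[str]:
    """Return the first @-mentioned name that is a known agent, if any."""
    for name in MENTION.findall(message):
        if name in agents:
            return name
    return None


def sequence_matcher(current: str, agents: List[str]) -> str:
    """Return the agent that follows the current one, wrapping around."""
    index = agents.index(current)
    return agents[(index + 1) % len(agents)]


def next_agent(message: str, current: str, agents: List[str]) -> str:
    # Prefer an explicit mention; otherwise fall back to the fixed sequence.
    return mention_matcher(message, agents) or sequence_matcher(current, agents)


team = ["Architect", "Coder", "Tester"]
print(next_agent("Looks good, @Tester please verify", "Architect", team))
print(next_agent("No mention here", "Tester", team))
```

Mentions of unknown names are ignored, so a hallucinated `@AgentName` falls through to the sequence order instead of stalling the workflow.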

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No quantitative experiments or benchmarks are reported to validate effectiveness.
  • Framework assumes access to reliable external tools/APIs for many workflows.
  • Verifier is a boolean check; richer verification metrics are not specified.
  • No standardized tool input schema across tools; schema varies per tool and must be implemented.
  • Human-in-the-loop introduces latency and requires operational handling not fully specified.

When Not To Use

  • Safety-critical systems requiring formal guarantees and audit trails beyond heuristic verification.
  • Environments with no stable APIs or tools to perform external actions.
  • When you need measured performance or accuracy benchmarks before deployment.

Failure Modes

  • Agent hallucination leading to incorrect actions or tool calls.
  • Wrong agent selection due to imperfect matchers or ambiguous personas.
  • Tool misuse because of mismatched input schema or insufficient tool documentation.
  • Verifier bias or insufficiency giving false positives/negatives.
  • Workflow stalls from human-in-the-loop delays or unavailable external services.

Core Entities

Models

  • Large Language Models (unspecified)

Context Entities

Frameworks

  • AutoGen
  • AutoGPT
  • LangChain
  • LlamaIndex
  • MetaGPT
  • AgentVerse
  • AgentLite