AgentArch: benchmark of 18 agent architectures across 6 LLMs on two enterprise workflows

September 13, 2025 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.2

Citation Count

0

Authors

Tara Bogavelli, Roshnee Sharma, Hari Subramani

Links

Abstract / PDF

Why It Matters For Business

Agent architectures and model choice change real-world success and reliability; pick and test combinations rather than trusting a single best design.

Summary TLDR

AgentArch measures how 18 agent architectures (single vs multi-agent, ReAct vs function-calling, memory styles, thinking tools) perform on two enterprise workflows. Best end-to-end success rates are still low: 70.8% on a simple "Request Time Off" task (GPT-4.1) and 35.3% on a complex "Customer Routing" task (Sonnet 4). Function calling usually beats ReAct; thinking tools help non-reasoning models on simple workflows; multi-agent ReAct performs poorly and often hallucinates. The benchmark shows large, model-specific architecture effects and low reliability across repeated trials (pass^k peak 0.0634).

Problem Statement

Enterprise teams need guidance on which agentic architecture to choose. Prior work tests components in isolation; practitioners lack systematic evidence on how orchestration, prompting style, memory, and reasoning tools interact in real enterprise workflows.

Main Contribution

AgentArch benchmark evaluating 18 agentic configurations across six LLMs on two realistic enterprise workflows.

Joint analysis of four design dimensions: orchestration, agent prompting style (function calling vs ReAct), memory sharing (complete vs summarized), and thinking-tool integration.

Quantitative results showing model-specific architecture preferences, reliability gaps, and trade-offs between decision accuracy and end-to-end execution.

Key Findings

Top end-to-end (Acceptable pass@1) scores remain low on enterprise tasks.

Numbers: TO best 70.8% (GPT-4.1); CR best 35.3% (Sonnet 4).

Function calling generally outperforms ReAct across models and tasks.

Numbers: Function-calling cells show higher pass@1 in heatmaps (Fig. 3).

Thinking tools help non-reasoning models on simple calculation-heavy tasks.

Numbers: GPT-4.1 improved 48.5% → 70.8% on TO with thinking tools enabled.

Multi-agent ReAct configurations cause many hallucinations and underperform.

Numbers: Sonnet 4 hallucination rates ~36% in multi-agent ReAct vs 0% in other configs.

Models show strong, differing preferences for architectures and large variance.

Numbers: o3-mini CV=143.7%; GPT-4.1 CV=27.0%; Sonnet 4 CV=32.1% (TO).
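The coefficient of variation (CV) is the standard deviation of a set of scores divided by their mean, expressed as a percentage. Here it appears to summarize how much one model's score swings across architectures. A minimal sketch, assuming population standard deviation (the summary does not say which variant the paper uses) and made-up scores that are not from the paper:

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """Return CV in percent: 100 * std / mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(scores) / m

# Hypothetical pass@1 scores for one model across architectures.
# These values are illustrative only, not results from AgentArch.
stable = [0.60, 0.65, 0.70, 0.62]
erratic = [0.02, 0.50, 0.01, 0.03]

cv_stable = coefficient_of_variation(stable)
cv_erratic = coefficient_of_variation(erratic)
```

A CV above 100%, as in the `erratic` list, means the spread across architectures is larger than the average score itself, so the architecture choice dominates the outcome for that model.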

End-to-end reliability across repeated trials is very low.

Numbers: pass^k peak = 0.0634 (6.34%).

Results

Acceptable pass@1 (best on TO)

Value: 70.8% (GPT-4.1, single-agent FC, summarized memory, thinking tools)

Acceptable pass@1 (best on CR)

Value: 35.3% (Claude Sonnet 4, single-agent function calling)

Pass^k (all 8 trials succeed) peak

Value: 0.0634
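The gap between pass@1 and pass^k is easier to see in code. The sketch below assumes each sample is run several times and each run is marked acceptable or not. It treats pass@1 as the mean success rate over all runs and pass^k as the share of samples whose k runs all succeed. The function names and the data are hypothetical, and the paper's exact estimator may differ:

```python
def pass_at_1(trials):
    """Mean success rate over every trial of every sample."""
    flat = [ok for sample in trials for ok in sample]
    return sum(flat) / len(flat)

def pass_hat_k(trials, k):
    """Fraction of samples whose first k trials all succeeded."""
    if any(len(sample) < k for sample in trials):
        raise ValueError("every sample needs at least k trials")
    return sum(all(sample[:k]) for sample in trials) / len(trials)

# Hypothetical outcomes: 3 samples x 8 trials (True = acceptable run).
trials = [
    [True] * 8,            # always succeeds
    [True] * 7 + [False],  # fails once in eight runs
    [False] * 8,           # never succeeds
]

p1 = pass_at_1(trials)      # 15 of 24 runs succeed
pk = pass_hat_k(trials, 8)  # only the first sample passes all 8 runs
```

A single failed run removes a sample from pass^k, which is why a configuration with a respectable pass@1 can still score near zero on pass^k.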

Correct final decision (GPT-4.1, CR, multi-agent FC)

Value: 97-99% correct final decision

Who Should Care

What To Try In 7 Days

Run AgentArch or a small subset on your own workflows to find model-architecture fits.

Use function-calling prompts first for tool-heavy workflows and compare against ReAct on one task.

Enable thinking tools (math/summarize) for tasks that require calculations or aggregation; measure latency trade-offs.
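A small comparison harness is enough to start on these steps. The sketch below enumerates a grid of prompt style, memory style, and thinking tools, then scores each configuration with a caller-supplied runner. Everything here is hypothetical scaffolding: `run_agent` stands in for your own agent invocation, and `stub_run_agent` uses an arbitrary rule purely so the example executes. Its output is not a benchmark result.

```python
from itertools import product

PROMPT_STYLES = ["function_calling", "react"]
MEMORY = ["complete", "summarized"]
THINKING = [False, True]

def evaluate(run_agent, samples, configs):
    """Return {config: pass rate} using run_agent(config, sample) -> bool."""
    results = {}
    for cfg in configs:
        outcomes = [run_agent(cfg, s) for s in samples]
        results[cfg] = sum(outcomes) / len(outcomes)
    return results

def stub_run_agent(cfg, sample):
    # Placeholder with an arbitrary rule so the sketch runs end to end.
    # Replace with a real agent run that returns True on an acceptable outcome.
    style, memory, thinking = cfg
    return style == "function_calling" and (thinking or sample % 2 == 0)

configs = list(product(PROMPT_STYLES, MEMORY, THINKING))
scores = evaluate(stub_run_agent, samples=range(10), configs=configs)

# Rank configurations from best to worst pass rate.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_config, best_score = ranking[0]
```

When measuring the latency trade-off for thinking tools, the same loop can record wall-clock time per run alongside the boolean outcome.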

Agent Features

Memory

  • complete_memory
  • summarized_memory

Planning

  • ReAct
  • function_calling

Tool Use

  • function_calling
  • thinking_tools

Frameworks

  • ReAct_prompt
  • function_calling_API

Is Agentic

true

Architectures

  • single_agent
  • multi_agent
  • orchestrator_isolated
  • orchestrator_open_network

Collaboration

  • orchestrator_mediated
  • agent_to_agent

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only two use cases (60 samples each) — not covering broad enterprise diversity.
  • Six models tested; limited open-source and reasoning-model coverage.
  • Text-only tools and inputs; no multimodal tools or files.
  • All runs at temperature=0 — sampling interactions with architecture unexplored.
  • Acceptable Score requires perfect tool+args+outcome and may undercount partial business value.
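The strictness of the Acceptable Score can be illustrated with a short check. This sketch assumes a run is acceptable only when every expected tool call appears with identical arguments and the final outcome matches. Whether the paper also enforces call order or penalizes extra calls is not stated here, so neither is modeled. The tool names and the `ToolCall` type are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs so calls are hashable

def call(name, **kwargs):
    """Build a ToolCall from keyword arguments."""
    return ToolCall(name, tuple(sorted(kwargs.items())))

def is_acceptable(expected_calls, actual_calls, expected_outcome, actual_outcome):
    """True only if tools, arguments, and final outcome all match."""
    # Tools: every expected tool name was called at least once.
    tools_ok = {c.name for c in expected_calls} <= {c.name for c in actual_calls}
    # Args: every expected call appears with exactly the expected arguments.
    args_ok = set(expected_calls) <= set(actual_calls)
    return tools_ok and args_ok and expected_outcome == actual_outcome

# Hypothetical time-off request: tool names are illustrative only.
expected = [call("check_balance", employee_id="E1"), call("submit_request", days=3)]
wrong_args = [call("check_balance", employee_id="E1"), call("submit_request", days=4)]

strict_pass = is_acceptable(expected, expected, "approved", "approved")
strict_fail = is_acceptable(expected, wrong_args, "approved", "approved")
```

The `wrong_args` run reaches the right decision but submits the wrong number of days, so it fails the check. A partial-credit metric would score it above zero, which is the undercounting this limitation describes.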

When Not To Use

  • If your workflow is multimodal (images, PDFs) — benchmark is text-only.
  • If you need conversational, user-in-the-loop workflows — this benchmark focuses on autonomous runs.
  • If you want broad cross-industry claims — only two specific enterprise tasks were tested.

Failure Modes

  • Hallucinated tools or agents (especially in multi-agent ReAct).
  • Wrong tool arguments causing failed side effects despite correct final decision reasoning.
  • High variance across architecture choices causing unpredictability.
  • Low multi-trial reliability (very low pass^k).
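Several of these failure modes can be flagged from a run's tool-call trace alone. The sketch below assumes a trace is an ordered list of tool names and uses simple working definitions: a hallucinated tool is any name absent from the registry, a repeated tool is any name called more than once, and a missing tool is any required name never called. These definitions and the tool names are assumptions, not the paper's exact metrics:

```python
from collections import Counter

def diagnose(called_tools, known_tools, required_tools):
    """Flag three failure modes for one run, given the tool names it called."""
    counts = Counter(called_tools)
    return {
        "hallucinated": sorted(set(called_tools) - set(known_tools)),
        "repeated": sorted(name for name, n in counts.items() if n > 1),
        "missing_required": sorted(set(required_tools) - set(called_tools)),
    }

def rate(runs, key):
    """Share of runs in which the given failure mode occurred at least once."""
    return sum(bool(r[key]) for r in runs) / len(runs)

# Hypothetical registry and traces; tool names are illustrative only.
known = {"lookup_policy", "check_balance", "submit_request"}
required = ["check_balance", "submit_request"]

bad_run = diagnose(["check_balance", "check_balance", "approve_leave"], known, required)
clean_run = diagnose(["check_balance", "submit_request"], known, required)

hallucination_rate = rate([bad_run, clean_run], "hallucinated")
```

A repeated call with different arguments can be legitimate, so a production check would likely compare arguments as well as names.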

Core Entities

Models

  • GPT-4.1
  • GPT-4o
  • GPT-4.1-mini
  • o3-mini
  • LLaMA 3.3 70B
  • Claude Sonnet 4

Metrics

  • Acceptable Score (tools + args + outcome)
  • Acceptable pass@1
  • Pass^k (all k trials succeed)
  • Hallucination rate
  • Tool repetition rate
  • Missing required tool rate
  • Correct final decision rate

Datasets

  • Requesting Time Off (TO) - 60 samples
  • Customer Request Routing (CR) - 60 samples

Benchmarks

  • AgentArch

Context Entities

Datasets

  • Mock enterprise data with long KB articles and messy JSON tool outputs