AgentArch: benchmark of 18 agent architectures across 6 LLMs on two enterprise workflows

September 13, 2025 (7 min read)

Overview

Decision Snapshot: Needs Validation

The benchmark provides systematic, reproducible tests across key architecture choices and clear numeric results; results are limited to two workflows and six models so generalization is moderate.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 40%

Authors

Tara Bogavelli, Roshnee Sharma, Hari Subramani

Links

Abstract / PDF / Code

Why It Matters For Business

Agent architectures and model choice change real-world success and reliability; pick and test combinations rather than trusting a single best design.

Who Should Care

Summary TLDR

AgentArch measures how 18 agent architectures (single vs. multi-agent, ReAct vs. function calling, complete vs. summarized memory, with or without thinking tools) perform on two enterprise workflows. Even the best end-to-end success rates are low: 70.8% on the simpler "Requesting Time Off" (TO) task (GPT-4.1) and 35.3% on the more complex "Customer Request Routing" (CR) task (Claude Sonnet 4). Function calling usually beats ReAct; thinking tools help non-reasoning models on simple workflows; multi-agent ReAct performs poorly and often hallucinates. Architecture effects are large and model-specific, and reliability across repeated trials is low (pass^k peaks at 0.0634).
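To make the gap between single-run success and repeated-trial reliability concrete, here is a minimal sketch of the two metrics. It assumes pass@1 is the mean success rate over individual trials and that pass^k counts a sample as passed only when all k of its trials succeed (the "all k trials succeed" definition listed under Metrics). The function names and the toy outcomes are illustrative and are not taken from the AgentArch code.

```python
from typing import Dict, List


def pass_at_1(trials: Dict[str, List[bool]]) -> float:
    """Mean success rate over every individual trial of every sample."""
    outcomes = [ok for runs in trials.values() for ok in runs]
    return sum(outcomes) / len(outcomes)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of samples for which all k recorded trials succeeded."""
    return sum(all(runs) for runs in trials.values()) / len(trials)


# Toy data (hypothetical): 4 samples, k = 4 trials each.
trials = {
    "sample_1": [True, True, True, True],
    "sample_2": [True, False, True, True],
    "sample_3": [False, False, True, False],
    "sample_4": [True, True, False, True],
}

print(f"pass@1 = {pass_at_1(trials):.3f}")
print(f"pass^k = {pass_hat_k(trials):.3f}")
```

In this toy data roughly two thirds of individual trials succeed, yet only one sample in four passes every trial, which is how a pass^k figure can sit far below pass@1.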

Problem Statement

Enterprise teams need guidance on which agentic architecture to choose. Prior work tests components in isolation; practitioners lack systematic evidence on how orchestration, prompting style, memory, and reasoning tools interact in real enterprise workflows.

Main Contribution

AgentArch benchmark evaluating 18 agentic configurations across six LLMs on two realistic enterprise workflows.

Joint analysis of four design dimensions: orchestration, agent prompting style (function calling vs ReAct), memory sharing (complete vs summarized), and thinking-tool integration.
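One way to picture the design space is as one choice per dimension. The sketch below enumerates the unrestricted cross product of the four dimensions; the `AgentConfig` type and the option names are assumptions modeled on the feature tags in this summary, not the benchmark's own identifiers. The full product has 24 entries, whereas the benchmark reports 18 configurations, so some combinations are evidently excluded; which ones is not stated in this summary.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AgentConfig:
    orchestration: str    # single agent or one of the multi-agent setups
    prompting: str        # function calling vs ReAct
    memory: str           # complete vs summarized history
    thinking_tools: bool  # whether helper "thinking" tools are exposed


# Option names are assumptions based on the feature tags in this summary.
ORCHESTRATION = ["single_agent", "orchestrator_isolated", "orchestrator_open_network"]
PROMPTING = ["function_calling", "react"]
MEMORY = ["complete_memory", "summarized_memory"]
THINKING_TOOLS = [False, True]

grid = [
    AgentConfig(o, p, m, t)
    for o, p, m, t in product(ORCHESTRATION, PROMPTING, MEMORY, THINKING_TOOLS)
]

print(len(grid))  # size of the unrestricted cross product
```

Keeping the configuration as a frozen dataclass makes each combination hashable, so it can serve directly as a key when collecting per-configuration results.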

Key Findings

Top end-to-end (Acceptable pass@1) scores remain low on enterprise tasks.

Numbers: TO best 70.8% (GPT-4.1); CR best 35.3% (Sonnet 4).

Practical Use: Expect frequent failures in production; plan for human oversight and retries rather than full automation.

Evidence Ref: Abstract, Sec.4.1, Fig.3, Sec.4.2

Function calling generally outperforms ReAct across models and tasks.

Numbers: Function-calling cells show higher pass@1 in heatmaps (Fig.3).

Practical Use: Prefer function-calling agent prompts when reliable tool selection and argument fidelity matter.

Evidence Ref: Sec.4.1, Fig.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Acceptable pass@1 (best on TO) | 70.8% (GPT-4.1, single-agent FC, summarized memory, thinking tools) | - | - | Requesting Time Off (TO) | Sec.4.1, Sec.4.2 | Fig.3, Sec.4.2
Acceptable pass@1 (best on CR) | 35.3% (Claude Sonnet 4, single-agent function calling) | - | - | Customer Request Routing (CR) | Sec.4.1, Sec.4.2 | Fig.3, Sec.4.2

What To Try In 7 Days

Run AgentArch or a small subset on your own workflows to find model-architecture fits.

Use function-calling prompts first for tool-heavy workflows and compare against ReAct on one task.

Enable thinking tools (math/summarize) for tasks that require calculations or aggregation; measure latency trade-offs.
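A first-week comparison can be as small as the harness sketched below, which runs the same samples under two configurations and reports success rate alongside mean wall-clock latency. Everything here is hypothetical scaffolding: `stub_agent` stands in for your own agent runner, and the sample names and config keys are invented for illustration.

```python
import time
from typing import Callable, Dict, List


def evaluate(
    run_agent: Callable[[str, dict], bool],
    config: dict,
    samples: List[str],
) -> Dict[str, float]:
    """Run every sample once; report success rate and mean latency in seconds."""
    successes, latencies = [], []
    for sample in samples:
        start = time.perf_counter()
        ok = run_agent(sample, config)
        latencies.append(time.perf_counter() - start)
        successes.append(ok)
    return {
        "success_rate": sum(successes) / len(successes),
        "mean_latency_s": sum(latencies) / len(latencies),
    }


def stub_agent(sample: str, config: dict) -> bool:
    # Stand-in for a real agent call; replace with your own runner.
    # Pretends the thinking tool only matters for samples needing arithmetic.
    return config["thinking_tools"] or "calc" not in sample


samples = ["pto_calc_1", "pto_lookup_2", "pto_calc_3", "pto_lookup_4"]
configs = {
    "function_calling": {"thinking_tools": False},
    "function_calling+thinking": {"thinking_tools": True},
}

for name, cfg in configs.items():
    print(name, evaluate(stub_agent, cfg, samples))
```

With a real runner, repeating each sample several times gives a more honest picture, since the summary above reports low reliability across repeated trials.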

Agent Features

Memory
complete_memory, summarized_memory
Planning
ReAct, function_calling
Tool Use
function_calling, thinking_tools
Frameworks
ReAct_prompt, function_calling_API
Is Agentic

Yes

Architectures
single_agent, multi_agent, orchestrator_isolated, orchestrator_open_network
Collaboration
orchestrator_mediated, agent_to_agent

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only two use cases (60 samples each) — not covering broad enterprise diversity.

Six models tested; limited open-source and reasoning-model coverage.

When Not To Use

If your workflow is multimodal (images, PDFs) — benchmark is text-only.

If you need conversational, user-in-the-loop workflows — this benchmark focuses on autonomous runs.

Failure Modes

Hallucinated tools or agents (especially in multi-agent ReAct).

Wrong tool arguments causing failed side effects despite correct final decision reasoning.
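Both failure modes can be partly caught before a tool call is executed. The sketch below checks a proposed call against a registry of known tools and their required argument names. The registry contents and the `check_tool_call` helper are hypothetical, and the check covers argument names only; detecting wrong argument values requires comparison against ground truth.

```python
from typing import Dict, List, Set

# Hypothetical tool registry: tool name -> required argument names.
TOOL_REGISTRY: Dict[str, Set[str]] = {
    "get_pto_balance": {"employee_id"},
    "submit_pto_request": {"employee_id", "start_date", "end_date"},
}


def check_tool_call(name: str, args: dict) -> List[str]:
    """Return a list of problems; an empty list means the call looks well formed."""
    if name not in TOOL_REGISTRY:
        return [f"hallucinated tool: {name}"]

    required = TOOL_REGISTRY[name]
    provided = set(args)
    problems = []

    missing = sorted(required - provided)
    if missing:
        problems.append("missing arguments: " + ", ".join(missing))

    unexpected = sorted(provided - required)
    if unexpected:
        problems.append("unexpected arguments: " + ", ".join(unexpected))

    return problems


print(check_tool_call("approve_everything", {}))
print(check_tool_call("submit_pto_request", {"employee_id": "E1", "start": "2025-01-01"}))
```

Rejecting a malformed call and asking the agent to retry is one option; logging the problem list also gives a rough per-run signal for how often each failure mode occurs.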

Core Entities

Models

GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini, LLaMA 3.3 70B, Claude Sonnet 4

Metrics

Acceptable Score (tools + args + outcome)
Acceptable pass@1
pass^k (all k trials succeed)
Hallucination rate
Tool repetition rate
Missing required tool rate
Correct final decision rate
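The following sketch shows how per-run flags of this kind could be derived and aggregated into rates. It assumes a run is "acceptable" only when no required tool is missing, the arguments are correct, and the final decision is correct; that reading of "tools + args + outcome" is an assumption, and the `RunRecord` fields and tool names are hypothetical rather than the benchmark's own schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    called_tools: List[str]    # tool names in the order the agent called them
    required_tools: List[str]  # tools the ground truth expects
    known_tools: List[str]     # tools that actually exist in the environment
    args_correct: bool         # required calls used the expected arguments
    decision_correct: bool     # final decision matches the ground truth


def flags(run: RunRecord) -> Dict[str, bool]:
    """Derive per-run boolean flags from one recorded run."""
    counts = Counter(run.called_tools)
    missing = any(tool not in counts for tool in run.required_tools)
    return {
        "hallucination": any(tool not in run.known_tools for tool in counts),
        "tool_repetition": any(n > 1 for n in counts.values()),
        "missing_required_tool": missing,
        "correct_final_decision": run.decision_correct,
        "acceptable": (not missing) and run.args_correct and run.decision_correct,
    }


def rates(runs: List[RunRecord]) -> Dict[str, float]:
    """Average each flag over all runs."""
    per_run = [flags(run) for run in runs]
    return {key: sum(f[key] for f in per_run) / len(per_run) for key in per_run[0]}


known = ["get_pto_balance", "submit_pto_request"]
runs = [
    RunRecord(["get_pto_balance", "submit_pto_request"], known, known, True, True),
    RunRecord(["get_pto_balance", "get_pto_balance"], known, known, True, True),
    RunRecord(["get_pto_balance", "escalate_to_hr", "submit_pto_request"], known, known, False, True),
    RunRecord(["get_pto_balance", "submit_pto_request"], known, known, True, False),
]

print(rates(runs))
```

The second and third toy runs reach the correct final decision without being acceptable, mirroring the failure mode in which the reasoning is right but the tool side effects are not.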

Datasets

Requesting Time Off (TO) - 60 samples
Customer Request Routing (CR) - 60 samples

Benchmarks

AgentArch

Context Entities

Datasets

Mock enterprise data with long KB articles and messy JSON tool outputs