Overview
The benchmark provides systematic, reproducible tests across key architecture choices and reports clear numeric results. Because the results cover only two workflows and six models, generalization is moderate.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.86
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Agent architectures and model choice change real-world success and reliability; pick and test combinations rather than trusting a single best design.
Summary TLDR
AgentArch measures how 18 agent architectures (single vs multi-agent, ReAct vs function calling, memory styles, thinking tools) perform on two enterprise workflows. The best end-to-end success rates are still low: 70.8% on the simple "Requesting Time Off" task (GPT-4.1) and 35.3% on the complex "Customer Request Routing" task (Claude Sonnet 4). Function calling usually beats ReAct; thinking tools help non-reasoning models on simple workflows; multi-agent ReAct performs poorly and often hallucinates. The benchmark shows large, model-specific architecture effects and low reliability across repeated trials (peak pass^k of 0.0634).
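The reliability figure above is reported as pass^k. As a minimal sketch of how such a score could be computed from repeated trials, the code below assumes a common definition: pass^k is the probability that k trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials and then averaged over tasks. The function names and sample outcomes are hypothetical, and this summary does not state the benchmark's exact estimator or its value of k.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)

# Hypothetical outcomes (assumption, not benchmark data): 3 tasks, 4 trials each.
results = [
    [True, True, True, True],
    [True, False, True, True],
    [False, False, True, False],
]
pass_at_1 = benchmark_pass_hat_k(results, k=1)   # plain mean success rate
pass_all_4 = benchmark_pass_hat_k(results, k=4)  # only the first task counts
```

With k = 1 the estimate reduces to the ordinary success rate, so the gap between the two numbers shows how much success depends on which trial is sampled.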
Problem Statement
Enterprise teams need guidance on which agentic architecture to choose. Prior work tests components in isolation; practitioners lack systematic evidence on how orchestration, prompting style, memory, and reasoning tools interact in real enterprise workflows.
Main Contribution
AgentArch benchmark evaluating 18 agentic configurations across six LLMs on two realistic enterprise workflows.
Joint analysis of four design dimensions: orchestration, agent prompting style (function calling vs ReAct), memory sharing (complete vs summarized), and thinking-tool integration.
Key Findings
Top end-to-end (Acceptable pass@1) scores remain low on enterprise tasks.
Function calling generally outperforms ReAct across models and tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Acceptable pass@1 (best on TO) | 70.8% (GPT-4.1, single-agent FC, summarized memory, thinking tools) | — | — | Requesting Time Off (TO) | Sec.4.1, Sec.4.2 | Fig.3, Sec.4.2 |
| Acceptable pass@1 (best on CR) | 35.3% (Claude Sonnet 4, single-agent function calling) | — | — | Customer Request Routing (CR) | Sec.4.1, Sec.4.2 | Fig.3, Sec.4.2 |
What To Try In 7 Days
Run AgentArch or a small subset on your own workflows to find model-architecture fits.
Use function-calling prompts first for tool-heavy workflows and compare against ReAct on one task.
Enable thinking tools (math/summarize) for tasks that require calculations or aggregation; measure latency trade-offs.
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Only two use cases (60 samples each), which do not cover the breadth of enterprise workflows.
Only six models tested, with limited coverage of open-source and reasoning models.
When Not To Use
Your workflow is multimodal (images, PDFs); the benchmark is text-only.
You need conversational, user-in-the-loop workflows; the benchmark focuses on autonomous runs.
Failure Modes
Hallucinated tools or agents (especially in multi-agent ReAct).
Wrong tool arguments that make side effects fail even when the reasoning behind the final decision is correct.
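Both failure modes can in principle be caught before a tool call runs. The sketch below assumes a hypothetical registry that maps each tool name to its required arguments and their types; the tool names `check_leave_balance` and `submit_leave_request` are invented for illustration and do not come from the benchmark.

```python
# Hypothetical tool registry (assumption): tool name -> required argument types.
TOOL_REGISTRY: dict[str, dict[str, type]] = {
    "check_leave_balance": {"employee_id": str},
    "submit_leave_request": {"employee_id": str, "start_date": str, "days": int},
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    schema = TOOL_REGISTRY.get(name)
    if schema is None:
        # The model named a tool that does not exist (a hallucinated tool).
        return [f"unknown tool: {name}"]
    problems = []
    for arg, expected in schema.items():
        if arg not in args:
            problems.append(f"missing argument: {arg}")
        elif not isinstance(args[arg], expected):
            problems.append(f"wrong type for {arg}: expected {expected.__name__}")
    # Flag arguments the tool does not accept.
    problems.extend(f"unexpected argument: {a}" for a in args if a not in schema)
    return problems

hallucinated = validate_tool_call("approve_everything", {})
bad_args = validate_tool_call(
    "submit_leave_request",
    {"employee_id": "E1", "start_date": "2025-01-02", "days": "3"},
)
ok = validate_tool_call("check_leave_balance", {"employee_id": "E1"})
```

Logging the returned problems per run separates failures caused by malformed calls from failures caused by a wrong decision.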

