Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.2
Citation Count
0
Why It Matters For Business
Agent architectures and model choice change real-world success and reliability; pick and test combinations rather than trusting a single best design.
Summary TLDR
AgentArch measures how 18 agent architectures (single-agent vs. multi-agent orchestration, ReAct vs. function calling, complete vs. summarized memory, with and without thinking tools) perform on two enterprise workflows. Even the best end-to-end success rates are low: 70.8% on the simpler "Requesting Time Off" (TO) task (GPT-4.1) and 35.3% on the more complex "Customer Request Routing" (CR) task (Claude Sonnet 4). Function calling usually beats ReAct; thinking tools help non-reasoning models on simple workflows; multi-agent ReAct performs poorly and often hallucinates. The benchmark shows large, model-specific architecture effects and low reliability across repeated trials (pass^k peaks at 0.0634).
Problem Statement
Enterprise teams need guidance on which agentic architecture to choose. Prior work tests components in isolation; practitioners lack systematic evidence on how orchestration, prompting style, memory, and reasoning tools interact in real enterprise workflows.
Main Contribution
AgentArch benchmark evaluating 18 agentic configurations across six LLMs on two realistic enterprise workflows.
Joint analysis of four design dimensions: orchestration, agent prompting style (function calling vs ReAct), memory sharing (complete vs summarized), and thinking-tool integration.
Quantitative results showing model-specific architecture preferences, reliability gaps, and trade-offs between decision accuracy and end-to-end execution.
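As a rough illustration of how the four design dimensions combine, the sketch below enumerates a configuration grid. The dimension values are borrowed from the feature labels used on this page; the `AgentConfig` class, the `full_grid` helper, and the choice of a full cross product are assumptions made for illustration. The full product of these values gives 24 combinations, while the benchmark reports 18 configurations, so the benchmark evidently does not evaluate every combination; which ones it omits is not stated here.

```python
from dataclasses import dataclass
from itertools import product

# Dimension values borrowed from the feature labels on this page.
ORCHESTRATION = ("single_agent", "orchestrator_isolated", "orchestrator_open_network")
AGENT_STYLE = ("function_calling", "ReAct")
MEMORY = ("complete_memory", "summarized_memory")
THINKING_TOOLS = (False, True)


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for one point in the design space."""
    orchestration: str
    agent_style: str
    memory: str
    thinking_tools: bool

    def label(self) -> str:
        thinking = "thinking" if self.thinking_tools else "no_thinking"
        return f"{self.orchestration}/{self.agent_style}/{self.memory}/{thinking}"


def full_grid() -> list:
    """Full cross product of the four dimensions (NOT the benchmark's exact 18)."""
    combos = product(ORCHESTRATION, AGENT_STYLE, MEMORY, THINKING_TOOLS)
    return [AgentConfig(*combo) for combo in combos]
```

Calling `full_grid()` and filtering the result is one way to pick a small subset of configurations to run against your own workflow.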
Key Findings
Top end-to-end (Acceptable pass@1) scores remain low on enterprise tasks.
Function calling generally outperforms ReAct across models and tasks.
Thinking tools help non-reasoning models on simple calculation-heavy tasks.
Multi-agent ReAct configurations cause many hallucinations and underperform.
Models differ strongly in which architectures suit them, and scores vary widely across configurations.
End-to-end reliability across repeated trials is very low.
Results
Acceptable pass@1 (best on TO)
70.8% (GPT-4.1)
Acceptable pass@1 (best on CR)
35.3% (Claude Sonnet 4)
Pass^K (all 8 trials succeed) peak
0.0634
Correct final decision (GPT-4.1, CR, multi-agent FC)
Who Should Care
What To Try In 7 Days
Run AgentArch or a small subset on your own workflows to find model-architecture fits.
Use function-calling prompts first for tool-heavy workflows and compare against ReAct on one task.
Enable thinking tools (math/summarize) for tasks that require calculations or aggregation; measure latency trade-offs.
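A minimal sketch of such a comparison, assuming you already have some way to run your agent on one sample. The `run_agent` callable, its `(style, sample) -> bool` signature, and the `compare_styles` helper are all hypothetical; they stand in for whatever harness you use and are not part of AgentArch.

```python
import time
from typing import Callable, Iterable

# Hypothetical signature: run one sample under one prompting style and
# report whether the run met your own success criterion.
Runner = Callable[[str, dict], bool]


def compare_styles(run_agent: Runner, samples: Iterable[dict],
                   styles=("function_calling", "ReAct")) -> dict:
    """Return success rate and mean wall-clock latency (seconds) per style."""
    samples = list(samples)
    report = {}
    for style in styles:
        outcomes, latencies = [], []
        for sample in samples:
            start = time.perf_counter()
            outcomes.append(bool(run_agent(style, sample)))
            latencies.append(time.perf_counter() - start)
        count = len(samples)
        report[style] = {
            "success_rate": sum(outcomes) / count if count else 0.0,
            "mean_latency_s": sum(latencies) / count if count else 0.0,
        }
    return report
```

The same loop can compare runs with and without thinking tools by passing different labels in `styles`, which makes the latency cost of the extra tool calls visible next to any accuracy gain.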
Agent Features
Memory
- complete_memory
- summarized_memory
Planning
- ReAct
- function_calling
Tool Use
- function_calling
- thinking_tools
Frameworks
- ReAct_prompt
- function_calling_API
Is Agentic
true
Architectures
- single_agent
- multi_agent
- orchestrator_isolated
- orchestrator_open_network
Collaboration
- orchestrator_mediated
- agent_to_agent
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only two use cases (60 samples each), which do not cover the breadth of enterprise workflows.
- Six models tested; limited open-source and reasoning-model coverage.
- Text-only tools and inputs; no multimodal tools or files.
- All runs use temperature=0, so interactions between sampling and architecture are unexplored.
- Acceptable Score requires perfect tool+args+outcome and may undercount partial business value.
When Not To Use
- If your workflow is multimodal (images, PDFs): the benchmark is text-only.
- If you need conversational, user-in-the-loop workflows: the benchmark focuses on autonomous runs.
- If you want broad cross-industry claims: only two specific enterprise tasks were tested.
Failure Modes
- Hallucinated tools or agents (especially in multi-agent ReAct).
- Wrong tool arguments causing failed side effects despite correct final decision reasoning.
- High variance across architecture choices causing unpredictability.
- Low multi-trial reliability (very low pass^K).
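The tool-level failure modes above can be checked mechanically from a run's tool-call trace. The sketch below is one possible way to do that, not the benchmark's own implementation: `diagnose_trace`, the `TraceDiagnostics` fields, and the rule that any repeated call counts as a repetition are assumptions, and the tool names in the usage note are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceDiagnostics:
    hallucinated: list  # called names that are not registered tools or agents
    repeated: list      # registered tools called more than once
    missing: list       # required tools that were never called


def diagnose_trace(called: list, registered: set, required: set) -> TraceDiagnostics:
    """Classify tool-call problems in a single run (assumed rules, see lead-in)."""
    counts = Counter(called)
    return TraceDiagnostics(
        hallucinated=sorted(name for name in counts if name not in registered),
        repeated=sorted(name for name, n in counts.items()
                        if n > 1 and name in registered),
        missing=sorted(required - set(counts)),
    )
```

For example, with registered tools `get_balance`, `approve_leave`, and `notify_manager` (all required), a trace of `["get_balance", "get_balance", "approve_leave", "send_fax"]` would flag `send_fax` as hallucinated, `get_balance` as repeated, and `notify_manager` as missing.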
Core Entities
Models
- GPT-4.1
- GPT-4o
- GPT-4.1-mini
- o3-mini
- LLaMA 3.3 70B
- Claude Sonnet 4
Metrics
- Acceptable Score (tools + args + outcome)
- Acceptable pass@1
- Pass^K (all k trials succeed)
- Hallucination rate
- Tool repetition rate
- Missing required tool rate
- Correct final decision rate
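To make the relationship between these metrics concrete, here is a small sketch under stated assumptions. The `Trial` record and both function names are hypothetical. The all-or-nothing Acceptable Score and the "all k trials succeed" reading of Pass^K follow the descriptions above; averaging per-trial success over samples for pass@1 is an assumed definition and may differ from the benchmark's exact computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    tools_ok: bool    # the expected tools were called
    args_ok: bool     # the tool arguments were correct
    outcome_ok: bool  # the final decision was correct

    @property
    def acceptable(self) -> bool:
        # All three conditions must hold; partial credit is not given.
        return self.tools_ok and self.args_ok and self.outcome_ok


def pass_at_1(results: dict) -> float:
    """Per-trial acceptable rate, averaged over samples (assumed definition)."""
    per_sample = [sum(t.acceptable for t in trials) / len(trials)
                  for trials in results.values()]
    return sum(per_sample) / len(per_sample)


def pass_hat_k(results: dict) -> float:
    """Fraction of samples whose k trials are ALL acceptable."""
    all_passed = [all(t.acceptable for t in trials) for trials in results.values()]
    return sum(all_passed) / len(all_passed)
```

Here `results` maps a sample id to its list of trials. A single wrong argument in one of a sample's trials lowers pass@1 only slightly but removes that sample from Pass^K entirely, which is why Pass^K can sit far below pass@1.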
Datasets
- Requesting Time Off (TO) - 60 samples
- Customer Request Routing (CR) - 60 samples
Benchmarks
- AgentArch
Context Entities
Datasets
- Mock enterprise data with long KB articles and messy JSON tool outputs

