AgentArch: benchmark of 18 agent architectures across 6 LLMs on two enterprise workflows

Overview

Decision Snapshot: Needs Validation

The benchmark provides systematic, reproducible tests across key architecture choices, with clear numeric results. Because the results cover only two workflows and six models, they generalize only moderately.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 30%

Novelty: 40%

Authors

Tara Bogavelli, Roshnee Sharma, Hari Subramani

Why It Matters For Business

Agent architectures and model choice change real-world success and reliability; pick and test combinations rather than trusting a single best design.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

AgentArch measures how 18 agent architectures (single vs multi-agent, ReAct vs function calling, memory styles, thinking tools) perform on two enterprise workflows. Even the best end-to-end success rates are low: 70.8% on the simpler "Requesting Time Off" (TO) task (GPT-4.1) and 35.3% on the more complex "Customer Request Routing" (CR) task (Claude Sonnet 4). Function calling usually beats ReAct; thinking tools help non-reasoning models on simple workflows; multi-agent ReAct performs poorly and often hallucinates. The benchmark shows large, model-specific architecture effects and low reliability across repeated trials (pass^k peaks at 0.0634).

Problem Statement

Enterprise teams need guidance on which agentic architecture to choose. Prior work tests components in isolation; practitioners lack systematic evidence on how orchestration, prompting style, memory, and reasoning tools interact in real enterprise workflows.

Main Contribution

AgentArch benchmark evaluating 18 agentic configurations across six LLMs on two realistic enterprise workflows.

Joint analysis of four design dimensions: orchestration, agent prompting style (function calling vs ReAct), memory sharing (complete vs summarized), and thinking-tool integration.
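
The four dimensions can be pictured as a small configuration grid. The sketch below is illustrative only: the summary does not state which combinations make up the 18 configurations, so the rule that thinking tools are toggled only for function calling is an assumption, chosen because it yields 18. The AgentConfig class and enumerate_configs function are hypothetical names, not part of the AgentArch code.

```python
from dataclasses import dataclass
from itertools import product

# Value names follow the tags used on this page.
ORCHESTRATION = ("single_agent", "orchestrator_isolated", "orchestrator_open_network")
MEMORY = ("complete_memory", "summarized_memory")

# ASSUMPTION: thinking tools are switched on/off only for function calling,
# while ReAct runs without them. This is a guess that produces 18 configs.
STYLE_AND_THINKING = (
    ("function_calling", False),
    ("function_calling", True),
    ("ReAct", False),
)


@dataclass(frozen=True)
class AgentConfig:
    """One point in the (hypothetical) architecture grid."""
    orchestration: str
    style: str
    memory: str
    thinking_tools: bool


def enumerate_configs() -> list[AgentConfig]:
    """Cross the three orchestration modes with memory and style choices."""
    return [
        AgentConfig(orch, style, mem, thinking)
        for orch, mem, (style, thinking) in product(
            ORCHESTRATION, MEMORY, STYLE_AND_THINKING
        )
    ]


if __name__ == "__main__":
    for cfg in enumerate_configs():
        print(cfg)
```

Treating each configuration as an immutable record makes it easy to use as a dictionary key when collecting per-configuration scores.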

Key Findings

Top end-to-end (Acceptable pass@1) scores remain low on enterprise tasks.

Numbers: TO best 70.8% (GPT-4.1); CR best 35.3% (Sonnet 4).

Practical Use: Expect frequent failures in production; plan for human oversight and retries rather than full automation.

Evidence Ref: Abstract, Sec.4.1, Fig.3, Sec.4.2

Function calling generally outperforms ReAct across models and tasks.

Numbers: Function-calling cells show higher pass@1 in heatmaps (Fig.3).

Practical Use: Prefer function-calling agent prompts when reliable tool selection and argument fidelity matter.

Evidence Ref: Sec.4.1, Fig.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Acceptable pass@1 (best on TO)	70.8% (GPT-4.1, single-agent function calling, summarized memory, thinking tools)	—	—	Requesting Time Off (TO)	Sec.4.1, Sec.4.2	Fig.3, Sec.4.2
Acceptable pass@1 (best on CR)	35.3% (Claude Sonnet 4, single-agent function calling)	—	—	Customer Request Routing (CR)	Sec.4.1, Sec.4.2	Fig.3, Sec.4.2

What To Try In 7 Days

Run AgentArch or a small subset on your own workflows to find model-architecture fits.

Use function-calling prompts first for tool-heavy workflows and compare against ReAct on one task.

Enable thinking tools (math/summarize) for tasks that require calculations or aggregation; measure latency trade-offs.
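
A side-by-side comparison of this kind can be sketched as a small harness that records success rate and latency per configuration. This is a minimal sketch, not the AgentArch runner: the compare function is a hypothetical name, and each entry in runners stands in for your own agent setup (for example a function-calling variant and a ReAct variant) that returns True when a task ends in an acceptable state.

```python
import time
from typing import Callable


def compare(
    runners: dict[str, Callable[[dict], bool]],
    tasks: list[dict],
) -> dict[str, dict[str, float]]:
    """Run every configuration over the same tasks and summarize the results.

    `runners` maps a configuration label to a callable that executes one task
    and returns True on success. Both are placeholders for your own agents.
    """
    if not tasks:
        raise ValueError("at least one task is required")

    report: dict[str, dict[str, float]] = {}
    for label, run in runners.items():
        start = time.perf_counter()
        successes = sum(bool(run(task)) for task in tasks)
        elapsed = time.perf_counter() - start
        report[label] = {
            "success_rate": successes / len(tasks),
            "mean_latency_s": elapsed / len(tasks),
        }
    return report


if __name__ == "__main__":
    # Stub runners so the sketch executes without any model access.
    demo_tasks = [{"needs_math": True}, {"needs_math": False}]
    demo_runners = {
        "function_calling": lambda task: True,
        "react": lambda task: not task["needs_math"],
    }
    for label, stats in compare(demo_runners, demo_tasks).items():
        print(label, stats)
```

Running all configurations on the identical task list keeps the comparison fair; repeating the run several times per task gives a rough view of run-to-run variance as well.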

Agent Features

Memory

complete_memory, summarized_memory

Planning

ReAct, function_calling

Tool Use

function_calling, thinking_tools

Frameworks

ReAct_prompt, function_calling_API

Is Agentic

Yes

Architectures

single_agent, multi_agent, orchestrator_isolated, orchestrator_open_network

Collaboration

orchestrator_mediated, agent_to_agent

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ServiceNow/AgentArch

Risks & Boundaries

Limitations

Only two use cases (60 samples each), which do not cover the breadth of enterprise workflows.

Six models tested; limited open-source and reasoning-model coverage.

When Not To Use

Multimodal workflows (images, PDFs): the benchmark is text-only.

Conversational, user-in-the-loop workflows: the benchmark focuses on autonomous runs.

Failure Modes

Hallucinated tools or agents (especially in multi-agent ReAct).

Wrong tool arguments causing failed side effects despite correct final decision reasoning.
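
Both failure modes can be caught before any side effect runs by checking each proposed call against a registry of known tools. The sketch below is one possible guard, not something taken from the benchmark: ToolSpec, ToolCall, check_call, and the tool names in the demo are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolSpec:
    """Declared interface of one tool (hypothetical structure)."""
    name: str
    required_args: frozenset


@dataclass
class ToolCall:
    """A call proposed by the agent, before execution."""
    name: str
    args: dict = field(default_factory=dict)


def check_call(call: ToolCall, registry: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    spec = registry.get(call.name)
    if spec is None:
        # The agent named a tool that does not exist.
        return [f"hallucinated tool: {call.name}"]

    problems = []
    supplied = set(call.args)
    for arg in sorted(spec.required_args - supplied):
        problems.append(f"missing argument: {arg}")
    for arg in sorted(supplied - spec.required_args):
        problems.append(f"unexpected argument: {arg}")
    return problems


if __name__ == "__main__":
    # Hypothetical tool for a time-off workflow.
    registry = {
        "submit_leave_request": ToolSpec(
            "submit_leave_request",
            frozenset({"employee_id", "start_date", "end_date"}),
        )
    }
    print(check_call(ToolCall("approve_everything"), registry))
    print(check_call(ToolCall("submit_leave_request", {"employee_id": "E1"}), registry))
```

This only validates names and argument presence. Checking argument values (dates, IDs) would need workflow-specific rules on top.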

Core Entities

Models

GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini, LLaMA 3.3 70B, Claude Sonnet 4

Metrics

Acceptable Score (tools + args + outcome), Acceptable pass@1, pass^k (all k trials succeed), Hallucination rate, Tool repetition rate, Missing required tool rate, Correct final decision rate
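
The two pass metrics can be computed from a table of per-sample trial outcomes. The sketch below makes three assumptions that the summary does not confirm: a trial counts as acceptable only when tools, arguments, and outcome are all correct; pass@1 is the mean single-trial success rate; and pass^k is the fraction of samples whose first k trials all succeed. The function names are illustrative.

```python
from typing import Sequence


def acceptable(tools_ok: bool, args_ok: bool, outcome_ok: bool) -> bool:
    """ASSUMPTION: a trial is acceptable only if all three checks pass."""
    return tools_ok and args_ok and outcome_ok


def pass_at_1(trials: Sequence[Sequence[bool]]) -> float:
    """Mean success rate over every trial of every sample."""
    outcomes = [ok for sample in trials for ok in sample]
    if not outcomes:
        raise ValueError("no trials recorded")
    return sum(outcomes) / len(outcomes)


def pass_hat_k(trials: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of samples whose first k trials all succeed."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not trials or any(len(sample) < k for sample in trials):
        raise ValueError("every sample needs at least k trials")
    return sum(all(sample[:k]) for sample in trials) / len(trials)


if __name__ == "__main__":
    # Four samples, three trials each (made-up outcomes for illustration).
    demo = [
        [True, True, True],
        [True, False, True],
        [False, False, False],
        [True, True, True],
    ]
    print(pass_at_1(demo))      # 8 of 12 trials succeed
    print(pass_hat_k(demo, 3))  # 2 of 4 samples succeed on all three trials
```

Because pass^k requires every one of the k trials to succeed, it can fall well below pass@1 even when individual runs often succeed, which is why it serves as a reliability measure.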

Datasets

Requesting Time Off (TO) - 60 samples, Customer Request Routing (CR) - 60 samples

Benchmarks

AgentArch

Context Entities

Datasets

Mock enterprise data with long KB articles and messy JSON tool outputs
