Overview
The experiments cover many LLMs and two realistic environments, giving moderate confidence that BOLAA helps web navigation; results are less conclusive for complex multi-action environments and require more real-world tests.
Citations: 9
Evidence Strength: 0.78
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Splitting complex agent work across small specialist LLMs coordinated by a controller can match or beat a single large-LLM agent, and it can reduce compute cost because each smaller model only has to handle a narrow subtask.
Who Should Care
Summary TLDR
This paper builds and compares five single-agent LAA (LLM-augmented autonomous agent) designs and introduces BOLAA, a controller that orchestrates multiple specialist agents. The authors evaluate on WebShop (900 web-shopping tasks) and HotPotQA (300 multi-hop QA tasks) across many LLM backbones (open-source and OpenAI). Key findings: BOLAA yields the highest WebShop rewards and recall; ReAct (few-shot reasoning plus action) works best on HotPotQA; the pairing of architecture and LLM matters more than context length alone; planning helps some open-source LLMs on web tasks but can hurt knowledge reasoning. Code is released.
Problem Statement
Design choices for LLM-based autonomous agents are under-explored. Specifically, we lack systematic comparisons of (1) agent architectures, (2) LLM backbones paired with those architectures, and (3) methods to orchestrate multiple specialist agents for complex, multi-step tasks.
Main Contribution
Defines and implements six LAA architectures: ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, and BOLAA (controller + labor agents).
Assembles a large empirical benchmark across WebShop (900 tasks) and HotPotQA (300 questions) covering many LLMs (open-source and OpenAI).
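The controller-plus-labor-agents idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Controller` class, the `"[search]"` routing rule, and the two stub agents are hypothetical, and real labor agents would call an LLM rather than return fixed strings. The `search[...]` and `click[...]` strings follow the WebShop action format.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A labor agent maps an observation string to an action string.
LaborAgent = Callable[[str], str]


@dataclass
class Controller:
    """Hypothetical controller that picks one labor agent per step."""
    agents: Dict[str, LaborAgent]

    def select(self, observation: str) -> str:
        # Assumed handcrafted rule: use the search agent when the page
        # exposes a search box, otherwise use the click agent.
        return "search" if "[search]" in observation.lower() else "click"

    def step(self, observation: str) -> str:
        # Delegate action generation to the selected labor agent.
        return self.agents[self.select(observation)](observation)


def search_agent(observation: str) -> str:
    # Stub: a real agent would prompt an LLM to write the query.
    return "search[red running shoes]"


def click_agent(observation: str) -> str:
    # Stub: a real agent would prompt an LLM to choose a clickable item.
    return "click[buy now]"


controller = Controller({"search": search_agent, "click": click_agent})
```

Calling `controller.step(...)` once per environment observation gives a loop in which each labor agent only ever sees the pages it specializes in.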
Key Findings
Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.
Few-shot ReAct agents perform best on multi-hop knowledge reasoning (HotPotQA).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| WebShop average reward (best reported) | 0.6567 (gpt-3.5-turbo with BOLAA) | 0.5061 (gpt-3.5-turbo with ZS-LAA) | +0.1506 | WebShop overall | Table 1 reports gpt-3.5-turbo BOLAA=0.6567 and ZS=0.5061 | Table 1 |
| HotPotQA average reward (best reported) | 0.4503 (text-davinci-003 with ReAct) | 0.3430 (text-davinci-003 with ZS-LAA) | +0.1073 | HotPotQA overall | Table 3 reports text-davinci-003 ReAct=0.4503 and ZS=0.3430 | Table 3 |
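The Delta column can be re-derived from the Value and Baseline columns. The short check below does that and also computes a relative gain; the relative percentages are derived here for convenience and are not reported in the source. The function name `delta_and_relative` is illustrative.

```python
# (value, baseline) pairs copied from the Results table.
results = {
    "WebShop, gpt-3.5-turbo, BOLAA vs ZS-LAA": (0.6567, 0.5061),
    "HotPotQA, text-davinci-003, ReAct vs ZS-LAA": (0.4503, 0.3430),
}


def delta_and_relative(value: float, baseline: float) -> tuple:
    """Return (absolute delta, relative gain in percent), both rounded."""
    delta = value - baseline
    return round(delta, 4), round(100 * delta / baseline, 1)


for name, (value, baseline) in results.items():
    delta, rel = delta_and_relative(value, baseline)
    print(f"{name}: delta=+{delta:.4f}, relative=+{rel:.1f}%")
```

Both absolute deltas match the table (+0.1506 and +0.1073), which corresponds to roughly a 30% relative improvement over the zero-shot baseline in each case.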
What To Try In 7 Days
Prototype a simple controller that routes search vs click actions to two small fine-tuned models for your web task and compare recall and final reward.
If you have access to a strong API model, test a zero-shot prompt agent first; it may already be near-best.
On internal open-source models, add an explicit planning step and measure gains on multi-step action tasks, but skip it for retrieval/QA pipelines.
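An explicit planning step of the kind suggested above can be prototyped in a few lines. The sketch below is a generic plan-then-act loop, not the paper's PlanAct prompt: the prompt wording, the `plan_then_act` function, and `fake_llm` are all assumptions, and `fake_llm` stands in for whatever model call you actually use.

```python
from typing import Callable, List

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def plan_then_act(llm: LLM, task: str, observations: List[str]) -> List[str]:
    """Generate one plan up front, then condition every action on it."""
    plan = llm(f"Task: {task}\nWrite a short step-by-step plan.")
    actions: List[str] = []
    for obs in observations:
        prompt = (
            f"Task: {task}\nPlan: {plan}\n"
            f"Observation: {obs}\nPrevious actions: {actions}\nNext action:"
        )
        actions.append(llm(prompt))
    return actions


def fake_llm(prompt: str) -> str:
    # Deterministic stand-in for a model, used only to exercise the loop.
    if "step-by-step plan" in prompt:
        return "1) search for the item 2) click the best match"
    if "Observation: search page" in prompt:
        return "search[usb cable]"
    return "click[first result]"


actions = plan_then_act(fake_llm, "buy a usb cable", ["search page", "results page"])
```

To measure the effect of planning, run the same tasks with and without the `Plan:` line in the prompt and compare the average reward.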
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Risks & Boundaries
Limitations
BOLAA evaluated primarily on web navigation and not on environments with tightly coupled, compounding actions.
Controller selection logic is handcrafted; autonomous controller behavior is left to future work.
When Not To Use
Knowledge-heavy multi-hop QA where few-shot ReAct outperforms planning-based orchestration.
Environments with tightly coupled actions where splitting into independent labor agents is infeasible.
Failure Modes
Controller misrouting: wrong labor agent chosen leads to invalid actions.
Plan hallucination: pre-generated plans can mislead downstream actions in reasoning tasks.
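One possible mitigation for controller misrouting is to validate each proposed action before sending it to the environment. The guard below is a hedged sketch and is not described in the source: `validate_action`, its parameters, and the error messages are hypothetical, and it assumes WebShop-style `search[...]` and `click[...]` action strings.

```python
import re
from typing import Optional, Set

# Matches actions such as "search[shoes]" or "click[buy now]".
ACTION_RE = re.compile(r"^(search|click)\[(.+)\]$")


def validate_action(
    action: str, allowed_kinds: Set[str], clickables: Set[str]
) -> Optional[str]:
    """Return None if the action is valid, otherwise a short reason."""
    match = ACTION_RE.match(action.strip())
    if match is None:
        return "malformed action"
    kind, arg = match.groups()
    if kind not in allowed_kinds:
        # Typical symptom of misrouting: the wrong labor agent answered.
        return f"{kind} not available on this page"
    if kind == "click" and arg.lower() not in clickables:
        return f"unknown click target: {arg}"
    return None
```

When the guard returns a reason, the caller can re-route to the other labor agent or retry with the reason appended to the prompt, instead of spending an environment step on an invalid action.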

