Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
9
Why It Matters For Business
Splitting complex agent work across small specialist LLMs coordinated by a controller can match or beat a single large-LLM agent, and can reduce compute cost by letting smaller models specialize.
Summary TLDR
This paper builds and compares five single-agent LAA (LLM-augmented Autonomous Agent) designs and introduces BOLAA, a controller that orchestrates multiple specialist agents. The authors evaluate on WebShop (900 web-shopping tasks) and HotPotQA (300 multi-hop QA tasks) across many LLM backbones (open-source and OpenAI). Key findings: BOLAA yields the highest WebShop reward and recall; ReAct (few-shot reasoning plus action) works best on HotPotQA; the pairing of architecture and LLM matters more than context length alone; planning helps some open-source LLMs on web tasks but can hurt knowledge reasoning. Code is released.
Problem Statement
Design choices for LLM-based autonomous agents are under-explored. Specifically, we lack systematic comparisons of (1) agent architectures, (2) LLM backbones paired with those architectures, and (3) methods to orchestrate multiple specialist agents for complex, multi-step tasks.
Main Contribution
Defines and implements six LAA architectures: ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, and BOLAA (controller + labor agents).
Assembles a large empirical benchmark across WebShop (900 tasks) and HotPotQA (300 questions) covering many LLMs (open-source and OpenAI).
Shows BOLAA (separate search/click agents + controller) improves web navigation recall and reward across LLMs and releases code at github.com/salesforce/BOLAA.
Key Findings
Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.
Few-shot ReAct agents perform best on multi-hop knowledge reasoning (HotPotQA).
Powerful API LLMs can achieve strong agent behavior even with simple zero-shot agents.
Planning flows help some open-source LLMs on web tasks but hurt knowledge-reasoning tasks.
Longer context length alone does not guarantee better agent performance and may increase hallucination.
Results
WebShop average reward (best reported)
HotPotQA average reward (best reported)
WebShop recall (best reported)
Who Should Care
What To Try In 7 Days
Prototype a simple controller that routes search vs click actions to two small fine-tuned models for your web task and compare recall and final reward.
If you have access to a strong API model, test a zero-shot prompt agent first; it may already be near-best.
On internal open-source models, add an explicit planning step and measure gains on multi-step action tasks, but skip it for retrieval/QA pipelines.
Agent Features
Memory
- agent memory of observations/actions/plans
- stored thoughts/plans for retrieval
Planning
- explicit plan-before-action (PlanAct)
- self-think / Chain-of-Thought (ZST-LAA, PlanReAct)
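The plan-before-action flow listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm`, `env_step`, the prompt wording, and the stub functions are all hypothetical stand-ins chosen for the example. The agent asks the model for a plan once, then conditions every later action prompt on that plan.

```python
from typing import Callable, List, Tuple

# Hypothetical types: an LLM is any function from prompt to text, and the
# environment step returns (next_observation, done). Neither is a real API.
LLMFn = Callable[[str], str]
EnvStepFn = Callable[[str], Tuple[str, bool]]


def plan_act_episode(
    llm: LLMFn,
    task: str,
    env_step: EnvStepFn,
    initial_obs: str,
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Generate one plan up front, then pick actions conditioned on it."""
    plan = llm(f"Task: {task}\nWrite a short plan:")
    obs = initial_obs
    actions: List[str] = []
    for _ in range(max_steps):
        # The plan is re-injected into every action prompt (assumed format).
        action = llm(
            f"Task: {task}\nPlan: {plan}\nObservation: {obs}\nAction:"
        )
        actions.append(action)
        obs, done = env_step(action)
        if done:
            break
    return plan, actions


# Stubs so the sketch runs without a model or a real environment.
def stub_llm(prompt: str) -> str:
    if "Write a short plan" in prompt:
        return "1. search for the item 2. click buy"
    return "click[Buy Now]"


def stub_env_step(action: str) -> Tuple[str, bool]:
    return "episode finished", action.startswith("click[")


plan, actions = plan_act_episode(
    stub_llm, "buy red shoes", stub_env_step, "search page"
)
```

Because the plan is fixed before any observation arrives, a wrong plan is carried into every later prompt, which is consistent with the plan-hallucination failure mode noted elsewhere in this summary.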
Tool Use
- API calls (Wikipedia API in HotPotQA)
- search and click action primitives
Frameworks
- ReAct
- LangChain
- BOLAA (this work)
Is Agentic
true
Architectures
- ZS-LAA
- ZST-LAA
- ReAct
- PlanAct
- PlanReAct
- BOLAA
Collaboration
- controller selects and mediates between labor agents (BOLAA)
- specialist agents for different action types (search, click)
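The controller-plus-labor-agent pattern described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the BOLAA code: the class names, the prompt format, and the rule that routes on a `[search]` marker in the observation are all hypothetical. It does reflect the summary's note that selection logic is handcrafted rather than learned.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: prompt in, action string out.
LLMFn = Callable[[str], str]


@dataclass
class LaborAgent:
    """A specialist agent that only produces one type of action."""
    name: str
    llm: LLMFn

    def act(self, task: str, observation: str) -> str:
        prompt = f"Task: {task}\nObservation: {observation}\nAction:"
        return self.llm(prompt)


@dataclass
class Controller:
    """Routes each step to one labor agent and records what it did."""
    agents: Dict[str, LaborAgent]
    history: List[str] = field(default_factory=list)

    def select(self, observation: str) -> LaborAgent:
        # Handcrafted rule (assumption): pages exposing a search box go to
        # the search agent; every other page goes to the click agent.
        key = "search" if "[search]" in observation.lower() else "click"
        return self.agents[key]

    def step(self, task: str, observation: str) -> str:
        agent = self.select(observation)
        action = agent.act(task, observation)
        self.history.append(f"{agent.name}: {action}")
        return action


# Stub models so the sketch runs without any LLM backend.
controller = Controller(
    agents={
        "search": LaborAgent("search", lambda prompt: "search[red shoes]"),
        "click": LaborAgent("click", lambda prompt: "click[Buy Now]"),
    }
)
first = controller.step("buy red shoes", "Home page with [Search] box")
second = controller.step("buy red shoes", "Results: [item B01] [Buy Now]")
```

Keeping the routing rule in one `select` method makes controller misrouting easy to test in isolation before swapping the stubs for real models.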
Optimization Features
Token Efficiency
- context-length trade-off noted: longer context can increase hallucination
Reproducibility
Code URLs
- github.com/salesforce/BOLAA
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- BOLAA evaluated primarily on web navigation and not on environments with tightly coupled, compounding actions.
- Controller selection logic is handcrafted; autonomous controller behavior is left to future work.
- Hallucination and error compounding increase when agents run for many steps or with longer context.
When Not To Use
- Knowledge-heavy multi-hop QA where few-shot ReAct outperforms planning-based orchestration
- Environments with tightly coupled actions where splitting into independent labor agents is infeasible
- Settings where a single strong API LLM is already available and latency/cost trade-offs favor one model
Failure Modes
- Controller misrouting: wrong labor agent chosen leads to invalid actions.
- Plan hallucination: pre-generated plans can mislead downstream actions in reasoning tasks.
- Error compounding: small mistakes early in long runs lead to cascading failures.
Core Entities
Models
- fastchat-t5-3b
- vicuna-7b
- vicuna-13b
- vicuna-33b
- llama-2-7b
- llama-2-13b
- llama-2-70b
- mpt-7b-instruct
- mpt-30b-instruct
- xgen-8k-7b-instruct
- longchat-7b-16k
- longchat-13b-16k
- text-davinci-003
- gpt-3.5-turbo
- gpt-3.5-turbo-16k
Metrics
- Reward (WebShop: attribute overlap; HotPotQA: F1)
- Recall (WebShop: ground-truth retrieval rate)
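To make the two metrics concrete, here is a hedged sketch of how they might be computed. The F1 function is a simplified token-overlap F1 (lowercasing and whitespace splitting only; any further answer normalization is omitted as a simplifying assumption), and `retrieval_recall` assumes each task has already been labeled with whether the ground-truth item was retrieved. Function names are illustrative, not taken from the released code.

```python
from collections import Counter
from typing import List


def f1_reward(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a gold answer (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(retrieved_flags: List[bool]) -> float:
    """Fraction of tasks in which the ground-truth item was retrieved."""
    if not retrieved_flags:
        return 0.0
    return sum(retrieved_flags) / len(retrieved_flags)


exact = f1_reward("Barack Obama", "barack obama")
partial = f1_reward("the president Obama", "Barack Obama")
recall = retrieval_recall([True, False, True, True])
```

In the partial-match case, one of three predicted tokens overlaps with one of two gold tokens, so precision is 1/3, recall is 1/2, and F1 is 0.4.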
Datasets
- WebShop (900 sampled tasks)
- HotPotQA (300 sampled questions)
Benchmarks
- WebShop benchmark (attribute-overlap reward, recall)
- HotPotQA benchmark (F1 reward)

