BOLAA: orchestrating specialist LLM agents with a controller improves web navigation and reasoning on standard benchmarks

August 11, 2023 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The experiments cover many LLMs and two realistic environments, giving moderate confidence that BOLAA helps web navigation; results are less conclusive for complex multi-action environments and require more real-world tests.

Citations: 9

Evidence Strength: 0.78

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

Why It Matters For Business

Splitting complex agent work into small, specialist LLMs coordinated by a controller can match or beat large single LLM agents and reduce compute cost by enabling smaller models to specialize.

Summary TLDR

This paper builds and compares five single-agent LAA (LLM-augmented autonomous agent) designs and introduces BOLAA, a sixth architecture in which a controller orchestrates multiple specialist agents. The authors evaluate on WebShop (900 web-shopping tasks) and HotPotQA (300 multi-hop QA tasks) across many LLM backbones (open-source and OpenAI). Key findings: BOLAA yields the highest WebShop rewards and recall; ReAct (few-shot reasoning + action) works best on HotPotQA; the pairing of architecture and LLM matters more than context length alone; planning helps some open-source LLMs on web tasks but can hurt knowledge reasoning. Code is released.

Problem Statement

Design choices for LLM-based autonomous agents are under-explored. Specifically, we lack systematic comparisons of (1) agent architectures, (2) LLM backbones paired with those architectures, and (3) methods to orchestrate multiple specialist agents for complex, multi-step tasks.

Main Contribution

Defines and implements six LAA architectures: ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, and BOLAA (controller + labor agents).

Assembles a large empirical benchmark across WebShop (900 tasks) and HotPotQA (300 questions) covering many LLMs (open-source and OpenAI).

Key Findings

Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.

Numbers: gpt-3.5-turbo BOLAA reward=0.6567 vs ZS=0.5061 (Table 1)

Practical Use: For web navigation tasks, split search and click into specialist agents managed by a small controller to raise retrieval accuracy and final task reward.

Evidence Ref: Table 1 WebShop rewards
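
The controller-plus-specialists idea above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names LaborAgent and Controller, the keyword-based routing rule, and the stand-in agent functions are hypothetical and are not taken from the released BOLAA code. In practice each agent function would wrap a call to a small LLM.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LaborAgent:
    """Hypothetical specialist: maps (task, observation) to one action string."""
    name: str
    act: Callable[[str, str], str]


@dataclass
class Controller:
    """Hypothetical controller with a handcrafted routing rule."""
    agents: Dict[str, LaborAgent]
    history: List[str] = field(default_factory=list)

    def select(self, observation: str) -> LaborAgent:
        # Assumed rule: a page exposing a "[Search]" element goes to the
        # search agent; every other page goes to the click agent.
        key = "search" if "[search]" in observation.lower() else "click"
        return self.agents[key]

    def step(self, task: str, observation: str) -> str:
        agent = self.select(observation)
        action = agent.act(task, observation)
        self.history.append(f"{agent.name}: {action}")
        return action


def search_agent(task: str, observation: str) -> str:
    # Stand-in for a small model that writes search queries.
    return f"search[{task}]"


def click_agent(task: str, observation: str) -> str:
    # Stand-in for a small model that picks a clickable element;
    # here it simply takes the first bracketed option on the page.
    options = re.findall(r"\[([^\]]+)\]", observation)
    return f"click[{options[0]}]" if options else "click[back to search]"


controller = Controller(
    agents={
        "search": LaborAgent("search_agent", search_agent),
        "click": LaborAgent("click_agent", click_agent),
    }
)
print(controller.step("red running shoes", "Home page with a [Search] bar"))
print(controller.step("red running shoes", "Results: [B01 Red Runner] [B02 Blue Runner]"))
```

Keeping the routing rule in one small method makes it easy to log which agent handled each step, which helps when diagnosing misrouted actions.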

Few-shot ReAct agents perform best on multi-hop knowledge reasoning (HotPotQA).

Numbers: text-davinci-003 ReAct reward=0.4503 vs ZS=0.3430 (Table 3)

Practical Use: Use few-shot reasoning+action prompting (ReAct) when solving multi-step QA that needs contextualized retrieval.

Evidence Ref: Table 3 HotPotQA rewards
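
For readers unfamiliar with the pattern, a reasoning+action loop can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: react_loop, scripted_llm, the "Action: name[argument]" output format, and the stub search tool are all assumptions. A real agent would call an LLM and a retrieval API where the stubs are.

```python
import re
from typing import Callable, Dict, List, Optional


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    few_shot: str = "",
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate LLM output (thought + action) with tool observations."""
    prompt = f"{few_shot}\nQuestion: {question}\n"
    for _ in range(max_steps):
        output = llm(prompt)  # assumed format: "Thought: ...\nAction: name[arg]"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", output)
        if match is None:
            prompt += f"{output}\nObservation: could not parse an action.\n"
            continue
        name, arg = match.group(1), match.group(2)
        if name == "finish":
            return arg  # the agent's final answer
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown action '{name}'"
        # Feed the observation back so the next thought can use it.
        prompt += f"{output}\nObservation: {observation}\n"
    return None  # step budget exhausted without a final answer


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Test double that replays canned replies instead of calling a model."""
    it = iter(replies)
    return lambda prompt: next(it)


llm = scripted_llm([
    "Thought: I should look this up.\nAction: search[France]",
    "Thought: The page names the capital.\nAction: finish[Paris]",
])
tools = {"search": lambda query: "France is a country whose capital is Paris."}
print(react_loop("What is the capital of France?", llm, tools))
```

The few_shot argument is where worked thought/action/observation examples would be placed; the loop itself does not change between zero-shot and few-shot use.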

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
WebShop average reward (best reported) | 0.6567 (gpt-3.5-turbo with BOLAA) | 0.5061 (gpt-3.5-turbo with ZS-LAA) | +0.1506 | WebShop overall | Table 1 reports gpt-3.5-turbo BOLAA=0.6567 and ZS=0.5061 | Table 1
HotPotQA average reward (best reported) | 0.4503 (text-davinci-003 with ReAct) | 0.3430 (text-davinci-003 with ZS-LAA) | +0.1073 | HotPotQA overall | Table 3 reports text-davinci-003 ReAct=0.4503 and ZS=0.3430 | Table 3
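
The deltas reported above can be re-derived with a few lines of arithmetic. The relative-improvement percentages printed below are computed here from the reported values and are not figures quoted from the paper; the function name summarize is just a label for this sketch.

```python
# Reported (value, baseline) pairs from the results above.
results = {
    "WebShop (gpt-3.5-turbo, BOLAA vs ZS-LAA)": (0.6567, 0.5061),
    "HotPotQA (text-davinci-003, ReAct vs ZS-LAA)": (0.4503, 0.3430),
}


def summarize(value: float, baseline: float) -> tuple:
    """Return (absolute delta, relative improvement in percent)."""
    delta = round(value - baseline, 4)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return delta, relative_pct


for name, (value, baseline) in results.items():
    delta, relative_pct = summarize(value, baseline)
    print(f"{name}: delta={delta:+.4f}, relative={relative_pct:+.1f}%")
```

Both absolute deltas match the reported +0.1506 and +0.1073, which correspond to gains of roughly 30% over the respective zero-shot baselines.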

What To Try In 7 Days

Prototype a simple controller that routes search vs click actions to two small fine-tuned models for your web task and compare recall and final reward.

If you have access to a strong API model, test a zero-shot prompt agent first; it may already be near-best.

On internal open-source models, add an explicit planning step and measure gains on multi-step action tasks, but skip it for retrieval/QA pipelines.
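
As a starting point for the planning experiment, a plan-before-action agent can be sketched as below. This is a toy under stated assumptions: plan_act, toy_llm, toy_env, and the prompt wording are hypothetical, and the environment is a two-page stand-in rather than WebShop. The point is only the control flow, where a plan is generated once and then included in every action prompt.

```python
from typing import Callable, List, Tuple


def plan_act(
    task: str,
    llm: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, bool]],
    initial_observation: str,
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Generate a plan up front, then act with the plan kept in the prompt."""
    plan = llm(f"Task: {task}\nWrite a short step-by-step plan.")
    observation = initial_observation
    actions: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\nPlan: {plan}\n"
            f"Observation: {observation}\nNext action:"
        )
        action = llm(prompt).strip()
        actions.append(action)
        observation, done = env_step(action)
        if done:
            break
    return plan, actions


def toy_llm(prompt: str) -> str:
    # Stand-in for a real model call; replies depend only on simple cues.
    if "step-by-step plan" in prompt:
        return "1. search for the item 2. click buy"
    if "Observation: home page" in prompt:
        return "search[red shoes]"
    return "click[buy now]"


def toy_env(action: str) -> Tuple[str, bool]:
    # Two-state stand-in environment: searching shows results, clicking ends.
    if action.startswith("search["):
        return "results page", False
    return "order placed", True


plan, actions = plan_act("buy red shoes", toy_llm, toy_env, "home page")
print(plan)
print(actions)
```

To measure the effect of planning, run the same loop with the plan line removed from the prompt and compare rewards on the same task sample.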

Agent Features

Memory
agent memory of observations/actions/plans; stored thoughts/plans for retrieval
Planning
explicit plan-before-action (PlanAct); self-think / Chain-of-Thought (ZST/PlanReAct)
Tool Use
API calls (Wikipedia API in HotPotQA); search and click action primitives
Frameworks
ReAct; LangChain; BOLAA (this work)
Is Agentic

Yes

Architectures
ZS-LAA; ZST-LAA; ReAct; PlanAct; PlanReAct; BOLAA
Collaboration
controller selects and mediates between labor agents (BOLAA); specialist agents for different action types (search, click)

Optimization Features

Token Efficiency
noted context length trade-offs; longer context can increase hallucination

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

BOLAA was evaluated primarily on web navigation, not on environments with tightly coupled, compounding actions.

Controller selection logic is handcrafted; autonomous controller behavior is left to future work.

When Not To Use

Knowledge-heavy multi-hop QA where few-shot ReAct outperforms planning-based orchestration

Environments with tightly coupled actions where splitting into independent labor agents is infeasible

Failure Modes

Controller misrouting: wrong labor agent chosen leads to invalid actions.

Plan hallucination: pre-generated plans can mislead downstream actions in reasoning tasks.

Core Entities

Models

fastchat-t5-3b, vicuna-7b, vicuna-13b, vicuna-33b, llama-2-7b, llama-2-13b, llama-2-70b, mpt-7b-instruct, mpt-30b-instruct, xgen-8k-7b-instruct, longchat-7b-16k, longchat-13b-16k, text-davinci-003, gpt-3.5-turbo, gpt-3.5-turbo-16k

Metrics

Reward (WebShop: attribute overlap; HotPotQA: F1)
Recall (WebShop: ground-truth retrieval rate)
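
Two of these metrics are simple enough to sketch. The code below assumes a token-level F1 with whitespace tokenization and lowercasing only, which is simpler than the answer normalization used by official QA evaluation scripts, and it reads "ground-truth retrieval rate" as the fraction of tasks whose ground-truth item appears among the retrieved items. Both readings, and the function names, are assumptions rather than the paper's exact definitions.

```python
from collections import Counter
from typing import List, Sequence


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(
    retrieved_per_task: Sequence[List[str]], gold_per_task: Sequence[str]
) -> float:
    """Fraction of tasks whose gold item id is among the retrieved ids."""
    if not gold_per_task:
        return 0.0
    hits = sum(
        gold in retrieved
        for retrieved, gold in zip(retrieved_per_task, gold_per_task)
    )
    return hits / len(gold_per_task)


print(token_f1("Paris France", "Paris"))
print(retrieval_recall([["a", "b"], ["c"]], ["b", "d"]))
```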

Datasets

WebShop (900 sampled tasks)
HotPotQA (300 sampled questions)

Benchmarks

WebShop benchmark (attribute-overlap reward, recall)
HotPotQA benchmark (F1 reward)