BOLAA: orchestrating specialist LLM agents with a controller improves web navigation and reasoning on standard benchmarks

Overview

Decision Snapshot: Ready For Pilot

The experiments cover many LLMs and two realistic environments, giving moderate confidence that BOLAA helps web navigation; results are less conclusive for complex multi-action environments and require more real-world tests.

Citations: 9

Evidence Strength: 0.78

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

Links

Abstract / PDF / Code

Why It Matters For Business

Splitting complex agent work into small, specialist LLMs coordinated by a controller can match or beat large single LLM agents and reduce compute cost by enabling smaller models to specialize.

Who Should Care

Product Manager, ML Engineer, CTO, Founder, Engineering Lead

Summary TLDR

This paper builds and compares five single-agent LAA (LLM-augmented agent) designs and introduces BOLAA, a sixth architecture in which a controller orchestrates multiple specialist agents. The authors evaluate on WebShop (900 web-shopping tasks) and HotPotQA (300 multi-hop QA tasks) across many LLM backbones (open-source and OpenAI). Key findings: BOLAA yields the highest WebShop rewards and recall; ReAct (few-shot reasoning+action) works best on HotPotQA; the pairing of architecture and LLM matters more than context length alone; planning helps some open-source LLMs on web tasks but can hurt knowledge reasoning. Code is released.

Problem Statement

Design choices for LLM-based autonomous agents are under-explored. Specifically, we lack systematic comparisons of (1) agent architectures, (2) LLM backbones paired with those architectures, and (3) methods to orchestrate multiple specialist agents for complex, multi-step tasks.

Main Contribution

Defines and implements six LAA architectures: ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, and BOLAA (controller + labor agents).

Assembles a large empirical benchmark across WebShop (900 tasks) and HotPotQA (300 questions) covering many LLMs (open-source and OpenAI).

Key Findings

Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.

Numbers: gpt-3.5-turbo BOLAA reward = 0.6567 vs ZS-LAA = 0.5061 (Table 1)

Practical Use: For web navigation tasks, split search and click into specialist agents managed by a small controller to raise retrieval accuracy and final task reward.

Evidence Ref: Table 1, WebShop rewards
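
The controller-plus-specialists idea in this finding can be illustrated with a minimal sketch. It assumes a handcrafted routing rule (the summary notes the controller logic is handcrafted) and stub agents in place of LLM calls. The names `LaborAgent`, `Controller`, and the `[search]` keyword check are hypothetical and are not taken from the BOLAA codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LaborAgent:
    """A specialist that maps an observation to one action string (hypothetical type)."""
    name: str
    act: Callable[[str], str]


@dataclass
class Controller:
    """Routes each observation to one labor agent and records what happened."""
    agents: Dict[str, LaborAgent]
    history: List[str] = field(default_factory=list)

    def select(self, observation: str) -> LaborAgent:
        # Assumed handcrafted rule: pages exposing a search box go to the
        # search agent; every other page goes to the click agent.
        key = "search" if "[search]" in observation.lower() else "click"
        return self.agents[key]

    def step(self, observation: str) -> str:
        agent = self.select(observation)
        action = agent.act(observation)
        self.history.append(f"{agent.name}: {action}")
        return action


# Stub specialists; in practice each would wrap its own (possibly small) LLM.
def search_agent(observation: str) -> str:
    return "search[red running shoes]"


def click_agent(observation: str) -> str:
    return "click[first result]"


controller = Controller(agents={
    "search": LaborAgent("search", search_agent),
    "click": LaborAgent("click", click_agent),
})
```

Calling `controller.step("Home page [Search]")` dispatches to the search stub, while an observation without the search marker goes to the click stub. Swapping the rule in `select` for a learned or LLM-based selector is the part the summary flags as future work.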

Few-shot ReAct agents perform best on multi-hop knowledge reasoning (HotPotQA).

Numbers: text-davinci-003 ReAct reward = 0.4503 vs ZS-LAA = 0.3430 (Table 3)

Practical Use: Use few-shot reasoning+action prompting (ReAct) when solving multi-step QA that needs contextualized retrieval.

Evidence Ref: Table 3, HotPotQA rewards
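
For readers unfamiliar with the reasoning+action pattern, the loop below is a minimal sketch of a ReAct-style agent: it alternates a thought, an action, and an observation until the model issues a `finish` action. The `react_loop` function, the `(thought, action, argument)` return shape, and the scripted stub model are assumptions made for illustration; a real agent would parse these fields out of LLM text and call a real retrieval tool.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], Tuple[str, str, str]]  # prompt -> (thought, action, argument)


def react_loop(question: str, llm: LLM, tools: Dict[str, Callable[[str], str]],
               few_shot: str = "", max_steps: int = 5) -> Tuple[str, List[str]]:
    """Run thought/action/observation steps until 'finish' or the step limit."""
    transcript: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        # Few-shot demonstrations, if any, are prepended to the running transcript.
        thought, action, argument = llm(few_shot + "\n".join(transcript))
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, transcript
        if action in tools:
            observation = tools[action](argument)
        else:
            observation = f"invalid action '{action}'"
        transcript.append(f"Observation: {observation}")
    return "", transcript  # no answer within the step budget


# Scripted stand-in for an LLM: search once, then answer.
def scripted_llm(prompt: str) -> Tuple[str, str, str]:
    if "Observation:" not in prompt:
        return ("I need to look this up.", "search", "Eiffel Tower")
    return ("The observation names the city.", "finish", "Paris")


tools = {"search": lambda query: f"{query} is a landmark in Paris."}
answer, transcript = react_loop("Where is the Eiffel Tower?", scripted_llm, tools)
```

The transcript doubles as the agent's memory: each new prompt contains every earlier thought, action, and observation, which is what lets the model condition its next retrieval on what it has already seen.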

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
WebShop average reward (best reported)	0.6567 (gpt-3.5-turbo with BOLAA)	0.5061 (gpt-3.5-turbo with ZS-LAA)	+0.1506	WebShop overall	Table 1 reports gpt-3.5-turbo BOLAA=0.6567 and ZS=0.5061	Table 1
HotPotQA average reward (best reported)	0.4503 (text-davinci-003 with ReAct)	0.3430 (text-davinci-003 with ZS-LAA)	+0.1073	HotPotQA overall	Table 3 reports text-davinci-003 ReAct=0.4503 and ZS=0.3430	Table 3
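
The Delta column is a plain difference between value and baseline. The short sketch below recomputes it from the two rows above and also derives a relative improvement, which the table does not report; the `delta` helper is a hypothetical name used only here.

```python
from typing import Dict, Tuple

# (value, baseline) pairs copied from the Results table.
rows: Dict[str, Tuple[float, float]] = {
    "WebShop": (0.6567, 0.5061),
    "HotPotQA": (0.4503, 0.3430),
}


def delta(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, relative improvement in percent)."""
    absolute = round(value - baseline, 4)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


for name, (value, baseline) in rows.items():
    absolute, relative = delta(value, baseline)
    print(f"{name}: +{absolute:.4f} absolute, +{relative:.1f}% relative")
```

Under this calculation the absolute deltas match the table (+0.1506 and +0.1073), and the relative gains over the zero-shot baseline come out at roughly 29.8% and 31.3%.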

What To Try In 7 Days

Prototype a simple controller that routes search vs click actions to two small fine-tuned models for your web task and compare recall and final reward.

If you have access to a strong API model, test a zero-shot prompt agent first; it may already be near-best.

On internal open-source models, add an explicit planning step and measure gains on multi-step action tasks, but skip it for retrieval/QA pipelines.
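
The explicit planning step in the last item can be sketched as a plan-before-action loop: the plan is generated once up front and then passed into every action prompt. This is an illustrative sketch, not the paper's implementation; `plan_then_act` and the stub planner, actor, and environment are hypothetical stand-ins for LLM calls and a real web environment.

```python
from typing import Callable, List, Tuple

Memory = List[Tuple[str, str]]  # (observation, action) pairs


def plan_then_act(task: str,
                  planner: Callable[[str], str],
                  actor: Callable[[str, str, Memory, str], str],
                  env_step: Callable[[str], Tuple[str, bool]],
                  max_steps: int = 5) -> Tuple[str, Memory]:
    """Generate a plan once, then act step by step with the plan in context."""
    plan = planner(task)  # produced before any action is taken
    memory: Memory = []
    observation = task
    for _ in range(max_steps):
        action = actor(task, plan, memory, observation)
        memory.append((observation, action))
        observation, done = env_step(action)
        if done:
            break
    return plan, memory


# Stubs standing in for the LLM planner, the LLM actor, and the environment.
def stub_planner(task: str) -> str:
    return "1) search for the item 2) open the best match 3) buy"


def stub_actor(task: str, plan: str, memory: Memory, observation: str) -> str:
    steps = ["search[usb-c cable]", "click[item 1]", "click[buy now]"]
    return steps[min(len(memory), len(steps) - 1)]


def stub_env(action: str) -> Tuple[str, bool]:
    return f"page after {action}", action == "click[buy now]"


plan, memory = plan_then_act("buy a usb-c cable", stub_planner, stub_actor, stub_env)
```

To measure the effect of planning, run the same tasks with `planner` returning an empty string and compare rewards. Because the plan is fixed before any observation arrives, a wrong plan can mislead every later step, which is the plan-hallucination risk listed under Failure Modes.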

Agent Features

Memory

agent memory of observations/actions/plans

stored thoughts/plans for retrieval

Planning

explicit plan-before-action (PlanAct)

self-think / Chain-of-Thought (ZST/PlanReAct)

Tool Use

API calls (Wikipedia API in HotPotQA)

search and click action primitives

Frameworks

ReAct, LangChain, BOLAA (this work)

Is Agentic

Yes

Architectures

ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, BOLAA

Collaboration

controller selects and mediates between labor agents (BOLAA)

specialist agents for different action types (search, click)

Optimization Features

Token Efficiency

noted context length trade-offs; longer context can increase hallucination

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/salesforce/BOLAA

Risks & Boundaries

Limitations

BOLAA is evaluated primarily on web navigation, not on environments with tightly coupled, compounding actions.

Controller selection logic is handcrafted; autonomous controller behavior is left to future work.

When Not To Use

Knowledge-heavy multi-hop QA where few-shot ReAct outperforms planning-based orchestration

Environments with tightly coupled actions where splitting into independent labor agents is infeasible

Failure Modes

Controller misrouting: wrong labor agent chosen leads to invalid actions.

Plan hallucination: pre-generated plans can mislead downstream actions in reasoning tasks.

Core Entities

Models

fastchat-t5-3b, vicuna-7b, vicuna-13b, vicuna-33b, llama-2-7b, llama-2-13b, llama-2-70b, mpt-7b-instruct, mpt-30b-instruct, xgen-8k-7b-instruct, longchat-7b-16k, longchat-13b-16k, text-davinci-003, gpt-3.5-turbo, gpt-3.5-turbo-16k

Metrics

Reward (WebShop: attribute overlap; HotPotQA: F1)

Recall (WebShop: ground-truth retrieval rate)
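
As a rough illustration of the two metric types, the sketch below computes a token-level F1 between a predicted and a reference answer, and a recall defined as the fraction of tasks in which the ground-truth item was retrieved. Both definitions are assumptions: the summary does not specify the paper's answer normalization or its exact recall formula, and `token_f1` and `retrieval_recall` are hypothetical helper names.

```python
from collections import Counter
from typing import List


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 after lowercasing and whitespace splitting (assumed normalization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(retrieved: List[bool]) -> float:
    """Fraction of tasks where the ground-truth item was retrieved (assumed definition)."""
    return sum(retrieved) / len(retrieved) if retrieved else 0.0
```

For example, `token_f1("Paris France", "Paris")` has precision 0.5 and recall 1.0, giving an F1 of 2/3, and `retrieval_recall([True, False, True, True])` gives 0.75.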

Datasets

WebShop (900 sampled tasks)

HotPotQA (300 sampled questions)

Benchmarks

WebShop benchmark (attribute-overlap reward, recall)

HotPotQA benchmark (F1 reward)
