BOLAA: orchestrating specialist LLM agents with a controller improves web navigation and reasoning on standard benchmarks

August 11, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

9

Authors

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese

Why It Matters For Business

Splitting complex agent work into small, specialist LLMs coordinated by a controller can match or beat large single LLM agents and reduce compute cost by enabling smaller models to specialize.

Summary TLDR

This paper builds and compares five single-agent LAA (LLM-augmented autonomous agent) designs and introduces BOLAA, a controller that orchestrates multiple specialist agents. The authors evaluate on WebShop (900 web-shopping tasks) and HotPotQA (300 multi-hop QA tasks) across many LLM backbones (open-source and OpenAI). Key findings: BOLAA yields the highest WebShop rewards and recall; ReAct (few-shot reasoning plus action) works best on HotPotQA; the pairing of architecture and LLM matters more than context length alone; planning helps some open-source LLMs on web tasks but can hurt knowledge reasoning. Code is released.

Problem Statement

Design choices for LLM-based autonomous agents are under-explored. Specifically, we lack systematic comparisons of (1) agent architectures, (2) LLM backbones paired with those architectures, and (3) methods to orchestrate multiple specialist agents for complex, multi-step tasks.

Main Contribution

Defines and implements six LAA architectures: ZS-LAA, ZST-LAA, ReAct, PlanAct, PlanReAct, and BOLAA (controller + labor agents).

Assembles a large empirical benchmark across WebShop (900 tasks) and HotPotQA (300 questions) covering many LLMs (open-source and OpenAI).

Shows BOLAA (separate search/click agents + controller) improves web navigation recall and reward across LLMs and releases code at github.com/salesforce/BOLAA.

Key Findings

Orchestrating specialist agents (BOLAA) gives the best WebShop performance across many LLMs.

Numbers: gpt-3.5-turbo BOLAA reward=0.6567 vs ZS=0.5061 (Table 1)

Few-shot ReAct agents perform best on multi-hop knowledge reasoning (HotPotQA).

Numbers: text-davinci-003 ReAct reward=0.4503 vs ZS=0.3430 (Table 3)

Powerful API LLMs can achieve strong agent behavior even with simple zero-shot agents.

Numbers: text-davinci-003 ZS reward=0.5292; gpt-3.5-turbo ZS=0.5061 (Table 1)

Planning flows help some open-source LLMs on web tasks but hurt knowledge-reasoning tasks.

Numbers: llama-2-13b PlanAct WebShop reward=0.4892 vs ZS=0.0662; llama-2-70b HotPotQA PlanAct=0.1424 vs ZS=0.2809 (Tables 1, 3)

Longer context length alone does not guarantee better agent performance and may increase hallucination.

Numbers: longchat-13b-16k BOLAA reward=0.3205 vs vicuna-13b (2k) BOLAA=0.5350 (Table 1); authors note more hallucinations with >4

Results

WebShop average reward (best reported)

Value: 0.6567 (gpt-3.5-turbo with BOLAA)

Baseline: 0.5061 (gpt-3.5-turbo with ZS-LAA)

HotPotQA average reward (best reported)

Value: 0.4503 (text-davinci-003 with ReAct)

Baseline: 0.3430 (text-davinci-003 with ZS-LAA)

WebShop recall (best reported)

Value: 0.4011 (gpt-3.5-turbo-16k with ReAct)

Baseline: 0.3856 (text-davinci-003 with ZS-LAA)
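
To put the value/baseline pairs above on a common scale, the relative gain of each can be computed as (value - baseline) / baseline. The sketch below uses only the numbers listed in this section; the dictionary keys and the `relative_gain` helper are illustrative names, not anything from the paper. Note that the recall pair compares different models and architectures, so it is not a like-for-like ablation.

```python
# Value / baseline pairs copied from the Results section above.
results = {
    "WebShop reward": (0.6567, 0.5061),   # gpt-3.5-turbo: BOLAA vs ZS-LAA
    "HotPotQA reward": (0.4503, 0.3430),  # text-davinci-003: ReAct vs ZS-LAA
    "WebShop recall": (0.4011, 0.3856),   # different models on each side
}


def relative_gain(value: float, baseline: float) -> float:
    """Fractional improvement of value over baseline."""
    return (value - baseline) / baseline


gains = {name: relative_gain(v, b) for name, (v, b) in results.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

Under these numbers the reward gains are roughly 30% on both benchmarks, while the recall gain is about 4%.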

Who Should Care

What To Try In 7 Days

Prototype a simple controller that routes search vs click actions to two small fine-tuned models for your web task and compare recall and final reward.

If you have access to a strong API model, test a zero-shot prompt agent first; it may already be near-best.

On internal open-source models, add an explicit planning step and measure gains on multi-step action tasks, but skip it for retrieval/QA pipelines.

Agent Features

Memory

  • agent memory of observations/actions/plans
  • stored thoughts/plans for retrieval

Planning

  • explicit plan-before-action (PlanAct)
  • self-think / Chain-of-Thought (ZST/PlanReAct)
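
The difference between plan-before-action and interleaved reasoning can be sketched as two control loops. This is a minimal illustration, not the paper's implementation: the `LLM` callable type, the prompt wording, and both function names are assumptions, and any real model call would replace the stub passed in as `llm`.

```python
from typing import Callable, List

# Hypothetical interface: prompt string in, completion string out.
LLM = Callable[[str], str]


def plan_act_episode(llm: LLM, task: str, observations: List[str]) -> List[str]:
    """Generate one plan up front, then condition every action on it."""
    plan = llm(f"Task: {task}\nWrite a step-by-step plan.")
    actions = []
    for obs in observations:
        prompt = f"Task: {task}\nPlan: {plan}\nObservation: {obs}\nAction:"
        actions.append(llm(prompt))
    return actions


def react_episode(
    llm: LLM, task: str, observations: List[str], examples: str
) -> List[str]:
    """Produce a thought and then an action at each step, after few-shot examples."""
    actions = []
    trace = ""  # running memory of earlier observations, thoughts, and actions
    for obs in observations:
        context = f"{examples}\nTask: {task}{trace}\nObservation: {obs}"
        thought = llm(f"{context}\nThought:")
        action = llm(f"{context}\nThought: {thought}\nAction:")
        trace += f"\nObservation: {obs}\nThought: {thought}\nAction: {action}"
        actions.append(action)
    return actions
```

In this sketch a PlanAct episode costs one LLM call for the plan plus one per step, while a ReAct episode costs two per step. A wrong plan is carried into every later prompt, which is one way to read the plan-hallucination failure mode listed further down.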

Tool Use

  • API calls (Wikipedia API in HotPotQA)
  • search and click action primitives

Frameworks

  • ReAct
  • LangChain
  • BOLAA (this work)

Is Agentic

true

Architectures

  • ZS-LAA
  • ZST-LAA
  • ReAct
  • PlanAct
  • PlanReAct
  • BOLAA

Collaboration

  • controller selects and mediates between labor agents (BOLAA)
  • specialist agents for different action types (search, click)
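
The controller-plus-labor-agent pattern described above can be sketched in a few lines. Everything here is a hypothetical illustration: the `Controller` class, the `select_by_page` rule, the `[search page]` marker, and the `search[...]` / `click[...]` action strings are assumptions chosen for the example, and the two agent functions stand in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interface: a labor agent maps (task, observation) to an action.
LaborAgent = Callable[[str, str], str]


@dataclass
class Controller:
    """Routes each step to one specialist labor agent and logs the result."""

    agents: Dict[str, LaborAgent]
    select: Callable[[str], str]  # handcrafted rule: observation -> agent name
    history: List[str] = field(default_factory=list)

    def step(self, task: str, observation: str) -> str:
        name = self.select(observation)
        if name not in self.agents:
            raise KeyError(f"no labor agent registered for {name!r}")
        action = self.agents[name](task, observation)
        self.history.append(f"{name}: {action}")
        return action


def search_agent(task: str, observation: str) -> str:
    # Stand-in for an LLM that writes a search query from the task.
    return f"search[{task}]"


def click_agent(task: str, observation: str) -> str:
    # Stand-in for an LLM that picks a clickable item; here, the first one.
    first_item = observation.split("|")[0].strip()
    return f"click[{first_item}]"


def select_by_page(observation: str) -> str:
    # Assumed rule: search pages go to the search agent, all others to click.
    return "search" if observation.startswith("[search page]") else "click"


controller = Controller(
    agents={"search": search_agent, "click": click_agent},
    select=select_by_page,
)
```

Calling `controller.step("red shoes", "[search page]")` routes to the search agent, and a later call with a results page routes to the click agent. Keeping the selection rule as a plain function makes the controller-misrouting failure mode easy to test in isolation.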

Optimization Features

Token Efficiency

  • noted context length trade-offs; longer context can increase hallucination

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • BOLAA is evaluated primarily on web navigation, not on environments with tightly coupled, compounding actions.
  • Controller selection logic is handcrafted; autonomous controller behavior is left to future work.
  • Hallucination and error compounding increase when agents run for many steps or with longer context.

When Not To Use

  • Knowledge-heavy multi-hop QA where few-shot ReAct outperforms planning-based orchestration
  • Environments with tightly coupled actions where splitting into independent labor agents is infeasible
  • Settings where a single strong API LLM is already available and latency/cost trade-offs favor one model

Failure Modes

  • Controller misrouting: wrong labor agent chosen leads to invalid actions.
  • Plan hallucination: pre-generated plans can mislead downstream actions in reasoning tasks.
  • Error compounding: small mistakes early in long runs lead to cascading failures.

Core Entities

Models

  • fastchat-t5-3b
  • vicuna-7b
  • vicuna-13b
  • vicuna-33b
  • llama-2-7b
  • llama-2-13b
  • llama-2-70b
  • mpt-7b-instruct
  • mpt-30b-instruct
  • xgen-8k-7b-instruct
  • longchat-7b-16k
  • longchat-13b-16k
  • text-davinci-003
  • gpt-3.5-turbo
  • gpt-3.5-turbo-16k

Metrics

  • Reward (WebShop: attribute overlap; HotPotQA: F1)
  • Recall (WebShop: ground-truth retrieval rate)
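
A simplified sketch of the three metrics follows. These are assumptions about the general shape of each metric, not the benchmarks' official scoring code: the official WebShop reward and HotPotQA F1 apply additional matching and normalization rules that are omitted here, and all function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Set


def attribute_overlap(bought: Set[str], target: Set[str]) -> float:
    """Share of the target item's attributes present on the purchased item."""
    return len(bought & target) / len(target) if target else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and the reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(retrieved_flags: Iterable[bool]) -> float:
    """Fraction of episodes in which the ground-truth item was retrieved."""
    flags = list(retrieved_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

For example, `token_f1("the president Obama", "Barack Obama")` has one overlapping token, giving precision 1/3, recall 1/2, and F1 0.4.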

Datasets

  • WebShop (900 sampled tasks)
  • HotPotQA (300 sampled questions)

Benchmarks

  • WebShop benchmark (attribute-overlap reward, recall)
  • HotPotQA benchmark (F1 reward)