AgentBoard: a 9-task, 1,013-environment benchmark + toolkit that tracks stepwise progress for multi-turn LLM agents

January 24, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

8

Authors

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Why It Matters For Business

AgentBoard gives stepwise progress signals and diagnostic visualizations so teams can see partial improvements, debug grounding/formatting faults, and prioritize model upgrades or targeted fine-tuning instead of chasing binary success.

Summary TLDR

AgentBoard is an open-source benchmark and evaluation toolkit for LLMs acting as agents. It bundles 9 diverse, text-only tasks (1,013 environments) that require multi-turn interaction in partially observable settings, and introduces a human-verified fine-grained progress rate that tracks partial task completion step-by-step. The toolkit offers grounding accuracy, sub-skill scoring, long-range interaction plots, and a WandB visualization panel. The paper shows progress rate exposes differences missed by final success rate and reports broad model results (GPT‑4 leads; open-weight models lag).

Problem Statement

Existing agent benchmarks either lack multi-turn, partially observable tasks or rely on binary success rates that hide partial progress. This makes it hard to compare and debug LLM agents, especially when many models get near-zero success but still make meaningful intermediate progress.

Main Contribution

A unified, open-source benchmark (AGENTBOARD) with 9 task families and 1,013 human-verified environments covering embodied, game, web, and tool scenarios.

A fine-grained progress rate metric that scores intermediate progress per step using subgoal annotations or state matching, validated against human raters (Pearson ρ > 0.95).

An analytical evaluation toolkit and WandB visualization panel that reports progress curves, grounding accuracy, sub-skill scores, easy/hard splits, and trajectories.

A large empirical study comparing proprietary and open-weight LLM agents and showing practical diagnostics (e.g., grounding errors, context-length limits).
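The progress rate described above can be illustrated with a small sketch. This is one plausible reading of a subgoal-based score, not the toolkit's actual implementation: at each step it takes the fraction of annotated subgoals matched by the current state and keeps the best value seen so far. The function name `progress_curve`, the predicate-style subgoals, and the toy trajectory are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Set

# Assumption: a state is modeled as a set of string facts, and each subgoal
# is a predicate over a state. Real environments define their own state types.
State = Set[str]


def progress_curve(
    states: Sequence[State],
    subgoals: Sequence[Callable[[State], bool]],
) -> List[float]:
    """Return the running-best fraction of matched subgoals after each step."""
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    curve: List[float] = []
    best = 0.0
    for state in states:
        matched = sum(1 for is_met in subgoals if is_met(state))
        # Progress never decreases: keep the best score reached so far.
        best = max(best, matched / len(subgoals))
        curve.append(best)
    return curve


# Hypothetical toy episode with three annotated subgoals.
subgoals = [
    lambda s: "has_key" in s,
    lambda s: "door_open" in s,
    lambda s: "in_room_b" in s,
]
trajectory = [
    set(),
    {"has_key"},
    {"has_key"},
    {"has_key", "door_open"},
    {"door_open"},  # the agent dropped the key; progress stays at its best
]
curve = progress_curve(trajectory, subgoals)
succeeded = curve[-1] == 1.0  # binary success would only report this
```

In this toy episode the binary success flag is false, yet the curve still shows that two of three subgoals were reached, which is the kind of partial progress a success rate hides.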

Key Findings

Fine-grained progress rate exposes partial progress that success rate misses.

Numbers (example): Llama2-13b progress 18.9% vs Mistral-7b 24.6%, while both have ~2–3% success

Progress rate aligns strongly with human judgment.

Numbers: Pearson correlation ρ > 0.95 across tasks (60 trajectories/task)

Proprietary models outperform open-weight models on agent tasks.

Numbers: GPT-4 average progress rate 70.0% vs top open-weight ~40.2%

Grounding (producing valid actions) varies widely and limits performance.

Numbers: GPT-4 grounding accuracy 85.6% average; many open models <70%
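Grounding accuracy, as used in the findings above, is the share of generated actions that are valid. A minimal sketch follows, assuming validity can be approximated by a format check. The action grammar in `ACTION_PATTERN` and the helper names are hypothetical; in a real run, validity would be decided by the environment accepting or rejecting the action.

```python
import re
from typing import Callable, Sequence

# Hypothetical action grammar: verb[argument], with a small allowed verb set.
ACTION_PATTERN = re.compile(r"^(click|goto|new_tab)\[(.+)\]$")


def is_valid_action(action: str) -> bool:
    """Format-only validity check (an assumption, not the toolkit's rule)."""
    return ACTION_PATTERN.match(action.strip()) is not None


def grounding_accuracy(
    actions: Sequence[str],
    is_valid: Callable[[str], bool],
) -> float:
    """Fraction of actions judged valid; 0.0 for an empty trajectory."""
    if not actions:
        return 0.0
    return sum(1 for a in actions if is_valid(a)) / len(actions)


# Toy model outputs: two well-formed actions, one free-text reply,
# and one action with a verb outside the allowed set.
actions = [
    "click[Buy Now]",
    "goto[http://example.com]",
    "I think I should click the button",
    "scroll[down]",
]
accuracy = grounding_accuracy(actions, is_valid_action)
```

Separating the validity predicate from the ratio makes it easy to swap the format check for real environment feedback.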

Results

Average progress rate (GPT-4)

Value: 70.0%

Correlation between automatic progress rate and human scores

Value: Pearson ρ > 0.95

Grounding accuracy (GPT-4)

Value: 85.6%

Progress gap proprietary vs top open-weight

Value: 70.0% vs ~40.2%
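The agreement between the automatic progress rate and human scores is reported as a Pearson correlation. A minimal sketch of that kind of check is shown here, using a hand-written helper named `pearson` (hypothetical) and made-up toy scores that are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: automatic progress rates and the scores a
# human rater might assign to the same five trajectories.
auto_scores = [0.0, 0.25, 0.5, 0.75, 1.0]
human_scores = [0.1, 0.2, 0.55, 0.7, 0.95]
r = pearson(auto_scores, human_scores)
```

A value of `r` close to 1 would indicate that the automatic metric ranks trajectories much as the human rater does.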

Who Should Care

What To Try In 7 Days

Run AGENTBOARD on your agent setup to collect progress curves, not just final success.

Use grounding accuracy reports to fix action-format/IO errors before changing model weights.

Compare a smaller tuned model vs a stronger base model using progress rate to decide cost-vs-benefit.

Agent Features

Memory

  • sliding-window context for long interactions (LangChain-style)
  • supports variable context lengths per model
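The sliding-window memory listed above can be sketched as follows. This is an illustrative approximation rather than the toolkit's code: it keeps the most recent turns that fit a token budget, and it counts tokens by whitespace splitting, which is an assumption (a real setup would use the model's tokenizer). The name `sliding_window` is hypothetical.

```python
from typing import Callable, List, Sequence


def sliding_window(
    history: Sequence[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Keep the newest turns whose total token count fits within budget."""
    kept: List[str] = []
    used = 0
    # Walk backwards from the most recent turn and stop at the first
    # turn that would exceed the budget, so the window stays contiguous.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "obs: you are in a kitchen",
    "act: open fridge",
    "obs: the fridge is open",
    "act: take apple",
]
window = sliding_window(history, budget=8)
```

Passing a different `budget` per model is one simple way to reflect variable context lengths.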

Planning

  • multi-turn stepwise planning
  • subgoal decomposition (annotated)

Tool Use

  • function-calling style actions for tools (Todo, Sheets, APIs)
  • web actions (click, new tab, goto)

Frameworks

  • AGENTBOARD evaluation toolkit
  • WandB visualization
  • vLLM inference stack

Is Agentic

true

Architectures

  • reflex act-only agent (text input → action output)
  • one-shot in-context prompting
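A reflex, act-only agent with a one-shot prompt can be sketched as a simple loop: build a prompt from the example and the history, ask the model for an action, and execute it. Everything below is a toy illustration under stated assumptions: `LightSwitchEnv`, `run_episode`, and the stub policy are hypothetical stand-ins for a real environment and a real LLM call.

```python
from typing import Callable, List, Tuple


class LightSwitchEnv:
    """Hypothetical toy environment: the task is to turn on the light."""

    def reset(self) -> str:
        return "The light is off."

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "turn on light":
            return "The light is on.", True
        return "Nothing happens.", False


def run_episode(
    env: LightSwitchEnv,
    policy: Callable[[str], str],
    one_shot_example: str,
    max_steps: int = 10,
) -> Tuple[List[str], bool]:
    """Act-only loop: text observation in, action out, no explicit reasoning."""
    observation = env.reset()
    history: List[str] = []
    done = False
    for _ in range(max_steps):
        # The prompt is the one-shot example, then past turns, then the
        # current observation.
        prompt = "\n".join(
            [one_shot_example, *history, f"Observation: {observation}", "Action:"]
        )
        action = policy(prompt).strip()
        history.append(f"Observation: {observation}")
        history.append(f"Action: {action}")
        observation, done = env.step(action)
        if done:
            break
    return history, done


# Stub standing in for an LLM call; a real policy would send the prompt
# to a model and return its completion.
def stub_policy(prompt: str) -> str:
    return "turn on light"


example = "Observation: The fan is off.\nAction: turn on fan"
history, done = run_episode(LightSwitchEnv(), stub_policy, example)
```

Because the policy is just a function from prompt to action string, the same loop can wrap an API model or a locally served one.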

Optimization Features

Token Efficiency

  • sliding-window tradeoff for long interactions (keeps recent history)

Infra Optimization

  • runtime estimates for APIs vs local GPUs (Table 18)

Training Optimization

  • agent-specific instruction tuning improves open-weight models (AgentLM, xLAM)

Inference Optimization

  • use vLLM for faster batched decoding

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Relies on human-annotated subgoals to compute progress rate; annotation is costly and subjective (Appendix B).
  • Primarily evaluated in simulated/text environments; results may not transfer directly to real-world web or physical systems.
  • Some tasks required manual simplification to guarantee a unique subgoal path (affects fewer than roughly 5% of examples).

When Not To Use

  • When you need multimodal (vision/audio) agent evaluation — AGENTBOARD is text-only.
  • When you require real-world continuous web/system operation without heavy sandboxing.
  • If you cannot commit resources to annotate task-specific subgoals for new problems.

Failure Modes

  • Grounding errors: model outputs invalid actions or wrong formats and fails execution.
  • Context overflow: long interactions lose earlier context even with a sliding window.
  • Annotation bias: human-labeled subgoals can skew progress scores if inconsistent.

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • GPT-3.5-Turbo-16k
  • Claude2
  • Claude3-Haiku
  • Gemini1.5-Flash
  • Llama3-70b
  • Llama3-8b
  • Llama2-70b
  • Llama2-13b
  • Mistral-7b
  • CodeLlama-34b
  • CodeLlama-13b
  • DeepSeek-67b
  • AgentLM-70b
  • xLAM-70b
  • Lemur-70b
  • Vicuna-13b-16k
  • Text-Davinci-003

Metrics

  • Progress Rate (per-step, subgoal/match)
  • Success Rate (final completion)
  • Grounding Accuracy (share of valid actions)
  • Sub-skill Scores (memory, planning, world modeling, self-reflection, grounding, spatial)

Datasets

  • AGENTBOARD (9 tasks, 1,013 envs)
  • AlfWorld
  • ScienceWorld
  • BabyAI
  • Jericho
  • PDDL (PDDLGym)
  • WebShop
  • WebArena
  • Tool-Query
  • Tool-Operation

Benchmarks

  • AgentBench
  • GAIA
  • MINT
  • API-Bank
  • ToolEval