AgentBoard: a 9-task, 1,013-environment benchmark + toolkit that tracks stepwise progress for multi-turn LLM agents

January 24, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The benchmark and toolkit are ready for research and diagnostics (open-source, WandB panels). Wider production use needs careful sandboxing for real web/tool actions and scaling of human subgoal annotation.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgentBoard gives stepwise progress signals and diagnostic visualizations so teams can see partial improvements, debug grounding/formatting faults, and prioritize model upgrades or targeted fine-tuning instead of chasing binary success.

Who Should Care

Summary TLDR

AgentBoard is an open-source benchmark and evaluation toolkit for LLMs acting as agents. It bundles 9 diverse, text-only tasks (1,013 environments) that require multi-turn interaction in partially observable settings, and introduces a human-verified, fine-grained progress rate that tracks partial task completion step by step. The toolkit offers grounding accuracy, sub-skill scoring, long-range interaction plots, and a WandB visualization panel. The paper shows that progress rate exposes differences missed by final success rate and reports results across a broad set of models (GPT-4 leads; open-weight models lag).

Problem Statement

Existing agent benchmarks either lack multi-turn, partially observable tasks or rely on binary success rates that hide partial progress. This makes it hard to compare and debug LLM agents, especially when many models get near-zero success but still make meaningful intermediate progress.

Main Contribution

A unified, open-source benchmark (AGENTBOARD) with 9 task families and 1,013 human-verified environments covering embodied, game, web, and tool scenarios.

A fine-grained progress rate metric that scores intermediate progress per step using subgoal annotations or state matching, validated against human raters (Pearson ρ>0.95).
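A minimal sketch of how a per-step progress rate could be derived from subgoal annotations. It assumes a subgoal counts as reached once it appears in any visited state, so progress never decreases. The state representation (a set of satisfied facts), the subgoal names, and the function name are hypothetical and not the toolkit's API.

```python
from typing import List, Set


def progress_curve(states: List[Set[str]], subgoals: List[str]) -> List[float]:
    """Return the fraction of annotated subgoals reached after each step.

    Assumption: a subgoal stays "reached" once any visited state satisfied it.
    """
    goal_set = set(subgoals)
    if not goal_set:
        raise ValueError("at least one subgoal is required")
    reached: Set[str] = set()
    curve: List[float] = []
    for state in states:
        reached |= state & goal_set          # newly satisfied subgoals
        curve.append(len(reached) / len(goal_set))
    return curve


# Hypothetical episode: three annotated subgoals, four observed states.
subgoals = ["open_drawer", "take_key", "unlock_door"]
states = [set(), {"open_drawer"}, {"open_drawer", "take_key"}, {"take_key"}]
curve = progress_curve(states, subgoals)
succeeded = curve[-1] == 1.0
```

In this toy episode the curve rises to two of three subgoals and stays there, while a binary success flag would only report failure.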

Key Findings

Fine-grained progress rate exposes partial progress that success rate misses.

Numbers: for example, Llama2-13b scores 18.9% progress vs 24.6% for Mistral-7b, while both sit at ~2-3% success

Practical Use: Use progress rate to track incremental improvements and rank models when final success is near zero.

Evidence Ref: Table 3, §4.2
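To illustrate why the two metrics can disagree, the sketch below aggregates final per-episode progress values into a success rate and a mean progress rate. The episode values are invented toy numbers, not results from the paper, and the helper names are hypothetical.

```python
from statistics import mean
from typing import List


def success_rate(final_progress: List[float]) -> float:
    """Share of episodes that reached full progress (1.0)."""
    return mean([1.0 if p >= 1.0 else 0.0 for p in final_progress])


def mean_progress(final_progress: List[float]) -> float:
    """Average final progress across episodes."""
    return mean(final_progress)


# Toy data: final progress per episode for two hypothetical models.
model_a = [0.0, 0.25, 0.0, 1.0]
model_b = [0.5, 0.75, 0.5, 1.0]

same_success = success_rate(model_a) == success_rate(model_b)
gap = mean_progress(model_b) - mean_progress(model_a)
```

Both toy models solve one episode in four, yet the second gets much further on the episodes it fails, which only the progress average reveals.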

Progress rate aligns strongly with human judgment.

Numbers: Pearson correlation ρ > 0.95 across tasks (60 trajectories/task)

Practical Use: Automatic progress scores are reliable proxies for human assessment; you can trust them for large-scale comparisons.

Evidence Ref: Figure 3, §3.2
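A check of this kind can be reproduced on your own trajectories by correlating automatic progress scores with human ratings. The sketch below implements the standard Pearson formula; the two score lists are placeholder values for illustration only, not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder scores for five trajectories (illustrative only).
automatic_scores = [0.0, 0.25, 0.5, 0.75, 1.0]
human_scores = [0.0, 0.5, 0.5, 0.75, 1.0]
r = pearson(automatic_scores, human_scores)
```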

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average progress rate (GPT-4) | 70.0% | - | - | Avg. across 9 AGENTBOARD tasks | Table 3 reports GPT-4 avg progress 70.0% | Table 3
Correlation between automatic progress rate and human scores | Pearson ρ > 0.95 | - | - | 60 trajectories per task, 8 tasks | Figure 3 and §3.2 report Pearson correlation > 0.95 and substantial Fleiss' κ agreement | §3.2, Figure 3

What To Try In 7 Days

Run AGENTBOARD on your agent setup to collect progress curves, not just final success.

Use grounding accuracy reports to fix action-format/IO errors before changing model weights.

Compare a smaller tuned model vs a stronger base model using progress rate to decide cost-vs-benefit.
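For the grounding step above, a rough sketch of a grounding-accuracy check follows. It assumes grounding accuracy means the share of emitted actions that match a valid action format. The action grammar here (`click[...]`, `goto[...]`, `new_tab`) is a hypothetical stand-in; the toolkit's real action formats may differ.

```python
import re
from typing import Iterable

# Hypothetical action grammar, used only for illustration.
VALID_ACTION = re.compile(r"^(click\[[^\[\]]+\]|goto\[[^\[\]]+\]|new_tab)$")


def grounding_accuracy(actions: Iterable[str]) -> float:
    """Fraction of model outputs that parse as a valid action."""
    cleaned = [a.strip() for a in actions]
    if not cleaned:
        return 0.0
    valid = sum(1 for a in cleaned if VALID_ACTION.match(a))
    return valid / len(cleaned)


# Toy model outputs: the third one is free text, not a valid action.
outputs = ["click[Buy Now]", "goto[https://example.com]", "Click the button", "new_tab"]
accuracy = grounding_accuracy(outputs)
```

A low value here points to prompt or output-format problems, which are usually cheaper to fix than model weights.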

Agent Features

Memory
sliding-window context for long interactions (LangChain-style)
supports variable context lengths per model
Planning
multi-turn stepwise planning
subgoal decomposition (annotated)
Tool Use
function-calling style actions for tools (Todo, Sheets, APIs)
web actions (click, new tab, goto)
Frameworks
AGENTBOARD evaluation toolkit
WandB visualization
vLLM inference stack
Is Agentic

Yes

Architectures
reflex act-only agent (text input → action output)
one-shot in-context prompting
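The act-only setup can be pictured as a simple loop: the prompt is a one-shot example plus the interaction history, and the model's output is taken directly as the next action. The sketch below is an assumption about the general shape of such a loop, not the toolkit's code; `toy_policy` and `toy_step` are stand-ins so it runs without a model or a real environment.

```python
from typing import Callable, List, Tuple


def run_episode(policy: Callable[[str], str],
                step: Callable[[str], Tuple[str, bool]],
                first_observation: str,
                example: str,
                max_steps: int = 10) -> List[str]:
    """Act-only loop: build prompt, read one action, apply it, repeat."""
    history = [f"Observation: {first_observation}"]
    actions: List[str] = []
    for _ in range(max_steps):
        prompt = example + "\n" + "\n".join(history) + "\nAction:"
        action = policy(prompt).strip()
        observation, done = step(action)
        actions.append(action)
        history += [f"Action: {action}", f"Observation: {observation}"]
        if done:
            break
    return actions


def toy_policy(prompt: str) -> str:
    """Stand-in for an LLM call: reacts only to the latest observation."""
    latest = prompt.rsplit("Observation:", 1)[-1]
    return "open door" if "door is closed" in latest else "go north"


def toy_step(action: str) -> Tuple[str, bool]:
    """Stand-in environment: the episode ends after 'go north'."""
    if action == "go north":
        return "You are outside.", True
    return "The door is open.", False


actions = run_episode(toy_policy, toy_step, "The door is closed.",
                      "Example task and solution go here.")
```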

Optimization Features

Token Efficiency
sliding-window tradeoff for long interactions (keeps recent history)
Infra Optimization
runtime estimates for APIs vs local GPUs (Table 18)
Training Optimization
agent-specific instruction tuning improves open-weight models (AgentLM, xLAM)
Inference Optimization
use vLLM for faster batched decoding

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Relies on human-annotated subgoals to compute progress rate; annotation is costly and subjective (Appendix B).

Primarily evaluated in simulated/text environments; results may not transfer directly to real-world web or physical systems.

When Not To Use

When you need multimodal (vision/audio) agent evaluation, since AGENTBOARD is text-only.

When you require real-world continuous web/system operation without heavy sandboxing.

Failure Modes

Grounding errors: model outputs invalid actions or wrong formats and fails execution.

Context overflow: long interactions exceed the context length, and the sliding window keeps only recent history, so earlier context is lost.
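The context-overflow failure mode follows directly from how a sliding window works. A minimal sketch, assuming the window keeps the most recent turns that fit a fixed budget and approximating token counts by whitespace-separated words (a real setup would use the model's tokenizer); the function name is hypothetical:

```python
from typing import Callable, List


def sliding_window(turns: List[str], budget: int,
                   count: Callable[[str], int] = lambda t: len(t.split())) -> List[str]:
    """Keep the most recent turns whose total size fits within the budget."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count(turn)
        if used + cost > budget:
            break                         # everything older is dropped
        kept.append(turn)
        used += cost
    return kept[::-1]                     # restore chronological order


turns = [
    "obs: you see a key",
    "act: take key",
    "obs: you have the key",
    "act: open door",
]
window = sliding_window(turns, budget=8)
```

With a budget of 8 words only the last two turns survive, so the agent no longer sees where the key came from.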

Core Entities

Models

GPT-4, GPT-3.5-Turbo, GPT-3.5-Turbo-16k, Claude2, Claude3-Haiku, Gemini1.5-Flash, Llama3-70b, Llama3-8b, Llama2-70b, Llama2-13b, Mistral-7b, CodeLlama-34b, CodeLlama-13b, DeepSeek-67b, AgentLM-70b, xLAM-70b, Lemur-70b, Vicuna-13b-16k, Text-Davinci-003

Metrics

Progress Rate (per-step, subgoal/match); Success Rate (final completion); Accuracy; Sub-skill Scores (memory, planning, world modeling, self-reflection, grounding, spatial)

Datasets

AGENTBOARD (9 tasks, 1,013 envs); AlfWorld; ScienceWorld; BabyAI; Jericho; PDDL (PDDLGym); WebShop; WebArena; Tool-Query; Tool-Operation

Benchmarks

AgentBench, GAIA, MINT, API-Bank, ToolEval