AgentBoard: a 9-task, 1,013-environment benchmark + toolkit that tracks stepwise progress for multi-turn LLM agents

Overview

Decision Snapshot: Ready For Pilot

The benchmark and toolkit are ready for research and diagnostics (open-source, WandB panels). Wider production use needs careful sandboxing for real web/tool actions and scaling of human subgoal annotation.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AgentBoard gives stepwise progress signals and diagnostic visualizations so teams can see partial improvements, debug grounding/formatting faults, and prioritize model upgrades or targeted fine-tuning instead of chasing binary success.

Who Should Care

Product Manager, ML Engineer, CTO, Founder

Summary TLDR

AgentBoard is an open-source benchmark and evaluation toolkit for LLMs acting as agents. It bundles 9 diverse, text-only tasks (1,013 environments) that require multi-turn interaction in partially observable settings, and introduces a human-verified fine-grained progress rate that tracks partial task completion step-by-step. The toolkit offers grounding accuracy, sub-skill scoring, long-range interaction plots, and a WandB visualization panel. The paper shows progress rate exposes differences missed by final success rate and reports broad model results (GPT‑4 leads; open-weight models lag).

Problem Statement

Existing agent benchmarks either lack multi-turn, partially observable tasks or rely on binary success rates that hide partial progress. This makes it hard to compare and debug LLM agents, especially when many models get near-zero success but still make meaningful intermediate progress.

Main Contribution

A unified, open-source benchmark (AGENTBOARD) with 9 task families and 1,013 human-verified environments covering embodied, game, web, and tool scenarios.

A fine-grained progress rate metric that scores intermediate progress per step using subgoal annotations or state matching, validated against human raters (Pearson ρ > 0.95).
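To make the idea of per-step progress concrete, here is a minimal sketch of a subgoal-based progress curve. It is an illustration under stated assumptions, not the toolkit's implementation: the names `progress_curve`, `State`, and `Subgoal` are hypothetical, the toy trajectory is made up, and the rule that a subgoal stays met once any earlier state satisfies it is an assumption chosen so the curve never decreases.

```python
from typing import Callable, List, Sequence

State = str                        # hypothetical: a textual observation of the environment
Subgoal = Callable[[State], bool]  # hypothetical: predicate checking one annotated subgoal


def progress_curve(states: Sequence[State], subgoals: Sequence[Subgoal]) -> List[float]:
    """Return progress after each step as the fraction of subgoals met so far.

    Assumption: a subgoal counts as met once any state up to that step
    satisfies it, so the curve is non-decreasing.
    """
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    met = [False] * len(subgoals)
    curve = []
    for state in states:
        for k, check in enumerate(subgoals):
            if not met[k] and check(state):
                met[k] = True
        curve.append(sum(met) / len(subgoals))
    return curve


# Toy trajectory (made-up data): the agent finds the key and opens the door,
# but never reaches the exit.
subgoals = [
    lambda s: "key" in s,
    lambda s: "door open" in s,
    lambda s: "exit" in s,
]
states = [
    "empty room",
    "holding key",
    "holding key, door open",
    "holding key, door open",
]
curve = progress_curve(states, subgoals)
final_progress = curve[-1]               # partial credit: 2 of 3 subgoals met
success = float(final_progress == 1.0)   # binary success is still 0
```

In this toy run the binary success signal is 0, while the progress curve still records that two of three subgoals were reached, which is the kind of difference the metric is meant to surface.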

Key Findings

Fine-grained progress rate exposes partial progress that success rate misses.

Numbers: For example, Llama2-13b reaches 18.9% progress vs. 24.6% for Mistral-7b, while both have ~2–3% success.

Practical Use: Use progress rate to track incremental improvements and rank models when final success is near zero.

Evidence Ref: Table 3, §4.2

Progress rate aligns strongly with human judgment.

Numbers: Pearson correlation ρ > 0.95 across tasks (60 trajectories/task)

Practical Use: Automatic progress scores are reliable proxies for human assessment and can be used for large-scale comparisons.

Evidence Ref: Figure 3, §3.2
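The alignment check above rests on a Pearson correlation between automatic progress scores and human ratings. The sketch below shows how such a correlation could be computed. The function name `pearson` and both score lists are hypothetical and made up for illustration; they are not data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for six trajectories (illustrative only).
auto_scores = [0.0, 0.25, 0.5, 0.5, 0.75, 1.0]   # automatic progress rate
human_scores = [0.0, 0.2, 0.5, 0.6, 0.8, 1.0]    # averaged human rating
r = pearson(auto_scores, human_scores)
```

A value of `r` close to 1 means the automatic score orders trajectories almost the same way human raters do.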

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average progress rate (GPT-4)	70.0%	—	—	Avg. across 9 AGENTBOARD environments	Table 3 reports GPT-4 avg progress 70.0%	Table 3
Correlation between automatic progress rate and human scores	Pearson ρ > 0.95	—	—	60 trajectories per task, 8 tasks	Figure 3 and §3.2 report Pearson correlation >0.95 and substantial Fleiss' κ agreement	§3.2, Figure 3

What To Try In 7 Days

Run AGENTBOARD on your agent setup to collect progress curves, not just final success.

Use grounding accuracy reports to fix action-format/IO errors before changing model weights.

Compare a smaller tuned model vs a stronger base model using progress rate to decide cost-vs-benefit.
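As a starting point for the grounding check in the second step, the sketch below treats grounding accuracy as the share of model outputs that parse as a valid action. This definition, the function name `grounding_accuracy`, and the action grammar are assumptions made for illustration; the toolkit's own action formats and scoring may differ.

```python
import re
from typing import Iterable

# Hypothetical action grammar: "click [id]", "goto [url]", or "new_tab".
VALID_ACTION = re.compile(r"^(click \[\d+\]|goto \[\S+\]|new_tab)$")


def grounding_accuracy(actions: Iterable[str]) -> float:
    """Fraction of emitted actions that match the expected action format."""
    actions = list(actions)
    if not actions:
        return 0.0
    valid = sum(1 for a in actions if VALID_ACTION.match(a.strip()))
    return valid / len(actions)


# Made-up model outputs: the third one is free text, not a valid action.
outputs = [
    "click [12]",
    "goto [https://example.com]",
    "Click the button",
    "new_tab",
]
acc = grounding_accuracy(outputs)  # 3 of the 4 outputs parse as valid actions
```

A low value here points at prompt or output-format problems, which are usually cheaper to fix than the model itself.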

Agent Features

Memory

sliding-window context for long interactions (LangChain-style)

supports variable context lengths per model
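A sliding-window memory can be sketched as keeping the system prompt plus as many of the most recent turns as fit within a per-model token budget. This is a minimal illustration, not the toolkit's code: `sliding_window` and `count_tokens` are hypothetical names, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Crude stand-in (assumption): whitespace-separated words.
    # A real setup would use the target model's tokenizer.
    return len(text.split())


def sliding_window(system_prompt: str, history: List[str], max_tokens: int) -> List[str]:
    """Keep the system prompt plus the most recent turns that fit in max_tokens."""
    budget = max_tokens - count_tokens(system_prompt)
    kept: List[str] = []
    for turn in reversed(history):      # walk from the newest turn backwards
        cost = count_tokens(turn)
        if cost > budget:
            break                       # older turns are dropped
        kept.append(turn)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


# Made-up interaction history.
history = [
    "obs: you are in a kitchen",
    "act: open fridge",
    "obs: the fridge is open",
    "act: take apple",
]
window = sliding_window("You are an agent.", history, max_tokens=14)
```

Passing a different `max_tokens` per model is one simple way to support variable context lengths; the cost is that early observations fall out of the window on long interactions.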

Planning

multi-turn stepwise planning

subgoal decomposition (annotated)

Tool Use

function-calling style actions for tools (Todo, Sheets, APIs)

web actions (click, new tab, goto)

Frameworks

AGENTBOARD evaluation toolkit

WandB visualization

vLLM inference stack

Is Agentic

Yes

Architectures

reflex act-only agent (text input → action output)

one-shot in-context prompting

Optimization Features

Token Efficiency

sliding-window tradeoff for long interactions (keeps recent history)

Infra Optimization

runtime estimates for APIs vs local GPUs (Table 18)

Training Optimization

agent-specific instruction tuning improves open-weight models (AgentLM, xLAM)

Inference Optimization

use vLLM for faster batched decoding

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/hkust-nlp/AgentBoard

Data URLs

https://github.com/hkust-nlp/AgentBoard

Risks & Boundaries

Limitations

Relies on human-annotated subgoals to compute progress rate; annotation is costly and subjective (Appendix B).

Primarily evaluated in simulated/text environments; results may not transfer directly to real-world web or physical systems.

When Not To Use

When you need multimodal (vision/audio) agent evaluation — AGENTBOARD is text-only.

When you require real-world continuous web/system operation without heavy sandboxing.

Failure Modes

Grounding errors: model outputs invalid actions or wrong formats and fails execution.

Context overflow: long interactions lose earlier context even with sliding window.

Core Entities

Models

GPT-4, GPT-3.5-Turbo, GPT-3.5-Turbo-16k, Claude2, Claude3-Haiku, Gemini1.5-Flash, Llama3-70b, Llama3-8b, Llama2-70b, Llama2-13b, Mistral-7b, CodeLlama-34b, CodeLlama-13b, DeepSeek-67b, AgentLM-70b, xLAM-70b, Lemur-70b, Vicuna-13b-16k, Text-Davinci-003

Metrics

Progress Rate (per-step, subgoal/match)

Success Rate (final completion)

Accuracy

Sub-skill Scores (memory, planning, world modeling, self-reflection, grounding, spatial)

Datasets

AGENTBOARD (9 tasks, 1,013 envs); AlfWorld; ScienceWorld; BabyAI; Jericho; PDDL (PDDLGym); WebShop; WebArena; Tool-Query; Tool-Operation

Benchmarks

AgentBench, GAIA, MINT, API-Bank, ToolEval
