Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
8
Why It Matters For Business
AgentBoard gives stepwise progress signals and diagnostic visualizations so teams can see partial improvements, debug grounding/formatting faults, and prioritize model upgrades or targeted fine-tuning instead of chasing binary success.
Summary TLDR
AgentBoard is an open-source benchmark and evaluation toolkit for LLMs acting as agents. It bundles 9 diverse, text-only tasks (1,013 environments) that require multi-turn interaction in partially observable settings, and introduces a human-verified, fine-grained progress rate that tracks partial task completion step by step. The toolkit offers grounding accuracy, sub-skill scoring, long-range interaction plots, and a WandB visualization panel. The paper shows that progress rate exposes differences missed by final success rate and reports results across a broad set of models (GPT-4 leads; open-weight models lag).
Problem Statement
Existing agent benchmarks either lack multi-turn, partially observable tasks or rely on binary success rates that hide partial progress. This makes it hard to compare and debug LLM agents, especially when many models get near-zero success but still make meaningful intermediate progress.
Main Contribution
A unified, open-source benchmark (AGENTBOARD) with 9 task families and 1,013 human-verified environments covering embodied, game, web, and tool scenarios.
A fine-grained progress rate metric that scores intermediate progress per step using subgoal annotations or state matching, validated against human raters (Pearson ρ > 0.95).
An analytical evaluation toolkit and WandB visualization panel that report progress curves, grounding accuracy, sub-skill scores, easy/hard splits, and trajectories.
A large empirical study comparing proprietary and open-weight LLM agents and showing practical diagnostics (e.g., grounding errors, context-length limits).
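A minimal sketch of how a subgoal-based progress rate could be computed. It assumes each state is reduced to the set of subgoals it satisfies and that progress is the running maximum of the matched fraction. The function names and the toy episode are illustrative and are not the toolkit's API.

```python
from typing import List, Sequence, Set


def subgoal_score(achieved: Set[str], subgoals: Sequence[str]) -> float:
    """Fraction of annotated subgoals satisfied in a single state."""
    if not subgoals:
        raise ValueError("at least one subgoal is required")
    return sum(1 for goal in subgoals if goal in achieved) / len(subgoals)


def progress_curve(states: Sequence[Set[str]], subgoals: Sequence[str]) -> List[float]:
    """Running maximum of the per-state score, so progress never decreases."""
    curve: List[float] = []
    best = 0.0
    for achieved in states:
        best = max(best, subgoal_score(achieved, subgoals))
        curve.append(best)
    return curve


# Hypothetical episode: the agent regresses at the last step,
# but the running maximum keeps the credit already earned.
subgoals = ["open drawer", "take key", "unlock door", "exit room"]
states = [set(), {"open drawer"}, {"open drawer", "take key"}, {"open drawer"}]
curve = progress_curve(states, subgoals)
print(curve)  # [0.0, 0.25, 0.5, 0.5]
```

A binary success metric would score this episode 0, while the curve shows the agent got halfway.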
Key Findings
Fine-grained progress rate exposes partial progress that success rate misses.
Progress rate aligns strongly with human judgment.
Proprietary models outperform open-weight models on agent tasks.
Grounding (producing valid actions) varies widely and limits performance.
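The agreement with human judgment is reported as a Pearson correlation. A short sketch of that check follows, assuming paired automatic and human scores per trajectory. The numbers below are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: automatic progress rate vs. human-assigned progress.
automatic = [0.0, 0.25, 0.5, 0.75, 1.0]
human = [0.1, 0.2, 0.5, 0.8, 0.9]
r = pearson(automatic, human)
print(round(r, 3))
```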
Results
Average progress rate (GPT-4)
Correlation between automatic progress rate and human scores
Accuracy
Progress gap (proprietary vs. top open-weight)
Who Should Care
What To Try In 7 Days
Run AGENTBOARD on your agent setup to collect progress curves, not just final success.
Use grounding accuracy reports to fix action-format/IO errors before changing model weights.
Compare a smaller tuned model vs a stronger base model using progress rate to decide cost-vs-benefit.
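For the grounding check above, one simple reading of grounding accuracy is the share of emitted actions that the environment accepts as valid. The sketch below assumes a fixed set of valid actions for the whole episode, which is a simplification (real environments usually change the valid set at every step), and the helper name is hypothetical.

```python
from typing import Iterable, Set


def grounding_accuracy(actions: Iterable[str], valid_actions: Set[str]) -> float:
    """Share of emitted actions that match a valid action after normalization."""
    emitted = list(actions)
    if not emitted:
        return 0.0
    valid = sum(1 for action in emitted if action.strip().lower() in valid_actions)
    return valid / len(emitted)


# Hypothetical episode: one of four actions is not executable.
valid = {"go north", "open door", "take key"}
emitted = ["go north", "Open door", "fly away", "take key"]
print(grounding_accuracy(emitted, valid))  # 0.75
```

A low value here points at prompt or output-format problems, which are cheaper to fix than swapping the model.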
Agent Features
Memory
- sliding-window context for long interactions (LangChain-style)
- supports variable context lengths per model
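A rough sketch of a sliding-window context, assuming the task instruction is always kept and the most recent turns are added until a per-model token budget is used up. The whitespace token counter is a stand-in for a real tokenizer, and all names are illustrative.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real setup would use the model's tokenizer."""
    return len(text.split())


def sliding_window(instruction: str, history: List[str], max_tokens: int) -> List[str]:
    """Keep the instruction plus the most recent turns that fit in max_tokens."""
    budget = max_tokens - count_tokens(instruction)
    kept: List[str] = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break  # older turns are dropped once the budget is exhausted
        kept.append(turn)
        budget -= cost
    return [instruction] + kept[::-1]


history = [
    "obs: you are in a kitchen",
    "act: open fridge",
    "obs: the fridge is open",
    "act: take milk",
]
prompt = sliding_window("Goal: fetch the milk", history, max_tokens=16)
print(prompt)
```

With a budget of 16 tokens the oldest observation is dropped, which is the information loss that long interactions have to tolerate.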
Planning
- multi-turn stepwise planning
- subgoal decomposition (annotated)
Tool Use
- function-calling style actions for tools (Todo, Sheets, APIs)
- web actions (click, new tab, goto)
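A sketch of parsing function-calling style actions from model output, assuming a `name(arg1, arg2)` text format. Both the format and the small tool registry are hypothetical and only illustrate the idea; the naive comma split would break on arguments that contain commas.

```python
import re
from typing import List, Optional, Tuple

ACTION_PATTERN = re.compile(r"^\s*(\w+)\((.*)\)\s*$")

# Hypothetical tool registry: tool name -> expected number of arguments.
TOOLS = {"click": 1, "goto": 1, "new_tab": 0, "add_task": 2}


def parse_action(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (tool, args) for a well-formed call to a known tool, else None."""
    match = ACTION_PATTERN.match(text)
    if match is None:
        return None
    name, raw_args = match.group(1), match.group(2).strip()
    args = [arg.strip() for arg in raw_args.split(",")] if raw_args else []
    if TOOLS.get(name) != len(args):
        return None  # unknown tool or wrong number of arguments
    return name, args


print(parse_action("goto(https://example.com)"))
print(parse_action("I will click the button"))  # None
```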
Frameworks
- AGENTBOARD evaluation toolkit
- WandB visualization
- vLLM inference stack
Is Agentic
true
Architectures
- reflex act-only agent (text input → action output)
- one-shot in-context prompting
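A minimal sketch of an act-only loop with a one-shot example in the prompt. The environment is a toy stand-in and `stub_policy` replaces the LLM call; both are assumptions for illustration, not part of the benchmark.

```python
from typing import Callable, List, Tuple


class ToyEnv:
    """Hypothetical stand-in environment: reach position 3 on a line."""

    def __init__(self) -> None:
        self.pos = 0

    def reset(self) -> str:
        self.pos = 0
        return "position 0, goal 3"

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "right":
            self.pos += 1
        elif action == "left":
            self.pos = max(0, self.pos - 1)
        return f"position {self.pos}, goal 3", self.pos == 3


def run_episode(env: ToyEnv, policy: Callable[[str], str], example: str,
                max_steps: int = 10) -> List[str]:
    """Act-only loop: each turn maps the prompt text straight to one action."""
    history = [example, env.reset()]
    actions: List[str] = []
    for _ in range(max_steps):
        action = policy("\n".join(history))
        actions.append(action)
        observation, done = env.step(action)
        history += [f"action: {action}", observation]
        if done:
            break
    return actions


def stub_policy(prompt: str) -> str:
    """Placeholder for an LLM call; always moves right."""
    return "right"


actions = run_episode(ToyEnv(), stub_policy,
                      example="Example: position 0, goal 1 -> action: right")
print(actions)  # ['right', 'right', 'right']
```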
Optimization Features
Token Efficiency
- sliding-window tradeoff for long interactions (keeps recent history)
Infra Optimization
- runtime estimates for APIs vs local GPUs (Table 18)
Training Optimization
- agent-specific instruction tuning improves open-weight models (AgentLM, xLAM)
Inference Optimization
- use vLLM for faster batched decoding
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Relies on human-annotated subgoals to compute progress rate; annotation is costly and subjective (Appendix B).
- Primarily evaluated in simulated/text environments; results may not transfer directly to real-world web or physical systems.
- Some tasks required manual simplification to guarantee a unique subgoal path (affects fewer than roughly 5% of examples).
When Not To Use
- When you need multimodal (vision/audio) agent evaluation — AGENTBOARD is text-only.
- When you require real-world continuous web/system operation without heavy sandboxing.
- If you cannot commit resources to annotate task-specific subgoals for new problems.
Failure Modes
- Grounding errors: model outputs invalid actions or wrong formats and fails execution.
- Context overflow: in long interactions the sliding window keeps only recent history, so earlier context is lost.
- Annotation bias: human-labeled subgoals can skew progress scores if inconsistent.
Core Entities
Models
- GPT-4
- GPT-3.5-Turbo
- GPT-3.5-Turbo-16k
- Claude2
- Claude3-Haiku
- Gemini1.5-Flash
- Llama3-70b
- Llama3-8b
- Llama2-70b
- Llama2-13b
- Mistral-7b
- CodeLlama-34b
- CodeLlama-13b
- DeepSeek-67b
- AgentLM-70b
- xLAM-70b
- Lemur-70b
- Vicuna-13b-16k
- Text-Davinci-003
Metrics
- Progress Rate (per-step, subgoal/match)
- Success Rate (final completion)
- Accuracy
- Sub-skill Scores (memory, planning, world modeling, self-reflection, grounding, spatial)
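A small sketch of why progress rate and success rate can disagree when aggregated over episodes. It assumes each episode is summarized by its final progress value, with 1.0 meaning the task was completed. The two score lists are invented for illustration.

```python
from statistics import mean
from typing import Dict, List


def summarize(final_progress: List[float]) -> Dict[str, float]:
    """Aggregate per-episode final progress into success and progress rates."""
    return {
        "success_rate": mean([1.0 if p >= 1.0 else 0.0 for p in final_progress]),
        "progress_rate": mean(final_progress),
    }


# Illustrative values only: neither model ever finishes a task,
# yet model_b consistently gets partway there.
model_a = [0.0, 0.0, 0.0, 0.0]
model_b = [0.5, 0.25, 0.75, 0.5]
print(summarize(model_a))  # {'success_rate': 0.0, 'progress_rate': 0.0}
print(summarize(model_b))  # {'success_rate': 0.0, 'progress_rate': 0.5}
```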
Datasets
- AGENTBOARD (9 tasks, 1,013 envs)
- AlfWorld
- ScienceWorld
- BabyAI
- Jericho
- PDDL (PDDLGym)
- WebShop
- WebArena
- Tool-Query
- Tool-Operation
Benchmarks
- AgentBench
- GAIA
- MINT
- API-Bank
- ToolEval

