A 1,000-task, real-server benchmark that measures how well LLMs discover and use tools

January 31, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, Bing Liu

Links

Abstract / PDF

Why It Matters For Business

MCP-Atlas measures how reliably models can find and use production APIs. That matters when you want LLMs to automate multi-step tasks across services: the benchmark highlights that tool awareness and parameter correctness are the biggest practical risks.

Summary TLDR

MCP-Atlas is a 1,000-task benchmark that evaluates LLMs' real-world tool-use skills against 36 live MCP servers and 220 real tools. Tasks require 3–6 tool calls, often across multiple servers, and scoring is objective: a claims-based rubric gives partial credit for correct facts in the final answer. Top models reach ~50–62% pass rates; most failures come from not using the right tools or missing subgoals.

Problem Statement

Existing evaluations rely on mock servers, small task sets, or subjective judge models, so they fail to measure real-world multi-step tool orchestration. Practitioners need a large, reproducible, objective benchmark that uses real servers and provides targeted diagnostics.

Main Contribution

A large, realistic benchmark: 1,000 single-turn tasks over 36 real MCP servers and 220 tools, designed to elicit 3–6 tool calls and cross-server workflows.

Claims-based scoring: objective, partial-credit rubric that checks atomic factual claims in the final answer (trajectory-independent).

Public artifacts: a containerized evaluation harness, schemas, and a 500-task public subset; remaining 500 tasks held out for validation.

Rich diagnostics: internal metrics for discovery, parameter correctness, schema typing, error recovery, and efficiency to explain failures.
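To make the claims-based scoring idea concrete, here is a minimal sketch of how a coverage score and a pass decision could be computed from per-claim judgments. The names `ClaimResult`, `coverage`, and `passes`, the per-claim score range, the pass threshold, and the example claims are all assumptions made for illustration; the summary above does not give the benchmark's actual scoring formula or cutoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimResult:
    claim: str    # one atomic factual claim from the rubric
    score: float  # assumed judge credit: 0.0 (absent) up to 1.0 (fully supported)


def coverage(results: List[ClaimResult]) -> float:
    """Mean credit across all rubric claims; 0.0 for an empty rubric."""
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)


def passes(results: List[ClaimResult], threshold: float) -> bool:
    """A task passes when coverage reaches the threshold (an assumed knob)."""
    return coverage(results) >= threshold


# Made-up claims for illustration only.
results = [
    ClaimResult("Repository has 3 open issues", 1.0),
    ClaimResult("Latest release is v2.1", 1.0),
    ClaimResult("Closing price was 101.50", 0.0),
    ClaimResult("Author is listed as maintainer", 1.0),
]
print(coverage(results))      # 0.75
print(passes(results, 0.75))  # True
```

Because the score depends only on claims found in the final answer, two agents that take different tool-call paths to the same facts get the same result, which is what trajectory-independent means here.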

Key Findings

Top frontier models pass a majority of tasks but still leave large gaps.

Numbers: Best pass rate 62.3% (Claude Opus 4.5) on 1,000 tasks

Tool-usage mistakes are the dominant failure mode.

Numbers: Tool Usage = 56.7% of failed tasks (avg)

Task-understanding (missing subgoals, premature stop) is the second-largest failure cause.

Numbers: Task Understanding = 30.3% of failed tasks (avg)

Claims-based automatic judging aligns reasonably well with humans.

Numbers: 78% agreement between LLM judge and human annotators

Failure patterns vary by domain; tool-usage failures are most concentrated in coding and finance.

Numbers: Tool Usage failures: CODING 71%, FINANCIAL 64% (domain averages)

Results

Pass Rate (best model)

Value: 62.3%

Mean Coverage (best model)

Value: 78.5%

Failure distribution (average over models)

Value: Tool Usage 56.7% | Task Understanding 30.3% | Response Quality 8.5% | Logical Errors 4.5%

Judge-human agreement

Value: 78%

Who Should Care

What To Try In 7 Days

Run the 500-task public subset on your target models to get a quick, domain-relevant baseline.

Add lightweight schema checks and parameter validation layers before tool calls to cut 'incorrect parameter' failures.

Implement a small planning step that enumerates subgoals; measure reduction in 'partial completion' errors.
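The schema-check suggestion above could start as small as the sketch below: a pre-flight validator that compares proposed arguments with a tool's JSON-Schema-style input schema before the call is sent. Everything here is hypothetical (the `validate_args` helper, the `get_quote` schema, and the choice to reject unknown parameters), and it covers only a small subset of JSON Schema; a full validator library would be the better choice in production.

```python
from typing import Any, Dict, List

# Maps JSON Schema type names to Python types (a deliberately small subset).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_args(schema: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    for name, value in args.items():
        if name not in props:
            # Assumption: unknown parameters are treated as errors.
            problems.append(f"unknown parameter: {name}")
            continue
        expected = props[name].get("type")
        py_type = _TYPES.get(expected)
        if py_type is not None:
            # bool is a subclass of int in Python, so reject it for numeric types.
            bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
            if bool_as_number or not isinstance(value, py_type):
                problems.append(f"parameter {name} should be {expected}")
        allowed = props[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"parameter {name} must be one of {allowed}")
    return problems


# Hypothetical tool schema, for illustration only.
get_quote_schema = {
    "type": "object",
    "properties": {
        "symbol": {"type": "string"},
        "days": {"type": "integer"},
        "interval": {"type": "string", "enum": ["1d", "1w"]},
    },
    "required": ["symbol"],
}

print(validate_args(get_quote_schema, {"symbol": "ACME", "days": 5}))  # []
print(validate_args(get_quote_schema, {"days": "5"}))
# ['missing required parameter: symbol', 'parameter days should be integer']
```

Returning the full list of problems, instead of raising on the first one, lets the agent loop feed every issue back to the model in a single correction prompt.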

Agent Features

Memory

  • read-only operations (no writes in benchmark)
  • no persistent state changes allowed

Planning

  • single-turn multi-call planning (3–6 tool calls)
  • reference trajectories for diagnostics (not required for pass)
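One way to guard against premature stopping in single-turn, multi-call planning is to keep an explicit subgoal checklist and refuse to finalize the answer while any item is open. The sketch below is illustrative only: the `Subgoal` and `Plan` classes and the example subgoals are hypothetical and are not part of the benchmark harness.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subgoal:
    description: str
    done: bool = False


@dataclass
class Plan:
    subgoals: List[Subgoal] = field(default_factory=list)

    def mark_done(self, index: int) -> None:
        """Record that the subgoal at this position has been satisfied."""
        self.subgoals[index].done = True

    def pending(self) -> List[str]:
        """Descriptions of subgoals that still need a tool call or a check."""
        return [s.description for s in self.subgoals if not s.done]

    def ready_to_answer(self) -> bool:
        """True only when every enumerated subgoal is complete."""
        return not self.pending()


# Made-up task decomposition for illustration.
plan = Plan([
    Subgoal("Look up the repository's latest release"),
    Subgoal("Fetch the release date"),
    Subgoal("Compare the date with the stated deadline"),
])
plan.mark_done(0)
plan.mark_done(1)
print(plan.ready_to_answer())  # False
print(plan.pending())          # ['Compare the date with the stated deadline']
```

Logging `pending()` at the moment the agent stops gives a direct count of partial-completion errors to compare before and after adding the planning step.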

Tool Use

  • tool discovery under distractors
  • schema-grounded parameterization
  • cross-server orchestration
  • error recovery and retries
  • efficiency (call budgets logged, not enforced)
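Error recovery and call logging can be combined in one thin wrapper around a tool call. The sketch below assumes a tool is any Python callable that raises on failure; `CallLog`, `call_with_retry`, and `make_flaky` are hypothetical helpers, and the retry count and backoff are arbitrary defaults rather than values taken from the benchmark. Matching the "logged, not enforced" approach listed above, the log records every attempt without capping the total budget.

```python
import time
from typing import Any, Callable, Dict, List


class CallLog:
    """Records every attempt so call counts can be inspected after a run."""

    def __init__(self) -> None:
        self.attempts: List[Dict[str, Any]] = []


def call_with_retry(
    tool: Callable[..., Any],
    args: Dict[str, Any],
    log: CallLog,
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Any:
    """Call the tool, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Exception = RuntimeError("unreachable")
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
            log.attempts.append({"attempt": attempt, "ok": True})
            return result
        except Exception as exc:
            log.attempts.append({"attempt": attempt, "ok": False, "error": str(exc)})
            last_error = exc
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


def make_flaky(failures: int) -> Callable[[str], str]:
    """Build a stand-in tool that fails a fixed number of times, then succeeds."""
    state = {"left": failures}

    def tool(query: str) -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("transient failure")
        return f"result for {query}"

    return tool


log = CallLog()
print(call_with_retry(make_flaky(2), {"query": "x"}, log))  # result for x
print(len(log.attempts))                                    # 3
```

In a real agent it is usually better to retry only on errors known to be transient and to return parameter errors to the model for correction, since repeating a malformed call does not fix it.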

Frameworks

  • Model Context Protocol (MCP)

Is Agentic

true

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn design: does not score multi-turn recovery or clarification strategies.
  • Read-only tasks only: write operations and state mutations are excluded.
  • Efficiency not enforced: no strict call/latency budgets in primary scoring.
  • Real servers can change over time despite version pinning, affecting long-term reproducibility.
  • Claims judge still has edge cases (text normalization, paraphrase handling).

When Not To Use

  • If you need multi-turn dialog evaluation (clarify/recover over several messages).
  • If you must evaluate write/side-effect actions (creating records, sending messages).
  • If you need strict latency or call-budget constraints enforced in scoring.

Failure Modes

  • No tools called (agent fails to invoke available tools)
  • Incorrect tool selection (chooses wrong server or endpoint)
  • Incorrect tool parameters or schema violations
  • Premature stopping / partial task completion
  • Response synthesis errors (incorrectly combining tool outputs)
  • Logical/conditional errors (wrong branching decisions)

Core Entities

Models

  • Claude Opus 4.5
  • Gemini 3 Pro
  • GPT-5
  • Claude Sonnet 4.5
  • o3 Pro (high)
  • Claude Opus 4.1
  • Claude Sonnet 4
  • GLM-4.5 Air
  • Kimi K2 Instruct
  • Qwen3-235B-A22B
  • Gemini 2.5 Pro
  • GPT-4o
  • Gemini 2.5 Flash

Metrics

  • pass rate
  • mean coverage (claims coverage)
  • failure mode distribution
  • discovery precision/recall
  • parameter correctness
  • error recovery rate
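As an illustration of the discovery precision/recall metric, the sketch below compares the set of tools an agent called with a reference set for the task. The set-based definition, the function name `discovery_precision_recall`, and the tool names are assumptions; the summary above does not spell out how the benchmark computes this diagnostic.

```python
from typing import Iterable, Tuple


def discovery_precision_recall(
    called: Iterable[str], reference: Iterable[str]
) -> Tuple[float, float]:
    """Set-based precision and recall of called tools against reference tools."""
    called_set, reference_set = set(called), set(reference)
    hits = len(called_set & reference_set)
    precision = hits / len(called_set) if called_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall


# Hypothetical tool names; the second called tool is a distractor.
called = ["github.search_issues", "weather.get_forecast", "github.get_repo"]
reference = [
    "github.search_issues",
    "github.get_repo",
    "finance.get_quote",
    "calendar.list_events",
]

precision, recall = discovery_precision_recall(called, reference)
print(round(precision, 2), recall)  # 0.67 0.5
```

Low precision points to distractor tools being picked up, while low recall points to needed tools never being found, so the two numbers separate different kinds of tool-usage failure.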

Datasets

  • MCP-Atlas (1,000 tasks)
  • MCP-Atlas public subset (500 tasks)

Context Entities

Models

  • Gemini 2.5 Pro (used as automated judge)

Metrics

  • LLM-human agreement (78%) for judge reliability
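For teams that want to check judge reliability on their own annotations, raw percent agreement is the simplest measure to compute. The sketch below assumes per-claim boolean verdicts from the judge and from a human annotator; whether the 78% figure was computed this way is not stated in the summary above, and the function name and sample verdicts are made up.

```python
from typing import Sequence


def percent_agreement(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items on which the judge and the human gave the same verdict."""
    if len(judge) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not judge:
        raise ValueError("at least one verdict is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Made-up verdicts for five claims.
judge_verdicts = [True, True, False, True, False]
human_verdicts = [True, False, False, True, False]
print(percent_agreement(judge_verdicts, human_verdicts))  # 0.8
```

Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's kappa is worth reporting alongside it when verdicts are heavily skewed toward one label.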

Datasets

  • Leaderboards and server distribution in Appendix H

Benchmarks

  • Prior MCP benchmarks (MCP-Universe, MCPEval, MCP-Bench) compared in paper