Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
MCP-Atlas measures how reliably models can find and use production APIs. That matters when you want LLMs to automate multi-step tasks across services: the benchmark highlights that tool awareness and parameter correctness are the biggest practical risks.
Summary TLDR
MCP-Atlas is a 1,000-task benchmark that evaluates LLMs' real-world tool-use skills against 36 live MCP servers and 220 real tools. Tasks require 3–6 tool calls, often across multiple servers, and scoring is objective: a claims-based rubric gives partial credit for correct facts in the final answer. Top models reach ~50–62% pass rates; most failures come from not using the right tools or missing subgoals.
Problem Statement
Existing evaluations rely on mock servers, small task sets, or subjective judge models, so they fail to measure real-world multi-step tool orchestration. Practitioners need a large, reproducible, objective benchmark that uses real servers and provides targeted diagnostics.
Main Contribution
A large, realistic benchmark: 1,000 single-turn tasks over 36 real MCP servers and 220 tools, designed to elicit 3–6 tool calls and cross-server workflows.
Claims-based scoring: objective, partial-credit rubric that checks atomic factual claims in the final answer (trajectory-independent).
Public artifacts: a containerized evaluation harness, schemas, and a 500-task public subset; remaining 500 tasks held out for validation.
Rich diagnostics: internal metrics for discovery, parameter correctness, schema typing, error recovery, and efficiency to explain failures.
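The claims-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's harness: the `ClaimVerdict` type, the example claims, and the `threshold` value are all hypothetical, and in the real setup a judge model decides whether each claim is satisfied.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str       # one atomic factual claim from the task rubric
    satisfied: bool  # assumed to come from a judge; hard-coded here


def coverage(verdicts):
    """Fraction of rubric claims that the final answer satisfies."""
    if not verdicts:
        return 0.0
    return sum(v.satisfied for v in verdicts) / len(verdicts)


def passes(verdicts, threshold=0.75):
    # threshold is a hypothetical cutoff chosen for illustration only
    return coverage(verdicts) >= threshold


verdicts = [
    ClaimVerdict("Names the repository owner", True),
    ClaimVerdict("Reports the latest release tag", True),
    ClaimVerdict("States the release date", False),
    ClaimVerdict("Gives the open-issue count", True),
]
print(coverage(verdicts))  # 0.75
print(passes(verdicts))    # True
```

Because only the final answer is checked, two agents that take different tool-call paths receive the same score as long as they state the same facts.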
Key Findings
Top frontier models pass roughly 50–62% of tasks, which still leaves large gaps.
Tool-usage mistakes are the dominant failure mode.
Task-understanding errors (missing subgoals, premature stopping) are the second-largest cause of failure.
Claims-based automatic judging agrees with human judgments 78% of the time.
Failure patterns vary by domain; finance and coding are hardest.
Results
Pass Rate (best model)
Mean Coverage (best model)
Failure distribution (average over models)
Judge-human agreement
Who Should Care
What To Try In 7 Days
Run the 500-task public subset on your target models to get a quick, domain-relevant baseline.
Add lightweight schema checks and parameter validation layers before tool calls to cut 'incorrect parameter' failures.
Implement a small planning step that enumerates subgoals; measure reduction in 'partial completion' errors.
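A lightweight parameter check of the kind suggested above might look like the following. It is a sketch under stated assumptions: it handles only flat schemas with primitive types, the `validate_args` helper and the stock-price schema are hypothetical, and a production setup would more likely use a full JSON Schema validator.

```python
# Maps JSON Schema primitive type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(args, schema):
    """Return a list of problems; an empty list means the call looks safe to send."""
    problems = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    for name, value in args.items():
        if name not in props:
            problems.append(f"unknown parameter: {name}")
            continue
        expected = props[name].get("type")
        py_type = TYPE_MAP.get(expected) if isinstance(expected, str) else None
        if py_type is None:
            continue  # arrays, objects, and union types are not checked in this sketch
        # bool is a subclass of int in Python, so reject it explicitly for numeric types
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if bool_as_number or not isinstance(value, py_type):
            problems.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return problems


# Hypothetical input schema for an illustrative stock-price tool.
schema = {
    "type": "object",
    "properties": {"symbol": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["symbol", "days"],
}
print(validate_args({"symbol": "ACME", "days": "7"}, schema))
# ['days: expected integer, got str']
print(validate_args({"symbol": "ACME", "days": 7}, schema))
# []
```

Returning the problems to the model as text, instead of sending the malformed call, gives it a chance to correct the arguments before a server-side error occurs.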
Agent Features
Memory
- read-only operations (no writes in benchmark)
- no persistent state changes allowed
Planning
- single-turn multi-call planning (3–6 tool calls)
- reference trajectories for diagnostics (not required for pass)
Tool Use
- tool discovery under distractors
- schema-grounded parameterization
- cross-server orchestration
- error recovery and retries
- efficiency (call budgets logged, not enforced)
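Error recovery with logged (but not enforced) call counts could be wrapped around any tool call roughly as follows. The `call_with_retries` helper, the `FlakyTool` stand-in, and the backoff values are assumptions made for illustration; the benchmark harness may handle retries differently.

```python
import time


def call_with_retries(tool, args, max_attempts=3, base_delay=0.5, log=None):
    """Call tool(**args), retrying on any exception with exponential backoff.

    Every attempt is appended to `log`, so call counts can be reported
    afterwards without being enforced as a budget.
    """
    log = [] if log is None else log
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
        except Exception as exc:
            log.append({"attempt": attempt, "ok": False, "error": str(exc)})
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            log.append({"attempt": attempt, "ok": True})
            return result


class FlakyTool:
    """Hypothetical tool that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, city):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("upstream timeout")
        return {"city": city, "temp_c": 21}


call_log = []
result = call_with_retries(FlakyTool(failures=2), {"city": "Oslo"},
                           base_delay=0.0, log=call_log)
print(result)         # {'city': 'Oslo', 'temp_c': 21}
print(len(call_log))  # 3
```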
Frameworks
- Model Context Protocol (MCP)
Is Agentic
true
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-turn design: does not score multi-turn recovery or clarification strategies.
- Read-only tasks only: write operations and state mutations are excluded.
- Efficiency not enforced: no strict call/latency budgets in primary scoring.
- Real servers can change over time despite version pinning, affecting long-term reproducibility.
- Claims judge still has edge cases (text normalization, paraphrase handling).
When Not To Use
- If you need multi-turn dialog evaluation (clarify/recover over several messages).
- If you must evaluate write/side-effect actions (creating records, sending messages).
- If you need strict latency or call-budget constraints enforced in scoring.
Failure Modes
- No tools called (agent fails to invoke available tools)
- Incorrect tool selection (chooses wrong server or endpoint)
- Incorrect tool parameters or schema violations
- Premature stopping / partial task completion
- Response synthesis errors (incorrectly combining tool outputs)
- Logical/conditional errors (wrong branching decisions)
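To track these failure modes on your own runs, one option is to label each failed task and average the per-model distributions. The sketch below assumes a macro-average (each model weighted equally); this summary does not say how the benchmark computes its average, and the label strings and model names are hypothetical.

```python
from collections import Counter


def failure_distribution(labels):
    """Share of each failure mode among one model's failed tasks."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()} if total else {}


def average_over_models(per_model):
    """Macro-average: every model contributes equally, whatever its failure count."""
    dists = [failure_distribution(labels) for labels in per_model.values()]
    if not dists:
        return {}
    modes = sorted({mode for d in dists for mode in d})
    return {m: sum(d.get(m, 0.0) for d in dists) / len(dists) for m in modes}


per_model = {
    "model_a": ["wrong_params", "wrong_params", "premature_stop", "no_tools"],
    "model_b": ["wrong_params", "premature_stop"],
}
print(average_over_models(per_model))
# {'no_tools': 0.125, 'premature_stop': 0.375, 'wrong_params': 0.5}
```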
Core Entities
Models
- Claude Opus 4.5
- Gemini 3 Pro
- GPT-5
- Claude Sonnet 4.5
- o3 Pro (high)
- Claude Opus 4.1
- Claude Sonnet 4
- GLM-4.5 Air
- Kimi K2 Instruct
- Qwen3-235B-A22B
- Gemini 2.5 Pro
- GPT-4o
- Gemini 2.5 Flash
Metrics
- pass rate
- mean coverage (claims coverage)
- failure mode distribution
- discovery precision/recall
- parameter correctness
- error recovery rate
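Discovery precision and recall can be illustrated with a set-based comparison between the tools an agent called and the tools in a reference trajectory. The exact definition used by the benchmark is not given in this summary, so the function below is an assumed formulation, and the tool names are hypothetical.

```python
def discovery_precision_recall(called, reference):
    """Set-based precision and recall of called tools against reference tools."""
    called, reference = set(called), set(reference)
    hits = called & reference
    precision = len(hits) / len(called) if called else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


called = ["search_repos", "get_release", "get_weather"]
reference = ["search_repos", "get_release", "list_issues", "get_user"]
precision, recall = discovery_precision_recall(called, reference)
print(round(precision, 3))  # 0.667
print(recall)               # 0.5
```

Low precision suggests the agent was drawn to distractor tools, while low recall suggests it never found tools the task needed.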
Datasets
- MCP-Atlas (1,000 tasks)
- MCP-Atlas public subset (500 tasks)
Context Entities
Models
- Gemini 2.5 Pro (used as automated judge)
Metrics
- LLM-human agreement (78%) for judge reliability
Datasets
- Leaderboards and server distribution in Appendix H
Benchmarks
- Prior MCP benchmarks (MCP-Universe, MCPEval, MCP-Bench) compared in paper

