Overview
DSBench is a practical, execution‑based benchmark that exposes clear gaps between current agents and human performance. Use it to prioritize tooling (code execution, file access, long-context handling), and expect agent code and outputs to need debugging and human review.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 30%
Novelty: 40%
Why It Matters For Business
Realistic data‑science tasks show that current agents often fail or produce weak models. Businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.
Summary TLDR
DSBench is a realistic benchmark for data‑science agents built from real competition problems: 466 data‑analysis tasks (ModelOff) and 74 data‑modeling competitions (Kaggle). Tasks include long textual context, multimodal inputs (Excel, tables, images), multi‑table files, and end‑to‑end model building. Evaluation uses execution-based checks, plus a normalized Relative Performance Gap (RPG) for modeling. State‑of‑the‑art agents struggle: the best agent reaches 34.12% accuracy on analysis tasks and 34.74% RPG on modeling tasks, well below the human results of 64.06% and 65.02%, respectively.
Problem Statement
Existing data‑science benchmarks are simplified (short instructions, single modality, code-only tests) and do not reflect real workflows that need long context, multimodal files, code execution, multi-table reasoning, and end‑to‑end model building. DSBench fills that gap with competition-derived, execution-evaluated tasks.
Main Contribution
DSBench dataset: 466 data‑analysis tasks (ModelOff) + 74 data‑modeling competitions (Kaggle) with realistic files and multimodal context.
Relative Performance Gap (RPG): a normalized metric to compare heterogeneous modeling metrics across Kaggle tasks.
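The summary does not spell out the RPG formula. A minimal sketch, assuming RPG is the per-task ratio (agent score minus baseline score) / (best human score minus baseline score), clipped at zero and averaged over tasks; check this assumption against the paper before relying on it. The function name and arguments are hypothetical.

```python
from typing import Sequence


def relative_performance_gap(
    agent: Sequence[float],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Average normalized score across tasks (assumed RPG definition).

    Each task contributes (agent - baseline) / (best - baseline), clipped
    below at 0. The ratio is sign-safe, so it also works for metrics where
    lower is better (best < baseline), such as RMSE.
    """
    if not agent or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("need equal-length, non-empty score lists")
    total = 0.0
    for p, b, g in zip(agent, baseline, best):
        if g == b:
            raise ValueError("best and baseline scores must differ")
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent)
```

Normalizing per task is what lets an accuracy-scored competition and an RMSE-scored competition contribute to the same average.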
Key Findings
The top agent solves only about one third of data‑analysis questions (34.12% accuracy, versus 64.06% for humans).
Agent-built models close only about one third of the normalized performance gap (34.74% RPG, versus 65.02% for humans).
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 34.12% | Human 64.06% | -29.94 pp | 466 ModelOff questions | AutoGen (GPT-4o) achieved 34.12% task-level accuracy | Table 4 |
| Data modeling Relative Performance Gap (RPG) | 34.74% | Human RPG 65.02% | -30.28 pp | 74 Kaggle competitions | AutoGen (GPT-4o) RPG = 34.74% | Table 6 |
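The Delta column is a difference in percentage points (pp), not a relative change. A quick check of both rows, using a hypothetical helper `delta_pp`:

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 2)


accuracy_delta = delta_pp(34.12, 64.06)  # agent accuracy vs. human accuracy
rpg_delta = delta_pp(34.74, 65.02)       # agent RPG vs. human RPG
```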
What To Try In 7 Days
Run DSBench (subset) on your agent to measure real‑task gaps.
Add an execution environment (local shell or notebook) so agents can run and validate code.
Implement a short context‑summarizer to reduce long prompt load before agent reasoning.
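The execution-environment step can be sketched as a small harness that runs agent-generated code in a separate Python process and compares its printed answer with the expected one. This is an illustration only, not DSBench's evaluation code: the function names are hypothetical, and a plain subprocess is not a security sandbox, so untrusted code should run in a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_agent_code(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run code in a fresh Python process; return (succeeded, output or error)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "agent_solution.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep any files the code writes in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,    # stop runaway scripts
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def answer_matches(code: str, expected: str) -> bool:
    """Execution-based check: the code must run and print the expected answer."""
    ok, output = run_agent_code(code)
    return ok and output == expected
```

Returning the error text along with the failure flag lets the agent, or a human reviewer, see why a run failed and retry.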
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Tasks reflect competition formats (ModelOff / Kaggle) and may not cover all enterprise workflows.
Human baseline code was runnable for only 22 modeling competitions, limiting some human comparisons.
When Not To Use
When you only need small code-completion tests or checks of single-line API calls.
When you are evaluating pure language ability, with no file access or code execution.
Failure Modes
Misinterpretation of data fields (e.g., confusing an ID field with a code field).
Failure to identify or load the correct table or sheet from files.
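Both failure modes can be partly caught by a cheap pre-check before an agent's result is trusted. A minimal sketch using only the standard library follows; the function name and the column names in the usage note are hypothetical, and real DSBench files (Excel workbooks, multi-sheet data) would need a spreadsheet reader instead of `csv`.

```python
import csv
import io


def check_table(csv_text: str, required_columns: list[str], id_column: str) -> list[str]:
    """Return a list of problems found in a CSV table (empty list means OK)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # Wrong table or sheet: the columns the task needs are not present.
    problems = [f"missing column: {name}" for name in required_columns if name not in header]
    if id_column not in header:
        problems.append(f"missing id column: {id_column}")
        return problems
    # Wrong field used as the ID: a true identifier is non-empty and unique.
    ids = [(row[id_column] or "").strip() for row in reader]
    if any(value == "" for value in ids):
        problems.append(f"id column '{id_column}' has empty values")
    if len(set(ids)) != len(ids):
        problems.append(f"id column '{id_column}' has duplicate values")
    return problems
```

For example, if an agent treats a repeating `region_code` column as the row identifier instead of `customer_id`, the duplicate-value check flags it before any join or aggregation runs.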

