DSBench: a realistic benchmark testing data‑science agents on ModelOff and Kaggle tasks

September 12, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

DSBench is a practical, execution‑based benchmark that exposes clear gaps in current data‑science agents. Use it to prioritize tooling (execution, file access, long-context handling), but expect agent code and outputs to need debugging and human review.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 40%

Authors

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Realistic data science tasks expose that current agents often fail or produce weak models; businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.

Summary TLDR

DSBench is a realistic benchmark for data‑science agents built from real competition problems: 466 data‑analysis tasks (ModelOff) and 74 data‑modeling competitions (Kaggle). Tasks include long textual context, multimodal inputs (Excel, tables, images), multi‑table files, and end‑to‑end model building. Evaluation uses execution-based checks, plus a normalized Relative Performance Gap (RPG) for modeling. State‑of‑the‑art agents struggle: the best agent (AutoGen with GPT-4o) reaches 34.12% accuracy on analysis tasks and an RPG of 34.74% on modeling tasks, well below the human baselines of 64.06% accuracy and 65.02% RPG.

Problem Statement

Existing data‑science benchmarks are simplified (short instructions, single modality, code-only tests) and do not reflect real workflows that need long context, multimodal files, code execution, multi-table reasoning, and end‑to‑end model building. DSBench fills that gap with competition-derived, execution-evaluated tasks.

Main Contribution

DSBench dataset: 466 data‑analysis tasks (ModelOff) + 74 data‑modeling competitions (Kaggle) with realistic files and multimodal context.

Relative Performance Gap (RPG): a normalized metric to compare heterogeneous modeling metrics across Kaggle tasks.
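This summary does not spell out the RPG formula, so the following is only a sketch under a stated assumption: that RPG averages, over competitions, the clipped ratio max((p - b) / (g - b), 0), where p is the agent's score, b a baseline score, and g the best known score. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def relative_performance_gap(
    agent: Sequence[float],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Mean over competitions of max((p - b) / (g - b), 0).

    ASSUMPTION: this formula is our reading of "normalized gap";
    check it against the paper before relying on it.
    """
    if not agent or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, b, g in zip(agent, baseline, best):
        if g == b:
            raise ValueError("best and baseline scores must differ")
        # The ratio is sign-safe: for lower-is-better metrics (e.g. RMSE)
        # numerator and denominator are both negative when the agent improves.
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent)


# Toy scores (hypothetical): one accuracy-like task, one RMSE-like task.
toy_rpg = relative_performance_gap(
    agent=[0.80, 0.40],
    baseline=[0.50, 0.50],
    best=[0.90, 0.30],
)
print(f"RPG = {toy_rpg:.3f}")  # roughly 0.625
```

Because each competition is normalized by its own baseline-to-best range, tasks scored with different metrics can be averaged on one scale, which is the point of the metric as described above.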

Key Findings

Top agent solves only about one third of data‑analysis questions.

Numbers: Task-level accuracy 34.12% (AutoGen + GPT-4o)

Practical Use: Expect agents to fail most realistic analysis questions; add verification, human review, or tooling before using agents in production.

Evidence Ref: Table 4

Agent‑built models close only about one third of the gap to the best human scores.

Numbers: RPG 34.74% (AutoGen + GPT-4o) vs. human RPG 65.02%

Practical Use: Agent‑produced models are far weaker than human submissions on Kaggle-like tasks; use agent outputs as starting drafts, not final models.

Evidence Ref: Table 6

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 34.12% | Human 64.06% | -29.94 pp | 466 ModelOff questions | AutoGen (GPT-4o) achieved 34.12% task-level accuracy | Table 4 |
| Data modeling Relative Performance Gap (RPG) | 34.74% | Human RPG 65.02% | -30.28 pp | 74 Kaggle competitions | AutoGen (GPT-4o) RPG = 34.74% | Table 6 |
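The Delta values in these results are the agent's score minus the human baseline, in percentage points. A quick check of that arithmetic, using only the numbers reported here (the helper name `delta_pp` is ours):

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(value_pct - baseline_pct, 2)


# (agent value, human baseline), both in percent, as reported above.
results = {
    "analysis accuracy": (34.12, 64.06),
    "modeling RPG": (34.74, 65.02),
}

for name, (agent_pct, human_pct) in results.items():
    print(f"{name}: {delta_pp(agent_pct, human_pct):+.2f} pp")
```

Both gaps are about 30 percentage points, so the shortfall relative to humans is similar on the analysis and modeling tracks.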

What To Try In 7 Days

Run DSBench (subset) on your agent to measure real‑task gaps.

Add an execution environment (local shell or notebook) so agents can run and validate code.

Add a lightweight context summarizer that shortens long task prompts before the agent reasons over them.
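For the execution-environment step, a minimal sketch of running agent-generated Python in a separate interpreter with a timeout might look like the following. It is not how DSBench or AutoGen executes code; the function name `run_agent_code` and the result fields are hypothetical, and a subprocess is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_agent_code(code: str, timeout_s: float = 30.0) -> dict:
    """Run a Python snippet in a child interpreter and capture the outcome.

    NOTE: this isolates crashes and hangs only; untrusted code still needs
    a real sandbox (container, VM, restricted user).
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "agent_snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep files the snippet writes in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out", "returncode": None}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }


good = run_agent_code("print(2 + 2)")
bad = run_agent_code("raise ValueError('wrong column')")
print(good["ok"], good["stdout"].strip())
print(bad["ok"], bad["stderr"].strip().splitlines()[-1])
```

Returning stderr to the agent, rather than only a pass/fail flag, gives it the traceback it needs to attempt a fix on the next turn.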

Agent Features

Memory
Short-term multi-turn context
Planning
Multi-turn planning via AutoGen conversations
Tool Use
Local code execution (shell/Python), File system access (Excel/CSV), Notebook-style execution (Code Interpreter)
Frameworks
AutoGen, Code Interpreter, Jupyter AI
Is Agentic

Yes

Architectures
LLM-based agents (closed & open), LVLMs (vision + language models)
Collaboration
Multi-agent conversation orchestration (AutoGen)

Optimization Features

Token Efficiency
Context summarization recommended (the paper shows performance dropping with long context)
System Optimization
Accuracy
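As a stand-in for the context-summarization recommendation above, the sketch below uses plain truncation rather than a real summarizer: it keeps the start and end of a long task description within a character budget and drops the middle. The function name, the default budget, and the 60/40 split are all assumptions for illustration; an LLM-based or extractive summarizer would replace this in practice.

```python
def trim_context(text: str, max_chars: int = 4000, head_fraction: float = 0.6) -> str:
    """Keep the head and tail of `text` so the result fits in `max_chars`.

    ASSUMPTION: task instructions tend to sit at the start and the actual
    question at the end, so the middle is the cheapest part to drop.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        return text
    marker = "\n[... middle of context omitted ...]\n"
    budget = max_chars - len(marker)
    if budget <= 0:
        # Budget too small to fit the marker; fall back to a hard cut.
        return text[:max_chars]
    head_len = int(budget * head_fraction)
    tail_len = budget - head_len
    tail = text[-tail_len:] if tail_len > 0 else ""
    return text[:head_len] + marker + tail


long_task = "Background. " * 1000 + "Question: what is the total revenue in 2019?"
trimmed = trim_context(long_task, max_chars=1000)
print(len(long_task), "->", len(trimmed))
```

Truncation can silently remove a definition the question depends on, so any gain should be measured on a benchmark subset rather than assumed.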

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Tasks reflect competition formats (ModelOff / Kaggle) and may not cover all enterprise workflows.

Human baseline code was runnable for only 22 modeling competitions, limiting some human comparisons.

When Not To Use

When you only need small toy code completion tests or single-line API calls.

When evaluating pure language benchmarks without file execution.

Failure Modes

Misinterpretation of data fields (e.g., swapping ID vs code).

Failure to identify or load the correct table or sheet from files.
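Both failure modes can be partly guarded against by checking column names before any analysis runs. The sketch below is a hypothetical helper, not part of DSBench: it scans in-memory CSV tables and returns those whose header contains every column the question needs. The table names and contents are made-up examples.

```python
import csv
import io
from typing import Dict, List, Set


def find_tables_with_columns(tables: Dict[str, str], required: Set[str]) -> List[str]:
    """Return names of CSV tables whose header row has all required columns."""
    wanted = {col.strip().lower() for col in required}
    matches = []
    for name, csv_text in tables.items():
        header = next(csv.reader(io.StringIO(csv_text)), [])
        columns = {col.strip().lower() for col in header}
        if wanted <= columns:
            matches.append(name)
    return matches


# Hypothetical tables; in practice these would be sheets or files on disk.
toy_tables = {
    "customers": "customer_id,region\n1,EU\n2,US\n",
    "orders": "order_id,customer_id,product_code,amount\n10,1,A7,99.5\n",
}

print(find_tables_with_columns(toy_tables, {"customer_id", "product_code"}))
print(find_tables_with_columns(toy_tables, {"customer_id"}))
```

An empty result or more than one match is a useful signal to stop and ask for clarification instead of guessing which table or field the question refers to.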

Core Entities

Models

GPT-4o, GPT-4, GPT-3.5, GPT-4o mini, Llama3-8b, Llama3-70b, LLaVA, Gemini, Claude, AutoGen, Code Interpreter

Metrics

Accuracy, Relative Performance Gap (RPG), Task Success Rate, RMSLE, RMSE, ROC, Quadratic Weighted Kappa

Datasets

ModelOff, Kaggle

Benchmarks

DSBench