DSBench: a realistic benchmark testing data‑science agents on ModelOff and Kaggle tasks

Overview

Decision Snapshot: Needs Validation

DSBench is a practical, execution‑based benchmark showing clear gaps; use it to prioritize tooling (execution, file access, long-context handling), but expect agent code and outputs to need debugging and human review.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 40%

Authors

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Realistic data science tasks expose that current agents often fail or produce weak models; businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Engineering Lead

Summary TLDR

DSBench is a realistic benchmark for data‑science agents built from real competition problems: 466 data‑analysis tasks (ModelOff) and 74 data‑modeling competitions (Kaggle). Tasks include long textual context, multimodal inputs (Excel, tables, images), multi‑table files, and end‑to‑end model building. Evaluation uses execution-based checks, plus a normalized Relative Performance Gap (RPG) for modeling. State‑of‑the‑art agents struggle: the best agent (AutoGen with GPT-4o) achieves 34.12% accuracy on analysis tasks and 34.74% RPG on modeling tasks, well below the human results of 64.06% accuracy and 65.02% RPG.

Problem Statement

Existing data‑science benchmarks are simplified (short instructions, single modality, code-only tests) and do not reflect real workflows that need long context, multimodal files, code execution, multi-table reasoning, and end‑to‑end model building. DSBench fills that gap with competition-derived, execution-evaluated tasks.

Main Contribution

DSBench dataset: 466 data‑analysis tasks (ModelOff) + 74 data‑modeling competitions (Kaggle) with realistic files and multimodal context.

Relative Performance Gap (RPG): a normalized metric to compare heterogeneous modeling metrics across Kaggle tasks.
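This summary does not state the RPG formula, so the following is only a sketch under an assumed form: for each competition, take the fraction of the gap between a baseline score and the best human score that the agent closes, clip it at zero, and average over competitions. The function name, the argument names, and the example scores are all hypothetical.

```python
from typing import Sequence


def relative_performance_gap(
    agent: Sequence[float],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Mean fraction of the baseline-to-best gap closed by the agent.

    ASSUMED form (not confirmed by this summary):
        RPG = mean_i max((agent_i - baseline_i) / (best_i - baseline_i), 0)
    """
    if not agent or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, b, g in zip(agent, baseline, best):
        if g == b:
            raise ValueError("best and baseline scores must differ")
        # Clip at zero so an agent worse than the baseline scores 0 on that task.
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent)


# Hypothetical scores: task 1 is higher-is-better (accuracy),
# task 2 is lower-is-better (RMSE).
rpg = relative_performance_gap(
    agent=[0.80, 0.50],
    baseline=[0.60, 0.90],
    best=[1.00, 0.10],
)
print(f"{rpg:.2%}")  # 50.00%
```

Under this form, lower-is-better metrics need no special handling: numerator and denominator both change sign, so the ratio stays comparable across metrics such as accuracy and RMSE.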

Key Findings

Top agent solves only about one third of data‑analysis questions.

Numbers: Task-level accuracy 34.12% (AutoGen + GPT-4o)

Practical Use: Expect agents to fail most realistic analysis questions; add verification, human review, or tooling before using agents in production.

Evidence Ref: Table 4

Modeling solutions by agents reach roughly one third of the gap to human best scores.

Numbers: RPG 34.74% (AutoGen + GPT-4o) vs human RPG 65.02%

Practical Use: Agent‑produced models are far weaker than human submissions on Kaggle-like tasks; use agent outputs as starting drafts, not final models.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	34.12%	Human 64.06%	-29.94 pp	466 ModelOff questions	AutoGen (GPT-4o) achieved 34.12% task-level accuracy	Table 4
Data modeling Relative Performance Gap (RPG)	34.74%	Human RPG 65.02%	-30.28 pp	74 Kaggle competitions	AutoGen (GPT-4o) RPG = 34.74%	Table 6
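The Delta column is a difference in percentage points (agent minus human), not a relative change. A small check of that arithmetic, using only the values in the table above (the helper name `delta_pp` is illustrative):

```python
# (metric, agent %, human %) taken from the Results table.
rows = [
    ("Accuracy (ModelOff)", 34.12, 64.06),
    ("RPG (Kaggle)", 34.74, 65.02),
]


def delta_pp(agent_pct: float, human_pct: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(agent_pct - human_pct, 2)


for name, agent, human in rows:
    print(f"{name}: {delta_pp(agent, human):+.2f} pp")
# Accuracy (ModelOff): -29.94 pp
# RPG (Kaggle): -30.28 pp
```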

What To Try In 7 Days

Run DSBench (subset) on your agent to measure real‑task gaps.

Add an execution environment (local shell or notebook) so agents can run and validate code.

Implement a short context‑summarizer to reduce long prompt load before agent reasoning.
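For the execution-environment step, one possible starting point is a small harness that runs agent-written Python in a separate process and compares its printed answer with the expected one. This is a sketch, not DSBench's own evaluation code: the function names, the scratch-file name, and the exact-match comparison are assumptions, and a subprocess is not a security sandbox, so untrusted code would still need a container or similar isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run agent-written Python in a child process; return (ok, output)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "agent_solution.py"  # hypothetical file name
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep any files the code writes in the scratch dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def is_correct(code: str, expected: str) -> bool:
    """Exact-match check on stdout; real tasks may need a looser comparison."""
    ok, output = run_generated_code(code)
    return ok and output == expected


print(is_correct("print(2 + 3)", "5"))  # True
```

Returning the captured stderr on failure gives the agent (or a reviewer) something concrete to debug against on the next turn.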

Agent Features

Memory

Short-term multi-turn context

Planning

Multi-turn planning via AutoGen conversations

Tool Use

Local code execution (shell/Python), File system access (Excel/CSV), Notebook-style execution (Code Interpreter)

Frameworks

AutoGen, Code Interpreter, Jupyter AI

Is Agentic

Yes

Architectures

LLM-based agents (closed & open), LVLMs (vision + language models)

Collaboration

Multi-agent conversation orchestration (AutoGen)

Optimization Features

Token Efficiency

Context summarization is recommended (the paper shows a performance drop with long context)
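A true summarizer would usually call a language model. As a dependency-free stand-in, the sketch below swaps in a simpler technique: extractive trimming that keeps the paragraphs sharing the most words with the question, within a character budget. The function name, the word-overlap score, and the budget are illustrative assumptions, not something the paper prescribes.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trim_context(context: str, question: str, max_chars: int = 2000) -> str:
    """Keep the paragraphs most lexically similar to the question, in original order."""
    paragraphs = [p.strip() for p in context.split("\n\n") if p.strip()]
    q_words = _words(question)
    # Rank paragraph indices by word overlap with the question (ties: earlier first).
    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: (-len(q_words & _words(paragraphs[i])), i),
    )
    kept, used = [], 0
    for i in ranked:
        cost = len(paragraphs[i]) + 2  # +2 for the blank-line separator
        if used + cost > max_chars:
            continue  # skip paragraphs that would exceed the budget
        kept.append(i)
        used += cost
    return "\n\n".join(paragraphs[i] for i in sorted(kept))
```

Word overlap is a crude relevance signal; it can drop a paragraph that defines a term the question relies on, so trimmed prompts are worth spot-checking against the full context.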

System Optimization

Accuracy

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/LiqiangJing/DSBench

Data URLs

https://github.com/LiqiangJing/DSBench

Risks & Boundaries

Limitations

Tasks reflect competition formats (ModelOff / Kaggle) and may not cover all enterprise workflows.

Human baseline code was runnable for only 22 modeling competitions, limiting some human comparisons.

When Not To Use

When you only need small toy code completion tests or single-line API calls.

When evaluating pure language benchmarks without file execution.

Failure Modes

Misinterpretation of data fields (e.g., swapping ID vs code).

Failure to identify or load the correct table or sheet from files.
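Both failure modes can be partly guarded against by checking table headers before any analysis runs. The sketch below uses only the standard-library `csv` module on in-memory text. The table names, column names, and the rule "exactly one table must match" are hypothetical choices for illustration.

```python
import csv
import io


def find_table(tables: dict[str, str], required: set[str]) -> str:
    """Return the name of the single table whose header has all required columns.

    `required` is expected in lowercase; headers are lowercased before matching.
    """
    matches = []
    for name, text in tables.items():
        header = next(csv.reader(io.StringIO(text)), [])
        columns = {c.strip().lower() for c in header}
        if required <= columns:
            matches.append(name)
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong table.
        raise LookupError(f"expected exactly one matching table, found {matches}")
    return matches[0]


# Hypothetical two-table input.
tables = {
    "customers": "customer_id,region_code\n1,N\n",
    "orders": "order_id,customer_id,amount\n10,1,9.5\n",
}
print(find_table(tables, {"order_id", "amount"}))  # orders
```

Requiring columns by explicit name also makes an ID-versus-code mix-up surface as an error at load time rather than as a plausible but wrong answer.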

Core Entities

Models

GPT-4o, GPT-4, GPT-3.5, GPT-4o mini, Llama3-8b, Llama3-70b, LLaVA, Gemini, Claude, AutoGen, Code Interpreter

Metrics

Accuracy, Relative Performance Gap (RPG), Task Success Rate, RMSLE, RMSE, ROC, Quadratic Weighted Kappa

Datasets

ModelOff, Kaggle

Benchmarks

DSBench
