DSBench: a realistic benchmark testing data‑science agents on ModelOff and Kaggle tasks

September 12, 2024 · 7 min

Overview

Production Readiness

0.3

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

1

Authors

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

Why It Matters For Business

Realistic data‑science tasks show that current agents often fail or produce weak models; businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.

Summary TLDR

DSBench is a realistic benchmark for data‑science agents built from real competition problems: 466 data‑analysis tasks (ModelOff) and 74 data‑modeling competitions (Kaggle). Tasks include long textual context, multimodal inputs (Excel, tables, images), multi‑table files, and end‑to‑end model building. Evaluation is execution-based, and modeling tasks are additionally scored with a normalized Relative Performance Gap (RPG). State‑of‑the‑art agents struggle: the best agent reaches 34.12% accuracy on analysis tasks and 34.74% RPG on modeling tasks, well below human performance.

Problem Statement

Existing data‑science benchmarks are simplified (short instructions, single modality, code-only tests) and do not reflect real workflows that need long context, multimodal files, code execution, multi-table reasoning, and end‑to‑end model building. DSBench fills that gap with competition-derived, execution-evaluated tasks.

Main Contribution

DSBench dataset: 466 data‑analysis tasks (ModelOff) + 74 data‑modeling competitions (Kaggle) with realistic files and multimodal context.

Relative Performance Gap (RPG): a normalized metric to compare heterogeneous modeling metrics across Kaggle tasks.

Comprehensive execution-based evaluation of LLMs, LVLMs and agent systems (e.g., AutoGen and Code Interpreter), showing large gaps to human performance; dataset and code released on GitHub.
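
The Relative Performance Gap listed above can be illustrated with a short sketch. This summary does not state the formula, so the form used here is an assumption: each competition's agent score is rescaled between a baseline score and the best human score, negative values are clipped to zero, and the per-competition values are averaged. The function name `relative_performance_gap` and its arguments are hypothetical.

```python
from typing import Optional, Sequence


def relative_performance_gap(
    agent: Sequence[Optional[float]],
    baseline: Sequence[float],
    best: Sequence[float],
) -> float:
    """Average, over competitions, of how much of the baseline-to-best gap is closed.

    ASSUMED form (not taken from this summary):
        mean over i of max((agent_i - baseline_i) / (best_i - baseline_i), 0)
    A score of None means the agent produced no valid submission and counts as 0.
    """
    if len(agent) == 0 or not (len(agent) == len(baseline) == len(best)):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for p, b, g in zip(agent, baseline, best):
        if g == b:
            raise ValueError("best and baseline scores must differ")
        if p is None:
            continue  # failed task contributes 0 to the sum
        # The ratio also works for lower-is-better metrics (e.g. RMSE):
        # there both numerator and denominator are negative when the agent improves.
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent)


if __name__ == "__main__":
    # Three made-up competitions: accuracy-like, error-like, and a failed run.
    score = relative_performance_gap(
        agent=[0.8, 0.30, None],
        baseline=[0.5, 0.50, 0.5],
        best=[1.0, 0.10, 0.9],
    )
    print(f"RPG = {score:.4f}")
```

Rescaling each task against its own baseline and best score is what lets metrics with different ranges and directions be averaged into one number.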

Key Findings

Top agent solves only about one third of data‑analysis questions.

Numbers: Task-level accuracy 34.12% (AutoGen + GPT-4o)

On modeling tasks, agent solutions close only about one third of the gap to the best human scores.

Numbers: RPG 34.74% (AutoGen + GPT-4o) vs. human RPG 65.02%

Agent frameworks that provide execution and multi‑turn orchestration improve results but cost time and money.

Numbers: AutoGen + GPT-4o reaches 34.12% accuracy, with higher runtime and cost than model-only variants

Long input context hurts performance on data‑analysis tasks.

Numbers: Performance drops as total input length increases (Figure 5)

Modeling task completion rate is not strongly correlated with input length.

Numbers: No clear trend of task success vs. input length (Figure 6)

Results

Data analysis accuracy

Value: 34.12%

Baseline: Human 64.06%

Data modeling Relative Performance Gap (RPG)

Value: 34.74%

Baseline: Human RPG 65.02%

Modeling task success rate

Value: 87.84%

Baseline: AutoGen variants and humans

What To Try In 7 Days

Run DSBench (subset) on your agent to measure real‑task gaps.

Add an execution environment (local shell or notebook) so agents can run and validate code.

Implement a short context summarizer to shrink long inputs before the agent reasons over them.
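
The execution-environment step above could start as small as the sketch below, which writes agent-generated Python to a temporary directory and runs it in a separate interpreter with a timeout. The helper name `run_generated_code` and the returned dictionary shape are illustrative assumptions, and a subprocess is not a security sandbox: untrusted code should run inside a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_s: float = 60.0) -> dict:
    """Run a code string in a fresh Python process and capture its output.

    Hypothetical helper for illustration; it isolates the working directory
    but does NOT restrict what the code can do on the machine.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "agent_script.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,            # relative file writes land in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out"}
        return {
            "ok": proc.returncode == 0,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        }


if __name__ == "__main__":
    result = run_generated_code("print(6 * 7)")
    print(result["ok"], result["stdout"].strip())
```

Feeding `stderr` back to the agent on failure gives it a chance to repair its own code in the next turn.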

Agent Features

Memory

  • Short-term multi-turn context

Planning

  • Multi-turn planning via AutoGen conversations

Tool Use

  • Local code execution (shell/Python)
  • File system access (Excel/CSV)
  • Notebook-style execution (Code Interpreter)

Frameworks

  • AutoGen
  • Code Interpreter
  • Jupyter AI

Is Agentic

true

Architectures

  • LLM-based agents (closed & open)
  • LVLMs (vision + language models)

Collaboration

  • Multi-agent conversation orchestration (AutoGen)

Optimization Features

Token Efficiency

  • Context summarization recommended (paper shows drop with long context)
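
As a rough illustration of reducing input length, the sketch below keeps the head and tail of an over-long context and drops the middle. This is plain truncation used as a crude stand-in for summarization, not a method from the paper; the function name `shrink_context`, the character budget, and the marker text are all assumptions.

```python
def shrink_context(
    text: str,
    max_chars: int = 4000,
    marker: str = "\n[... middle omitted ...]\n",
) -> str:
    """Return text unchanged if it fits, else its head and tail joined by a marker.

    Hypothetical helper: a real pipeline might call a summarization model
    instead of dropping the middle of the input.
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    if keep <= 0:
        raise ValueError("max_chars must be larger than the marker length")
    head_len = (keep + 1) // 2
    tail_len = keep - head_len
    head = text[:head_len]
    tail = text[len(text) - tail_len:]  # safe even when tail_len is 0
    return head + marker + tail


if __name__ == "__main__":
    long_text = "intro " * 2000 + "question: what is the total revenue?"
    short = shrink_context(long_text, max_chars=200)
    print(len(short))
    print(short[-40:])
```

Keeping the tail matters when the actual question sits at the end of a long task description.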

System Optimization

  • Accuracy

Reproducibility

Code Available: yes (released on GitHub)

Data Available: yes (released on GitHub)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Tasks reflect competition formats (ModelOff / Kaggle) and may not cover all enterprise workflows.
  • Human baseline code was runnable for only 22 modeling competitions, limiting some human comparisons.
  • Semantic correctness is judged by an LLM; although the sampled human checks agreed with the judge in every case, judge bias remains possible.

When Not To Use

  • When you only need small toy code completion tests or single-line API calls.
  • When evaluating pure language benchmarks without file execution.
  • When you require specialized proprietary tool integration not represented in DSBench.

Failure Modes

  • Misinterpretation of data fields (e.g., confusing an ID field with a code field).
  • Failure to identify or load the correct table or sheet from files.
  • Incorrect problem‑solving strategy or formula selection leading to wrong outputs.
  • Generated code not executed or failing to produce submission files.
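
The last failure mode above (no usable submission file) is cheap to detect before scoring. The sketch below checks that a CSV submission exists, has the expected header, and optionally has the expected number of data rows. The function name `check_submission` and the column names in the example are hypothetical, not part of DSBench.

```python
import csv
from pathlib import Path
from typing import List, Optional, Sequence


def check_submission(
    path: str,
    expected_columns: Sequence[str],
    expected_rows: Optional[int] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means the file looks usable."""
    file_path = Path(path)
    if not file_path.is_file():
        return [f"missing submission file: {file_path}"]

    with file_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        n_rows = sum(1 for _ in reader)  # data rows after the header

    if header is None:
        return ["submission file is empty"]

    problems: List[str] = []
    if list(header) != list(expected_columns):
        problems.append(f"unexpected header: {header}")
    if expected_rows is not None and n_rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {n_rows}")
    return problems


if __name__ == "__main__":
    # Example with made-up column names.
    Path("submission.csv").write_text("id,prediction\n1,0.2\n2,0.9\n", encoding="utf-8")
    print(check_submission("submission.csv", ["id", "prediction"], expected_rows=2))
```

A non-empty result can be logged as a failed task rather than passed on to metric computation.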

Core Entities

Models

  • GPT-4o
  • GPT-4
  • GPT-3.5
  • GPT-4o mini
  • Llama3-8b
  • Llama3-70b
  • LLaVA
  • Gemini
  • Claude
  • AutoGen (agent system)
  • Code Interpreter (agent system)

Metrics

  • Accuracy
  • Relative Performance Gap (RPG)
  • Task Success Rate
  • RMSLE
  • RMSE
  • ROC
  • Quadratic Weighted Kappa

Datasets

  • ModelOff
  • Kaggle

Benchmarks

  • DSBench