Overview
Production Readiness
0.3
Novelty Score
0.4
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
On realistic data science tasks, current agents often fail outright or produce weak models. Businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.
Summary TLDR
DSBench is a realistic benchmark for data‑science agents built from real competition problems: 466 data‑analysis tasks (ModelOff) and 74 data‑modeling competitions (Kaggle). Tasks include long textual context, multimodal inputs (Excel, tables, images), multi‑table files, and end‑to‑end model building. Evaluations use execution-based checks plus a normalized Relative Performance Gap (RPG) for modeling. State‑of‑the‑art agents struggle: the best agent achieves 34.12% accuracy on analysis tasks and 34.74% RPG on modeling tasks, both well below human performance.
Problem Statement
Existing data‑science benchmarks are simplified (short instructions, single modality, code-only tests) and do not reflect real workflows that need long context, multimodal files, code execution, multi-table reasoning, and end‑to‑end model building. DSBench fills that gap with competition-derived, execution-evaluated tasks.
Main Contribution
DSBench dataset: 466 data‑analysis tasks (ModelOff) + 74 data‑modeling competitions (Kaggle) with realistic files and multimodal context.
Relative Performance Gap (RPG): a normalized metric to compare heterogeneous modeling metrics across Kaggle tasks.
Comprehensive execution-based evaluation of LLMs, LVLMs and agent systems (e.g., AutoGen and Code Interpreter), showing large gaps to human performance; dataset and code released on GitHub.
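The summary does not spell out how RPG is computed, so the following is only a minimal sketch under an assumed definition: for each competition, the agent's score is rescaled so that a baseline submission maps to 0 and the best human score maps to 1, negative values are clipped to 0, and the results are averaged. The function name, the argument names, and the convention that a failed task (`None`) contributes 0 are illustrative assumptions, not the authors' code.

```python
from typing import Optional, Sequence


def relative_performance_gap(
    agent_scores: Sequence[Optional[float]],
    baseline_scores: Sequence[float],
    best_scores: Sequence[float],
) -> float:
    """Average normalized score across tasks (assumed RPG definition).

    Each task contributes max((p - b) / (g - b), 0), where p is the agent
    score, b the baseline score and g the best human score. Because b and g
    fix the direction, the same formula works for higher-is-better metrics
    (accuracy) and lower-is-better metrics (RMSE, RMSLE).
    """
    if not (len(agent_scores) == len(baseline_scores) == len(best_scores)):
        raise ValueError("all score lists must have the same length")
    if len(agent_scores) == 0:
        raise ValueError("at least one task is required")

    total = 0.0
    for p, b, g in zip(agent_scores, baseline_scores, best_scores):
        if p is None:
            # Assumption: a task with no valid submission contributes 0.
            continue
        if g == b:
            raise ValueError("best and baseline scores must differ")
        total += max((p - b) / (g - b), 0.0)
    return total / len(agent_scores)


# Hypothetical scores: an accuracy task, an RMSE task, and a failed task.
rpg = relative_performance_gap(
    agent_scores=[0.80, 0.60, None],
    baseline_scores=[0.50, 1.00, 0.00],
    best_scores=[0.90, 0.20, 1.00],
)
print(f"RPG = {rpg:.2%}")
```

Normalizing against a per-task baseline and best score is what lets heterogeneous metrics such as RMSE and accuracy be averaged into a single number.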
Key Findings
The top agent solves only about one third of data‑analysis questions (34.12% accuracy).
On modeling tasks, the best agent closes only about one third of the gap to the best human scores (34.74% RPG).
Agent frameworks that provide code execution and multi‑turn orchestration improve results, but they add time and monetary cost.
Long input context hurts performance on data‑analysis tasks.
Modeling task completion rate is not strongly correlated with input length.
Results
Accuracy
Data modeling Relative Performance Gap (RPG)
Modeling task success rate
Who Should Care
What To Try In 7 Days
Run DSBench (subset) on your agent to measure real‑task gaps.
Add an execution environment (local shell or notebook) so agents can run and validate code.
Implement a short context‑summarizer to reduce long prompt load before agent reasoning.
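The execution-environment step above can be sketched as a small harness that runs agent-generated code in a separate process and checks that it actually produced a submission file. This is a hedged illustration only: `run_generated_code`, the `submission.csv` file name, and the timeout are assumptions made for the example, not part of DSBench, and a subprocess is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(
    code: str,
    expected_output: str = "submission.csv",  # hypothetical file name
    timeout_s: float = 60.0,
) -> dict:
    """Run agent-generated Python in a scratch directory and verify its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "agent_script.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ran": False, "produced_output": False, "stderr": "timed out"}
        # The file must be checked before the scratch directory is deleted.
        produced = (Path(workdir) / expected_output).is_file()
        return {
            "ran": proc.returncode == 0,
            "produced_output": produced,
            "stderr": proc.stderr,
        }


# A stand-in for agent output that writes a minimal submission file.
result = run_generated_code("open('submission.csv', 'w').write('id,target')")
print(result["ran"], result["produced_output"])
```

Recording both the exit status and the presence of the output file separates code that crashed from code that ran but never wrote a submission.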
Agent Features
Memory
- Short-term multi-turn context
Planning
- Multi-turn planning via AutoGen conversations
Tool Use
- Local code execution (shell/Python)
- File system access (Excel/CSV)
- Notebook-style execution (Code Interpreter)
Frameworks
- AutoGen
- Code Interpreter
- Jupyter AI
Is Agentic
true
Architectures
- LLM-based agents (closed & open)
- LVLMs (vision + language models)
Collaboration
- Multi-agent conversation orchestration (AutoGen)
Optimization Features
Token Efficiency
- Context summarization is recommended (the paper reports lower data‑analysis accuracy with long input context)
System Optimization
- Accuracy
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Tasks reflect competition formats (ModelOff / Kaggle) and may not cover all enterprise workflows.
- Human baseline code was runnable for only 22 modeling competitions, limiting some human comparisons.
- The semantic-correctness judge is an LLM; sampled human checks agreed with it in every case, but judge bias remains possible.
When Not To Use
- When you only need small toy code completion tests or single-line API calls.
- When evaluating pure language benchmarks without file execution.
- When you require specialized proprietary tool integration not represented in DSBench.
Failure Modes
- Misinterpretation of data fields (e.g., confusing an ID field with a code field).
- Failure to identify or load the correct table or sheet from files.
- Incorrect problem‑solving strategy or formula selection leading to wrong outputs.
- Generated code not executed or failing to produce submission files.
Core Entities
Models
- GPT-4o
- GPT-4
- GPT-3.5
- GPT-4o mini
- Llama3-8b
- Llama3-70b
- LLaVA
- Gemini
- Claude
- AutoGen
- Code Interpreter
Metrics
- Accuracy
- Relative Performance Gap (RPG)
- Task Success Rate
- RMSLE
- RMSE
- ROC
- Quadratic Weighted Kappa
Datasets
- ModelOff
- Kaggle
Benchmarks
- DSBench

