SheetCopilot: turn natural language into step-by-step spreadsheet actions using LLMs

Overview

Decision Snapshot: Needs Validation

The system demonstrates a clear, reproducible improvement over a VBA baseline on an execution-focused benchmark, but correctness is task-dependent and token costs limit long-horizon use.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SheetCopilot lets non-technical users automate many spreadsheet tasks by describing them in plain English, lowering manual work and reducing mistakes. It still needs human verification for critical data, because full correctness (Pass@1) is about 44% on the evaluated tasks.

Who Should Care

Product Manager, ML Engineer, Data Scientist

Summary TLDR

SheetCopilot is an LLM-driven agent that converts plain-English spreadsheet requests into executable, step-by-step spreadsheet operations. It defines a compact set of "atomic actions" (virtual APIs), uses a state-machine loop (observe → propose → revise → act) to plan and fix steps, and adds external action documentation to reduce hallucination. The authors publish a 221-task benchmark and an automated execution-based evaluation. On the full 221-task set, GPT-3.5-Turbo with SheetCopilot executes plans without runtime errors 87.3% of the time and produces fully correct final sheets 44.3% of the time, beating a VBA-code baseline (Exec@1 77.8%, Pass@1 16.3%).
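
The observe, propose, revise, act loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `Sheet`, the single `Write` action, and the stub planner functions are all hypothetical stand-ins, and a real system would call an LLM where `stub_propose` and `stub_revise` appear.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:
    """Toy in-memory sheet mapping cell addresses such as 'A1' to values."""
    cells: dict = field(default_factory=dict)


KNOWN_ACTIONS = {"Write"}  # hypothetical action set for this sketch


def observe(sheet):
    """Summarize the current sheet state as text for the planner."""
    return ", ".join(f"{k}={v}" for k, v in sorted(sheet.cells.items())) or "empty"


def validate(step):
    """Return an error message if the step is not a known action, else None."""
    return None if step[0] in KNOWN_ACTIONS else f"unknown action: {step[0]}"


def act(sheet, step):
    """Execute one atomic action; raise ValueError if it is still invalid."""
    name, *args = step
    if name == "Write" and len(args) == 2:
        sheet.cells[args[0]] = args[1]
    else:
        raise ValueError(f"cannot execute step: {step}")


def run_task(sheet, propose, revise, max_steps=10):
    """Loop observe -> propose -> revise (if needed) -> act until propose returns None."""
    log = []
    for _ in range(max_steps):
        state = observe(sheet)
        step = propose(state)
        if step is None:
            break
        error = validate(step)
        if error is not None:
            step = revise(step, error)
        act(sheet, step)
        log.append(step)
    return log


def stub_propose(state):
    """Stand-in for the LLM planner; deliberately misspells the action name."""
    return ("Wrte", "A1", 42) if "A1=" not in state else None


def stub_revise(step, error):
    """Stand-in for the LLM revision call; repairs the action name."""
    return ("Write", *step[1:])


sheet = Sheet()
log = run_task(sheet, stub_propose, stub_revise)
```

The revision step sits between proposal and execution, so a malformed step is repaired before it touches the sheet; a step that is still invalid after revision raises an error rather than being silently skipped.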

Problem Statement

LLMs can reason in language but struggle to reliably control complex software. We lack a standardized interface, a robust planner that handles multi-step stateful edits, and a reproducible benchmark to measure how well LLMs actually accomplish real spreadsheet tasks.

Main Contribution

SheetCopilot agent: a prompt+state-machine system that turns natural-language spreadsheet requests into sequences of atomic spreadsheet actions.

A public benchmark and evaluation pipeline: 221 curated, realistic spreadsheet tasks (from SuperUser) and an automated execution-based correctness check.

Key Findings

High execution but moderate full correctness for GPT-3.5-Turbo with SheetCopilot.

Numbers: Exec@1 = 87.3%, Pass@1 = 44.3% (full 221 tasks)

Practical Use: Expect most LLM plans to run (few runtime errors), but verify final outputs: more than half of the tasks (about 56%) did not end in a fully correct sheet.

Evidence Ref: Table 1; Sec. 5.2

SheetCopilot outperforms a VBA-code-generation baseline on both stability and correctness.

Numbers: Exec@1 +9.5 pts, Pass@1 +28.0 pts vs VBA (87.3% vs 77.8%; 44.3% vs 16.3%)

Practical Use: Prompting LLMs with atomic actions and a state machine is a better practical route than generating VBA for cross-platform spreadsheet automation.

Evidence Ref: Table 1; Sec. 5.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exec@1 (GPT-3.5-Turbo, full 221 tasks)	87.3%	VBA 77.8%	+9.5 pts vs VBA	full core set (221 tasks)	Table 1 reports Exec@1 = 87.3% for GPT-3.5-Turbo and 77.8% for VBA	Sec. 5.2; Table 1
Pass@1 (GPT-3.5-Turbo, full 221 tasks)	44.3%	VBA 16.3%	+28.0 pts vs VBA	full core set (221 tasks)	Table 1 reports Pass@1 = 44.3% for GPT-3.5-Turbo and 16.3% for VBA	Sec. 5.2; Table 1
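
A small sketch of how these two metrics and the percentage-point deltas could be computed from per-task outcomes. The record fields `executed` and `correct` and the four toy records are assumptions made for illustration; only the percentages passed to `delta_points` come from the table above.

```python
def exec_at_1(results):
    """Share of tasks whose plan ran without a runtime error."""
    return sum(1 for r in results if r["executed"]) / len(results)


def pass_at_1(results):
    """Share of tasks whose plan ran and produced a fully correct sheet."""
    return sum(1 for r in results if r["executed"] and r["correct"]) / len(results)


def delta_points(system_pct, baseline_pct):
    """Difference in percentage points, rounded to one decimal place."""
    return round(system_pct - baseline_pct, 1)


# Hypothetical per-task outcomes, not data from the benchmark.
toy_results = [
    {"executed": True, "correct": True},
    {"executed": True, "correct": False},
    {"executed": False, "correct": False},
    {"executed": True, "correct": True},
]
toy_exec = exec_at_1(toy_results)
toy_pass = pass_at_1(toy_results)

# Reported figures: SheetCopilot (GPT-3.5-Turbo) vs the VBA baseline.
exec_delta = delta_points(87.3, 77.8)
pass_delta = delta_points(44.3, 16.3)
```

Because a plan that fails at runtime cannot yield a correct sheet, Pass@1 can never exceed Exec@1 under this definition.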

What To Try In 7 Days

Prompt an LLM with a small set of atomic actions and an observe-propose-revise loop on 10 common spreadsheet tasks.

Add an external API-style doc for your spreadsheet actions and measure Exec@1 vs a code-generation baseline.

Move a few repetitive spreadsheet workflows (filters, chart creation, simple formulas) into step-by-step SheetCopilot-style prompts and have a human verify outputs.
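
One possible starting point for the first two experiments is a small registry of atomic actions whose docstrings double as the API-style documentation given to the model. The action names `Write` and `Clear`, their signatures, and the dict-based sheet are hypothetical and are not the paper's action set.

```python
import inspect

ACTIONS = {}  # maps action name to the function implementing it


def atomic_action(fn):
    """Register a function as an atomic action under its own name."""
    ACTIONS[fn.__name__] = fn
    return fn


@atomic_action
def Write(sheet: dict, cell: str, value) -> None:
    """Write a value or formula string into a single cell."""
    sheet[cell] = value


@atomic_action
def Clear(sheet: dict, cell: str) -> None:
    """Remove the content of a single cell if it exists."""
    sheet.pop(cell, None)


def action_docs() -> str:
    """Render one documentation line per action, suitable for a prompt."""
    lines = []
    for name, fn in sorted(ACTIONS.items()):
        lines.append(f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}")
    return "\n".join(lines)


sheet = {}
ACTIONS["Write"](sheet, "B2", "=A2*2")
docs = action_docs()
```

Keeping the documentation next to the implementation means the prompt text cannot drift out of sync with what each action actually does.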

Agent Features

Memory

short-term sheet-state feedback per step (observed state)

Planning

stepwise task decomposition; closed-loop re-planning on error feedback

Tool Use

atomic actions (virtual APIs); external action documentation retrieval

Frameworks

in-context learning prompts; API-doc retrieval for revision

Is Agentic

Yes

Architectures

state-machine planner (observe-propose-revise-act)

Collaboration

human-in-the-loop verification encouraged

Optimization Features

Token Efficiency

Not optimized; paper notes high token use and short-horizon limits

System Optimization

Use of concise atomic actions to fit the LLM context window

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://sheetcopilot.github.io/

Data URLs

https://sheetcopilot.github.io/

Risks & Boundaries

Limitations

High token consumption; not optimized for long-horizon or very large sheets.

State feedback omits chart/pivot/filter state, so the agent can recreate items it already made.
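
To make the second limitation concrete, the sketch below shows a state description that also lists created objects, plus a guard that refuses to create the same object twice. This is a hypothetical mitigation for illustration only; it is not described in the source, and the function names and the `(kind, name)` object records are assumptions.

```python
def describe_state(cells, objects):
    """Build the state text; `objects` lists charts, pivots, and filters made so far."""
    cell_part = ", ".join(f"{k}={v}" for k, v in sorted(cells.items())) or "no cells"
    obj_part = ", ".join(f"{kind}:{name}" for kind, name in objects) or "none"
    return f"Cells: {cell_part}\nObjects: {obj_part}"


def create_once(objects, kind, name):
    """Record a new object unless an identical one exists; return True if created."""
    if (kind, name) in objects:
        return False
    objects.append((kind, name))
    return True


objects = []
first = create_once(objects, "chart", "Sales")
second = create_once(objects, "chart", "Sales")  # duplicate request is rejected
state_text = describe_state({"A1": 1}, objects)
```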

When Not To Use

For fully hands-off processing of sensitive financial or legal spreadsheets without human review.

For very long multi-step automations that exceed available model context tokens.

Failure Modes

Wrong formulas (common): incorrect formula structure or missing absolute references.

Wrong ranges: selecting incomplete or wrong copy/auto-fill ranges.
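
A reviewer checking for the first failure mode could flag formulas that contain references without full `$` anchoring. The helper below is a rough heuristic sketch, not part of SheetCopilot: the function name is hypothetical, it only understands A1-style references, and it will misread function names that end in digits (for example `LOG10`) as cell references.

```python
import re

# A1-style reference with optional '$' before the column and before the row.
CELL_REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)(\d+)")


def relative_refs(formula):
    """Return the references in a formula that are not fully absolute ($COL$ROW)."""
    found = []
    for match in CELL_REF.finditer(formula):
        col_abs, _col, row_abs, _row = match.groups()
        if not (col_abs and row_abs):
            found.append(match.group(0))
    return found


share_formula = "=B2/$B$10"          # B2 should shift on auto-fill, $B$10 should not
flagged = relative_refs(share_formula)
range_flagged = relative_refs("=SUM(A1:A5)/$C$1")
```

A flagged reference is not necessarily wrong, since relative references are often intended; the list only tells a human verifier where to look when an auto-filled formula drifts.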

Core Entities

Models

GPT-3.5-Turbo, GPT-4, Claude v1, VBA-based method (baseline)

Metrics

Exec@1, Pass@1, A50, A90

Datasets

SheetCopilot core set (221 tasks); 10% evaluation subset (20 tasks); seed collection from SuperUser (~13.5k filtered Q&A pairs)

Benchmarks

SheetCopilot spreadsheet control benchmark

