SheetCopilot: turn natural language into step-by-step spreadsheet actions using LLMs

May 30, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The system demonstrates a clear, reproducible improvement over a VBA baseline on an execution-focused benchmark, but correctness is task-dependent and token costs limit long-horizon use.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SheetCopilot lets non-technical users automate many spreadsheet tasks with plain-English requests, lowering manual work and reducing mistakes. It still needs human verification for critical data, because only about 44% of evaluated tasks end with a fully correct sheet (Pass@1 = 44.3%).

Summary TLDR

SheetCopilot is an LLM-driven agent that converts plain-English spreadsheet requests into executable, step-by-step spreadsheet operations. It defines a compact set of "atomic actions" (virtual APIs), uses a state-machine loop (observe → propose → revise → act) to plan and fix steps, and adds external action documentation to reduce hallucination. The authors publish a 221-task benchmark and an automated execution-based evaluation. On the full 221-task set, GPT-3.5-Turbo with SheetCopilot executes plans without runtime errors 87.3% of the time and produces fully correct final sheets 44.3% of the time, beating a VBA-code baseline (Exec@1 77.8%, Pass@1 16.3%).
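The observe → propose → revise → act loop can be pictured with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the dict-based `Sheet`, the `Write` and `Clear` actions, and the scripted planner that stands in for the LLM and first proposes a hallucinated action name.

```python
from typing import Callable, Optional

Sheet = dict[str, object]  # cell address -> value; toy stand-in for a real workbook

def write(sheet: Sheet, cell: str, value: object) -> None:
    sheet[cell] = value

def clear(sheet: Sheet, cell: str) -> None:
    sheet.pop(cell, None)

# Hypothetical atomic actions: name -> (callable, one-line documentation).
ACTIONS: dict[str, tuple[Callable[..., None], str]] = {
    "Write": (write, "Write(cell, value): put value into cell"),
    "Clear": (clear, "Clear(cell): empty a cell"),
}

def observe(sheet: Sheet) -> str:
    """Summarise the sheet state as text the planner can read."""
    return ", ".join(f"{k}={v!r}" for k, v in sorted(sheet.items())) or "<empty>"

def run_agent(sheet: Sheet,
              propose: Callable[[str, str], Optional[tuple[str, tuple]]],
              max_steps: int = 10) -> list[str]:
    log: list[str] = []
    feedback = ""
    for _ in range(max_steps):
        state = observe(sheet)              # observe
        step = propose(state, feedback)     # propose
        if step is None:                    # planner signals completion
            break
        name, args = step
        if name not in ACTIONS:             # revise: reject and return the action docs
            feedback = "Unknown action. Available: " + "; ".join(
                doc for _, doc in ACTIONS.values())
            log.append(f"rejected {name}")
            continue
        try:
            ACTIONS[name][0](sheet, *args)  # act
            feedback = ""
            log.append(f"ran {name}")
        except TypeError as exc:            # wrong arguments: feed the doc line back
            feedback = f"{ACTIONS[name][1]} (error: {exc})"
            log.append(f"failed {name}")
    return log

def make_scripted_planner():
    """Stand-in for the LLM: uses a made-up action name until it receives feedback."""
    remaining = [("A1", 10), ("B1", "=A1*2")]
    knows_api = False
    def propose(state: str, feedback: str):
        nonlocal knows_api
        if feedback:
            knows_api = True
        if not remaining:
            return None
        if not knows_api:
            return ("SetCell", remaining[0])   # hallucinated action name
        return ("Write", remaining.pop(0))
    return propose
```

Running `run_agent({}, make_scripted_planner())` rejects the invented `SetCell` call once, then completes both writes after the documentation is fed back. This is the role the external action documentation plays in the description above.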

Problem Statement

LLMs can reason in language but struggle to reliably control complex software. We lack a standardized interface, a robust planner that handles multi-step stateful edits, and a reproducible benchmark to measure how well LLMs actually accomplish real spreadsheet tasks.

Main Contribution

SheetCopilot agent: a prompt+state-machine system that turns natural-language spreadsheet requests into sequences of atomic spreadsheet actions.

A public benchmark and evaluation pipeline: 221 curated, realistic spreadsheet tasks (from SuperUser) and an automated execution-based correctness check.
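As a rough sketch of how execution-based metrics of this kind can be tallied, the snippet below computes Exec@1 (plans that ran without runtime errors) and Pass@1 (plans that also produced a fully correct sheet) from per-task outcomes. The `TaskResult` record and the five toy outcomes are assumptions for illustration. The actual pipeline compares final sheets against references, which is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    executed: bool  # plan ran to the end without a runtime error
    correct: bool   # final sheet matched the expected result

def exec_and_pass_at_1(results: list[TaskResult]) -> tuple[float, float]:
    """Return (Exec@1, Pass@1) as percentages over all tasks."""
    if not results:
        raise ValueError("no task results")
    n = len(results)
    exec_rate = 100 * sum(r.executed for r in results) / n
    # A task only passes if it executed and its final sheet is correct.
    pass_rate = 100 * sum(r.executed and r.correct for r in results) / n
    return exec_rate, pass_rate

# Toy outcomes (not benchmark data): 4 of 5 plans run, 2 of 5 end fully correct.
toy = [
    TaskResult(True, True),
    TaskResult(True, False),
    TaskResult(True, True),
    TaskResult(False, False),
    TaskResult(True, False),
]
print(exec_and_pass_at_1(toy))
```

Because a passing task must first execute, Pass@1 can never exceed Exec@1. The gap between the two is the share of plans that run cleanly but produce a wrong sheet.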

Key Findings

High execution rate but moderate full correctness for GPT-3.5-Turbo with SheetCopilot.

Numbers: Exec@1 = 87.3%, Pass@1 = 44.3% (full 221 tasks)

Practical Use: Expect most LLM plans to run (few runtime errors), but verify final outputs: more than half (about 56%) still fall short of full correctness.

Evidence Ref: Table 1; Sec. 5.2

SheetCopilot outperforms a VBA-code-generation baseline on both stability and correctness.

Numbers: Exec@1 +9.5 pts, Pass@1 +28.0 pts vs VBA (87.3% vs 77.8%; 44.3% vs 16.3%)

Practical Use: Prompting LLMs with atomic actions and a state machine is a better practical route than generating VBA for cross-platform spreadsheet automation.

Evidence Ref: Table 1; Sec. 5.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Exec@1 (GPT-3.5-Turbo, full 221 tasks) | 87.3% | VBA 77.8% | +9.5 pts vs VBA | full core set (221 tasks) | Table 1 reports Exec@1 = 87.3% for GPT-3.5-Turbo | Sec. 5.2; Table 1
Pass@1 (GPT-3.5-Turbo, full 221 tasks) | 44.3% | VBA 16.3% | +28.0 pts vs VBA | full core set (221 tasks) | Table 1 reports Pass@1 = 44.3% for GPT-3.5-Turbo and 16.3% for VBA | Sec. 5.2; Table 1
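The deltas in the results are differences in percentage points, not relative percentages. A small sketch makes the distinction explicit. The helper names `point_delta` and `relative_gain` are illustrative, and the inputs are the rates quoted above.

```python
def point_delta(system_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points."""
    return round(system_pct - baseline_pct, 1)

def relative_gain(system_pct: float, baseline_pct: float) -> float:
    """Improvement relative to the baseline, in percent."""
    return round(100 * (system_pct - baseline_pct) / baseline_pct, 1)

# (metric, SheetCopilot with GPT-3.5-Turbo, VBA baseline) as quoted in the results
for metric, system, baseline in [("Exec@1", 87.3, 77.8), ("Pass@1", 44.3, 16.3)]:
    print(f"{metric}: {point_delta(system, baseline):+.1f} pts, "
          f"{relative_gain(system, baseline):+.1f}% relative to VBA")
```

Read this way, the Pass@1 gap of 28.0 points corresponds to a relative gain of roughly 172% over the baseline, while the Exec@1 gap of 9.5 points is a relative gain of about 12%.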

What To Try In 7 Days

Prompt an LLM with a small set of atomic actions and an observe-propose-revise loop on 10 common spreadsheet tasks.

Add an external API-style doc for your spreadsheet actions and measure Exec@1 vs a code-generation baseline.

Move a few repetitive spreadsheet workflows (filters, chart creation, simple formulas) into step-by-step SheetCopilot-style prompts and have a human verify outputs.
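For the first two experiments, a prompt that combines the task, the observed sheet state, and an API-style action document could be assembled along these lines. This is a minimal sketch: the prompt wording and the action names `Write` and `AutoFill` are assumptions, not the prompts or actions used in the paper.

```python
def build_prompt(task: str, sheet_state: str, action_docs: dict[str, str]) -> str:
    """Assemble one planning prompt from the task, the sheet state, and the action docs."""
    doc_lines = "\n".join(
        f"- {name}: {doc}" for name, doc in sorted(action_docs.items())
    )
    return (
        "You control a spreadsheet using only the atomic actions listed below.\n"
        f"Atomic actions:\n{doc_lines}\n"
        f"Current sheet state: {sheet_state}\n"
        f"Task: {task}\n"
        "Reply with the next single action, or DONE when the task is complete."
    )

# Hypothetical action documentation for a pilot with two actions.
docs = {
    "Write": "Write(cell, value)",
    "AutoFill": "AutoFill(source, target)",
}
prompt = build_prompt("Sum column B into B4", "B1:B3 = 1, 2, 3", docs)
print(prompt)
```

Keeping the action document as data rather than hard-coding it into the prompt makes it easy to rerun the same tasks with and without the documentation and compare Exec@1.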

Agent Features

Memory
short-term sheet-state feedback per step (observed state)
Planning
stepwise task decomposition; closed-loop re-planning on error feedback
Tool Use
atomic actions (virtual APIs); external action documentation retrieval
Frameworks
in-context learning prompts; API-doc retrieval for revision
Is Agentic

Yes

Architectures
state-machine planner (observe-propose-revise-act)
Collaboration
human-in-the-loop verification encouraged

Optimization Features

Token Efficiency
Not optimized; paper notes high token use and short-horizon limits
System Optimization
Use of concise atomic actions to fit LM context

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

High token consumption; not optimized for long-horizon or very large sheets.

State feedback omits chart/pivot/filter state, so the agent can recreate items it already made.

When Not To Use

For fully hands-off processing of sensitive financial or legal spreadsheets without human review.

For very long multi-step automations that exceed available model context tokens.

Failure Modes

Wrong formulas (common): incorrect formula structure or missing absolute references.

Wrong ranges: selecting incomplete or wrong copy/auto-fill ranges.
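Human review of the formula failure mode can be assisted by a lightweight check that flags cell references lacking `$` anchors before a formula is auto-filled. The sketch below is an assumption-laden illustration, not part of SheetCopilot: the `relative_refs` helper handles only plain A1-style references and ignores sheet-qualified names and quoted strings.

```python
import re

# A1-style reference with optional $ anchors. The lookbehind avoids starting inside
# a longer name; the lookahead skips function names that end in digits, e.g. LOG10(.
_REF = re.compile(r"(?<![A-Za-z$])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![\d(])")

def relative_refs(formula: str) -> list[str]:
    """Return references that are not fully absolute (missing $ on column or row)."""
    return [
        m.group(0)
        for m in _REF.finditer(formula)
        if not (m.group(1) and m.group(3))
    ]

print(relative_refs("=B2/SUM($B$2:$B$10)"))
print(relative_refs("=A$1+$C3"))
```

A flagged reference is not necessarily wrong, since relative references are often intended. The check only narrows down where a reviewer should look when an auto-filled column produces unexpected values.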

Core Entities

Models

GPT-3.5-Turbo, GPT-4, Claude v1, VBA-based method (baseline)

Metrics

Exec@1, Pass@1, A50, A90

Datasets

SheetCopilot core set (221 tasks); 10% evaluation subset (20 tasks); seed collection from SuperUser (~13.5k filtered Q&A pairs)

Benchmarks

SheetCopilot spreadsheet control benchmark