SheetCopilot: turn natural language into step-by-step spreadsheet actions using LLMs

May 30, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

6

Authors

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Why It Matters For Business

SheetCopilot lets non-technical users automate many spreadsheet tasks by describing them in plain English, which lowers manual work and reduces mistakes. Critical data still needs human verification, because only about 44% of evaluated tasks ended with a fully correct sheet.

Summary TLDR

SheetCopilot is an LLM-driven agent that converts plain-English spreadsheet requests into executable, step-by-step spreadsheet operations. It defines a compact set of "atomic actions" (virtual APIs), uses a state-machine loop (observe → propose → revise → act) to plan and fix steps, and adds external action documentation to reduce hallucination. The authors publish a 221-task benchmark and an automated execution-based evaluation. On the full 221-task set, GPT-3.5-Turbo with SheetCopilot executes plans without runtime errors 87.3% of the time and produces fully correct final sheets 44.3% of the time, beating a VBA-code baseline (Exec@1 77.8%, Pass@1 16.3%).
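The observe → propose → revise → act loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based sheet, the `Write` action, `run_agent`, and the `stub_planner` that stands in for the LLM are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory "sheet": maps a cell address such as "A1" to a value.
Sheet = Dict[str, object]
# A proposed step is an action name plus keyword arguments.
Step = Tuple[str, Dict[str, object]]


def write_cell(sheet: Sheet, cell: str, value: object) -> None:
    """Illustrative atomic action: write one value into one cell."""
    sheet[cell] = value


# Illustrative registry of atomic actions (not the paper's actual action set).
ACTIONS: Dict[str, Callable[..., None]] = {"Write": write_cell}


def run_agent(
    propose: Callable[[str, Sheet, Optional[str]], Optional[Step]],
    task: str,
    sheet: Sheet,
    max_steps: int = 10,
) -> List[Step]:
    """Observe, propose, revise on error, then act, until the planner returns None."""
    executed: List[Step] = []
    error: Optional[str] = None
    for _ in range(max_steps):
        observed = dict(sheet)                 # observe: snapshot of the sheet state
        step = propose(task, observed, error)  # propose, or revise when error is set
        if step is None:                       # planner signals completion
            break
        name, kwargs = step
        if name not in ACTIONS:                # reject actions the registry lacks
            error = f"unknown action {name!r}"
            continue
        try:
            ACTIONS[name](sheet, **kwargs)     # act
        except TypeError as exc:               # bad arguments become feedback
            error = str(exc)
            continue
        executed.append(step)
        error = None
    return executed


def stub_planner(task: str, observed: Sheet, error: Optional[str]) -> Optional[Step]:
    """Stand-in for the LLM: proposes an undefined action first, then corrects it."""
    if "B1" in observed:
        return None
    if error is None:
        return ("SetValue", {"cell": "B1", "value": 42})  # undefined action
    return ("Write", {"cell": "B1", "value": 42})


sheet: Sheet = {"A1": "total"}
plan = run_agent(stub_planner, "put 42 in B1", sheet)
```

In this toy run the first proposal is rejected before it touches the sheet, the error message is fed back, and only the corrected step ends up in `plan`. A real system would replace `stub_planner` with a model call and the dictionary with an actual spreadsheet backend.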

Problem Statement

LLMs can reason in language but struggle to reliably control complex software. We lack a standardized interface, a robust planner that handles multi-step stateful edits, and a reproducible benchmark to measure how well LLMs actually accomplish real spreadsheet tasks.

Main Contribution

SheetCopilot agent: a prompt+state-machine system that turns natural-language spreadsheet requests into sequences of atomic spreadsheet actions.

A public benchmark and evaluation pipeline: 221 curated, realistic spreadsheet tasks (from SuperUser) and an automated execution-based correctness check.

Empirical study and ablations: these show that closed-loop planning, external action docs, and fine-grained atomic actions materially improve reliability over code-generation baselines.

Key Findings

High execution but moderate full correctness for GPT-3.5-Turbo with SheetCopilot.

Numbers: Exec@1 = 87.3%, Pass@1 = 44.3% (full 221 tasks)

SheetCopilot outperforms a VBA-code-generation baseline on both stability and correctness.

Numbers: Exec@1 +9.5 pts, Pass@1 +28.0 pts vs VBA (87.3% vs 77.8%; 44.3% vs 16.3%)

Closed-loop feedback and external docs substantially boost success.

Numbers: Exec@1 improved from 56.6% to 87.3% (+30.7 pts); external docs alone added 17.2 pts Exec@1 and 10.0 pts Pass@1 in ablations

Finer-grained atomic actions reduce hallucination and increase correctness.

Numbers: Splitting SetFormat raised Exec@1 from 70.7% to 80.5% on format tasks; integrated CreateChart decreased Exec@1 from 91.7%

Results

Exec@1 (GPT-3.5-Turbo, full 221 tasks)

Value: 87.3%

Pass@1 (GPT-3.5-Turbo, full 221 tasks)

Value: 44.3%

Baseline: VBA 16.3%

Exec@1 / Pass@1 (10% subset)

Value: GPT-3.5-Turbo Exec@1 85.0%, Pass@1 45.0%; GPT-4 Exec@1 65.0%, Pass@1 55.0%; Claude Exec@1 80.0%, Pass@1 40.0%

Efficiency (A50 / A90 for GPT-3.5-Turbo)

Value: A50 = 1.50, A90 = 3.00
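To make the two headline metrics concrete, here is a small sketch of how Exec@1 and Pass@1 could be computed from per-task outcomes. It follows the definitions used in the summary (Exec@1: the plan runs without a runtime error; Pass@1: the final sheet is fully correct). The `TaskResult` record and the rule that a crashed plan never counts as a pass are assumptions made for this illustration, and A50/A90 are not covered.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TaskResult:
    """Outcome of one generated plan for one task (hypothetical record type)."""
    executed: bool  # plan ran to completion with no runtime error
    correct: bool   # final sheet matched the reference check


def exec_and_pass_at_1(results: Iterable[TaskResult]) -> Tuple[float, float]:
    """Return (Exec@1, Pass@1) as percentages over all tasks."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    executed = sum(r.executed for r in results)
    # Assumption: a plan that crashed cannot count as a pass.
    passed = sum(r.executed and r.correct for r in results)
    return 100.0 * executed / n, 100.0 * passed / n


# Toy data: four tasks, three execute cleanly, two of those are fully correct.
toy = [
    TaskResult(executed=True, correct=True),
    TaskResult(executed=True, correct=False),
    TaskResult(executed=True, correct=True),
    TaskResult(executed=False, correct=False),
]
exec1, pass1 = exec_and_pass_at_1(toy)
```

The gap between the two numbers is the share of plans that ran cleanly but produced a wrong sheet, which is the case human review needs to catch.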

What To Try In 7 Days

Prompt an LLM with a small set of atomic actions and an observe-propose-revise loop on 10 common spreadsheet tasks.

Add an external API-style doc for your spreadsheet actions and measure Exec@1 vs a code-generation baseline.

Move a few repetitive spreadsheet workflows (filters, chart creation, simple formulas) into step-by-step SheetCopilot-style prompts and have a human verify outputs.
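As a starting point for the first experiment, the prompt for each step can be assembled from the action list, the observed sheet state, and the steps taken so far. The sketch below is one possible layout, not the prompt used in the paper; `build_prompt`, the wording of the instructions, and the example `Filter` and `Sort` descriptions are all hypothetical.

```python
from typing import Dict, List


def build_prompt(
    task: str,
    sheet_state: str,
    action_docs: Dict[str, str],
    history: List[str],
) -> str:
    """Assemble a single-step planning prompt (illustrative layout only)."""
    lines = [
        "You control a spreadsheet using ONLY the actions listed below.",
        "",
        "Actions:",
    ]
    for name in sorted(action_docs):
        lines.append(f"- {name}: {action_docs[name]}")
    lines += ["", f"Sheet state: {sheet_state}", f"Task: {task}", "", "Steps so far:"]
    # Number earlier steps so the model can see what is already done.
    lines += [f"{i}. {step}" for i, step in enumerate(history, 1)] or ["(none)"]
    lines += ["", "Reply with the next single action, or DONE if the task is complete."]
    return "\n".join(lines)


prompt = build_prompt(
    task="Show only rows where Region is East",
    sheet_state="Sheet1 has headers Region, Sales in A1:B1 and 20 data rows",
    action_docs={
        "Filter": "Filter(range, column, criterion) keeps matching rows",
        "Sort": "Sort(range, column, ascending) reorders rows",
    },
    history=[],
)
```

Asking for one action per reply keeps each response easy to validate and execute before the next observation is taken.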

Agent Features

Memory

  • short-term sheet-state feedback per step (observed state)

Planning

  • stepwise task decomposition
  • closed-loop re-planning on error feedback

Tool Use

  • atomic actions (virtual APIs)
  • external action documentation retrieval
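One way to picture how atomic actions and external documentation work together: keep a small registry of documented actions, check each proposed call against it, and return the documented usage as feedback when the call is invalid. This is a hedged sketch, not the paper's mechanism. `SetFormat` and `CreateChart` are action names mentioned in this summary, but the parameter lists, usage strings, `ActionDoc`, and `validate_call` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ActionDoc:
    """Documentation entry for one atomic action (illustrative schema)."""
    name: str
    params: Tuple[str, ...]
    usage: str


# Hypothetical parameter lists and usage strings, not the paper's definitions.
DOCS: Dict[str, ActionDoc] = {
    doc.name: doc
    for doc in (
        ActionDoc("SetFormat", ("range", "bold"),
                  "SetFormat(range='A1:B2', bold=True)"),
        ActionDoc("CreateChart", ("source", "chart_type"),
                  "CreateChart(source='A1:B9', chart_type='Line')"),
    )
}


def validate_call(name: str, kwargs: Dict[str, object]) -> List[str]:
    """Return feedback messages; an empty list means the call looks valid."""
    doc = DOCS.get(name)
    if doc is None:
        known = ", ".join(sorted(DOCS))
        return [f"Unknown action {name!r}. Known actions: {known}"]
    unknown = sorted(set(kwargs) - set(doc.params))
    if unknown:
        # Attach the documented usage so a revision prompt can quote it.
        return [f"{name} got unexpected argument(s) {unknown}. Usage: {doc.usage}"]
    return []


ok = validate_call("SetFormat", {"range": "A1:B2", "bold": True})
hallucinated = validate_call("MakeBold", {"range": "A1"})
bad_args = validate_call("CreateChart", {"kind": "Line"})
```

Looking up the documentation only when a call fails keeps the base prompt short, which matters given the token limits noted elsewhere in this summary.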

Frameworks

  • in-context learning prompts
  • API-doc retrieval for revision

Is Agentic

true

Architectures

  • state-machine planner (observe-propose-revise-act)

Collaboration

  • human-in-the-loop verification encouraged

Optimization Features

Token Efficiency

  • Not optimized; paper notes high token use and short-horizon limits

System Optimization

  • Concise atomic actions keep prompts within the LLM's context window

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High token consumption; not optimized for long-horizon or very large sheets.
  • State feedback omits chart/pivot/filter state, so the agent can recreate items it already made.
  • Evaluation environment lacks full Excel features; dataset and ground truths are curated but not exhaustive.
  • Potential correctness issues on formulas and precise range handling; human verification still required for sensitive data.

When Not To Use

  • For fully hands-off processing of sensitive financial or legal spreadsheets without human review.
  • For very long multi-step automations that exceed available model context tokens.
  • When you require 100% deterministic formula correctness and cannot add a verification step.

Failure Modes

  • Wrong formulas (common): incorrect formula structure or missing absolute references.
  • Wrong ranges: selecting incomplete or wrong copy/auto-fill ranges.
  • Incomplete solutions: stopping before all task requirements are done (token/context limits).
  • Hallucinated or invalid actions/arguments: inventing undefined APIs or illegal args.
  • Repeated output: generating the same step repeatedly until the token limit is hit.
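Two of the failure modes above lend themselves to cheap automatic checks before a human review: repeated output and missing absolute references. The sketch below is an assumption-laden illustration and not part of SheetCopilot; `is_stuck` and `non_absolute_refs` are hypothetical helpers, and the regular expression covers only simple A1-style references (no sheet names, quoted strings, or whole-column ranges).

```python
import re
from typing import Hashable, List, Sequence


def is_stuck(steps: Sequence[Hashable], window: int = 3) -> bool:
    """True when the last `window` steps are all identical."""
    if window < 2 or len(steps) < window:
        return False
    tail = steps[-window:]
    return all(step == tail[0] for step in tail)


# Simple A1-style reference: optional "$", 1-3 column letters, optional "$", row digits.
# The trailing lookahead skips names followed by "(" such as LOG10(.
_REF = re.compile(r"(?<![A-Za-z0-9_$])(\$?)([A-Z]{1,3})(\$?)(\d+)\b(?!\()")


def non_absolute_refs(formula: str) -> List[str]:
    """List references that are not fully absolute (both column and row locked)."""
    return [
        m.group(0)
        for m in _REF.finditer(formula)
        if not (m.group(1) and m.group(3))
    ]


looping = is_stuck(["Write A1", "AutoFill A1:A9", "AutoFill A1:A9", "AutoFill A1:A9"])
flagged = non_absolute_refs("=VLOOKUP(A2,$D$1:$E$9,2,FALSE)")
```

A flagged reference is not necessarily wrong, since the lookup key is usually meant to be relative. The list is a prompt for a reviewer to confirm that each unlocked reference is intentional before the formula is auto-filled.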

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-4
  • Claude v1
  • VBA-based method (baseline)

Metrics

  • Exec@1
  • Pass@1
  • A50
  • A90

Datasets

  • SheetCopilot core set (221 tasks)
  • 10% evaluation subset (20 tasks)
  • Seed collection from SuperUser (~13.5k filtered Q&A pairs)

Benchmarks

  • SheetCopilot spreadsheet control benchmark