Dr.Spider: 17 targeted perturbations reveal brittle text-to-SQL systems

Overview

Decision Snapshot: Needs Validation

The benchmark is well-scoped and publicly released, with clear metrics and multiple models evaluated. Evidence is strong for measured failure modes, but the study is limited to Spider-dev-based examples and specific PLM generation choices.

Citations: 8

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 60%

Authors

Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang

Why It Matters For Business

Text-to-SQL systems that appear accurate in lab tests can silently fail in real use when users phrase questions differently or when schemas store data in alternate formats. That leads to wrong query results and bad UX. Dr.Spider helps find these blind spots before deployment.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors build Dr.Spider, a diagnostic benchmark of 17 targeted perturbations (on databases, user questions, and SQL) based on Spider. Dr.Spider contains ~15K pre/post-perturbation example pairs. Evaluating state-of-the-art text-to-SQL systems shows large, repeatable drops: for the best model, about 14% relative performance loss overall (10.7 points absolute) and a ~50% regression on the hardest perturbation. The paper also analyzes which architecture choices (model size, decoder style, entity linking) help or hurt robustness.

Problem Statement

Text-to-SQL models often succeed on standard test sets but break when inputs change in realistic, task-specific ways. Existing robustness tests are narrow (single phenomena) or handcrafted. Practitioners need a systematic way to measure how changes to database schema, question wording, or small SQL edits affect real system outputs.

Main Contribution

Dr.Spider: a public robustness benchmark built on Spider with 17 perturbation types across DB, natural language question (NLQ), and SQL; ~15K paired examples.

A scalable expert-crowd-AI pipeline: crowdsourced paraphrases + OPT-66B generation + NLI filtering + expert review to create 9 task-specific NLQ perturbation categories.
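
The filtering stage of this pipeline can be pictured as a gate on generated paraphrases. The following is a minimal sketch only, not the authors' implementation: `filter_paraphrases`, the bidirectional check, the `0.8` threshold, and `toy_overlap_score` are all hypothetical, and the toy scorer is a token-overlap placeholder standing in for a real NLI model.

```python
from typing import Callable, List


def filter_paraphrases(
    original: str,
    candidates: List[str],
    entail_score: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep candidates the scorer judges equivalent to the original in both directions."""
    kept = []
    for cand in candidates:
        if cand.strip().lower() == original.strip().lower():
            continue  # an identical string is not a perturbation
        forward = entail_score(original, cand)
        backward = entail_score(cand, original)
        if min(forward, backward) >= threshold:
            kept.append(cand)
    return kept


def toy_overlap_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer (NOT an NLI model): share of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(tok in premise_tokens for tok in hypothesis_tokens) / len(hypothesis_tokens)


question = "how many singers are there"
generated = [
    "how many singers are there",   # duplicate, dropped
    "there are how many singers",   # same tokens, kept by the toy scorer
    "list all concerts",            # unrelated, dropped
]
survivors = filter_paraphrases(question, generated, toy_overlap_score)
```

In practice the scorer would be swapped for an entailment model, and surviving candidates would still go to expert review as the pipeline description states.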

Key Findings

State-of-the-art text-to-SQL models suffer meaningful accuracy drops on Dr.Spider.

Numbers: Overall execution accuracy drop for the best model (PICARD): 76.6% -> 65.9% (10.7 points absolute, 14.0% relative)

Practical Use: Run Dr.Spider on your model before deployment. Expect non-trivial regressions when input wording, schema names, or small SQL conditions change.

Evidence Ref: Abstract / Table 3 (All row for PICARD)
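
The absolute and relative figures describe the same drop in two ways, which is easy to mix up. A small worked calculation using only the two accuracies quoted above (the helper name `accuracy_drop` is illustrative):

```python
from typing import Tuple


def accuracy_drop(pre: float, post: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the pre-perturbation score)."""
    absolute = pre - post
    relative = 100.0 * absolute / pre
    return absolute, relative


abs_drop, rel_drop = accuracy_drop(76.6, 65.9)
print(f"{abs_drop:.1f} points absolute, {rel_drop:.1f}% relative")
# prints: 10.7 points absolute, 14.0% relative
```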

DBcontent-equivalence is the single most damaging perturbation class.

Numbers: PICARD relative robustness ~49.3% (≈50.7% regression vs. pre-perturbation)

Practical Use: Test and augment for alternative DB representations (e.g., fullname ↔ firstname + lastname, booleans ↔ text). Models need training examples that show the same content represented in multiple formats.

Evidence Ref: Abstract / Table 3 and Table 14 (DBcontent-equivalence rows)
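
To make the content-equivalence idea concrete, here is a toy illustration with SQLite. The table names, column names, and row are invented for this sketch and are not taken from Dr.Spider; the point is only that the same fact needs a differently shaped query once a compound column is split.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original representation: one compound column (hypothetical schema).
cur.execute("CREATE TABLE singer_v1 (id INTEGER, fullname TEXT)")
cur.execute("INSERT INTO singer_v1 VALUES (1, 'Joe Sharp')")

# Equivalent representation: the same content split across two columns.
cur.execute("CREATE TABLE singer_v2 (id INTEGER, firstname TEXT, lastname TEXT)")
cur.execute("INSERT INTO singer_v2 VALUES (1, 'Joe', 'Sharp')")

q_v1 = "SELECT id FROM singer_v1 WHERE fullname = 'Joe Sharp'"
q_v2 = "SELECT id FROM singer_v2 WHERE firstname = 'Joe' AND lastname = 'Sharp'"
# A naive port that still string-matches the full name against a single column.
q_naive = "SELECT id FROM singer_v2 WHERE firstname = 'Joe Sharp'"

rows_v1 = cur.execute(q_v1).fetchall()        # finds the singer
rows_v2 = cur.execute(q_v2).fetchall()        # finds the singer
rows_naive = cur.execute(q_naive).fetchall()  # silently returns nothing
```

The naive query runs without error, which is why this kind of failure tends to surface as wrong answers and not as crashes.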

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Execution accuracy (EX, PICARD)	76.6% -> 65.9%	76.6% (pre-perturbation on Spider-dev)	-10.7pp	Dr.Spider (macro-average across perturbations)	Table 3 All row for PICARD (EX)	Table 3
Relative robustness (DBcontent-equivalence, PICARD)	49.3% (relative robustness accuracy)	88.7% pre	≈50.7% regression	Dr.Spider DBcontent-equivalence	Table 14 and discussion in Section 5.1	Table 14
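
For readers who want to compute these quantities on their own paired predictions, the sketch below assumes relative robustness means the share of examples still answered correctly after perturbation among those answered correctly before it. That reading should be checked against the paper's definition; the function name and toy data are illustrative.

```python
from typing import Dict, List, Tuple


def robustness_metrics(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each pair is (correct before perturbation, correct after perturbation)."""
    if not pairs:
        raise ValueError("need at least one example pair")
    n = len(pairs)
    pre_correct = sum(pre for pre, _ in pairs)
    post_correct = sum(post for _, post in pairs)
    both_correct = sum(pre and post for pre, post in pairs)
    return {
        "pre_acc": 100.0 * pre_correct / n,
        "post_acc": 100.0 * post_correct / n,
        # Assumed definition: conditioned on being right before the perturbation.
        "relative_robustness": 100.0 * both_correct / pre_correct if pre_correct else 0.0,
    }


# Toy outcomes for five example pairs (made up for illustration).
toy_pairs = [(True, True), (True, False), (True, False), (True, True), (False, False)]
metrics = robustness_metrics(toy_pairs)
```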

What To Try In 7 Days

Run Dr.Spider on your current text-to-SQL model to identify immediate failure classes.

Add simple data augmentation examples for the top 1–2 failing perturbations (value format and DBcontent-equivalence).

Enable and validate an entity-linking or value-normalization step, then re-run targeted tests from Dr.Spider to measure gains/losses.
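
One way to turn the first step into a prioritized list is to rank perturbation types by their pre/post accuracy gap and track the macro-average. The sketch below assumes you already have per-perturbation accuracies from your own evaluation run; the names and numbers in `toy_scores` are invented placeholders, not Dr.Spider results.

```python
from statistics import mean
from typing import Dict, List, Tuple

# perturbation name -> (pre-perturbation accuracy, post-perturbation accuracy)
Scores = Dict[str, Tuple[float, float]]


def rank_by_drop(scores: Scores) -> List[Tuple[str, float]]:
    """Sort perturbation types by absolute accuracy drop, largest first."""
    drops = [(name, pre - post) for name, (pre, post) in scores.items()]
    return sorted(drops, key=lambda item: item[1], reverse=True)


def macro_average(scores: Scores) -> Tuple[float, float]:
    """Unweighted mean of pre and post accuracy across perturbation types."""
    pre_avg = mean(pre for pre, _ in scores.values())
    post_avg = mean(post for _, post in scores.values())
    return pre_avg, post_avg


# Placeholder numbers for illustration only.
toy_scores: Scores = {
    "value_format": (80.0, 50.0),
    "schema_synonym": (70.0, 60.0),
    "keyword_synonym": (90.0, 85.0),
}
ranking = rank_by_drop(toy_scores)
pre_avg, post_avg = macro_average(toy_scores)
```

Re-running the same ranking after each mitigation shows whether a gain on one perturbation came at the cost of another.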

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/awslabs/diagnostic-robustness-text-to-sql

Data URLs

https://github.com/awslabs/diagnostic-robustness-text-to-sql

https://github.com/taoyutv/spider (Spider dataset reference)

Risks & Boundaries

Limitations

Built from Spider development set only; blind spots may exist for other domains or proprietary schemas.

NLQ perturbations depend on OPT-66B generation and filtering choices; different PLMs or prompts may yield different paraphrase distributions.

When Not To Use

As the only robustness test for domain-specific databases not covered by Spider.

To claim full robustness guarantees — Dr.Spider diagnoses common failure modes but is not exhaustive.

Failure Modes

Models overfit to string matching between NLQ tokens and DB content; value-format changes break predictions.

Models can be brittle to alternate schema representations (e.g., compound ↔ split columns, booleans ↔ text).
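
A value-normalization step is one possible mitigation for both failure modes. The following is a minimal sketch under stated assumptions: the boolean spellings, the function names, and the matching strategy are all illustrative choices, not something the paper prescribes.

```python
from typing import Iterable, Optional

# Assumed spellings of booleans; extend for your own data.
_TRUE = {"true", "t", "yes", "y", "1"}
_FALSE = {"false", "f", "no", "n", "0"}


def normalize_value(value: object) -> str:
    """Map a raw value to a canonical lowercase form with collapsed whitespace."""
    text = str(value).strip().lower()
    if text in _TRUE:
        return "true"
    if text in _FALSE:
        return "false"
    return " ".join(text.split())


def match_question_value(mention: str, column_values: Iterable[object]) -> Optional[object]:
    """Return the stored value whose normalized form equals the mention's, if any."""
    target = normalize_value(mention)
    for stored in column_values:
        if normalize_value(stored) == target:
            return stored
    return None
```

For example, `match_question_value("Yes", ["T", "F"])` returns `"T"`, so a question that says "yes" can still be linked to a column that stores single-letter flags.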

Core Entities

Models

PICARD, T5-3B, T5-3B LK, T5-LARGE, T5-BASE, RATSQL, GRAPPA, SMBOP, CODEX

Metrics

Accuracy, Exact set match (EM)
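
The results above are reported as execution accuracy (EX), which compares query results and not query text. A simplified sketch of such a check with SQLite follows; `execution_match` is a hypothetical helper, it ignores row order even for ORDER BY queries, and official evaluation scripts handle more cases than this.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows (row order ignored)."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)


# Tiny in-memory database for demonstration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (a INTEGER)")
demo.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

same_result = execution_match(demo, "SELECT a FROM t WHERE a > 1", "SELECT a FROM t WHERE a >= 2")
different_result = execution_match(demo, "SELECT a FROM t", "SELECT a FROM t WHERE a >= 2")
invalid_query = execution_match(demo, "SELECT nope FROM t", "SELECT a FROM t")
```

Two differently written queries can match under EX, whereas exact set match compares the structure of the SQL itself.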

Datasets

Spider, Dr.Spider

Benchmarks

Spider, Dr.Spider
