Dr.Spider: 17 targeted perturbations reveal brittle text-to-SQL systems

January 21, 2023 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

8

Authors

Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang

Why It Matters For Business

Text-to-SQL systems that appear accurate in lab tests can silently fail in real use when users phrase questions differently or when schemas store data in alternate formats. That leads to wrong query results and bad UX. Dr.Spider helps find these blind spots before deployment.

Summary TLDR

The authors build Dr.Spider, a diagnostic robustness benchmark based on Spider that applies 17 targeted perturbations to databases, user questions, and SQL. It contains ~15K paired pre- and post-perturbation examples. Evaluating state-of-the-art text-to-SQL systems shows large, repeatable drops: ~14% relative performance loss overall (10.7 points absolute for the best model) and a ~50% relative drop on the hardest perturbation. The paper also analyzes which architecture choices (model size, decoder style, entity linking) help or hurt robustness.

Problem Statement

Text-to-SQL models often succeed on standard test sets but break when inputs change in realistic, task-specific ways. Existing robustness tests are narrow (single phenomena) or handcrafted. Practitioners need a systematic way to measure how changes to database schema, question wording, or small SQL edits affect real system outputs.

Main Contribution

Dr.Spider: a public robustness benchmark built on Spider with 17 perturbation types across DB, natural language question (NLQ), and SQL; ~15K paired examples.

A scalable expert-crowd-AI pipeline: crowdsourced paraphrases + OPT-66B generation + NLI filtering + expert review to create 9 task-specific NLQ perturbation categories.

Programmatic DB and SQL perturbations, including novel DBcontent-equivalence that changes how content is represented (e.g., fullname -> firstname+lastname).

A diagnostic study of SOTA text-to-SQL models (encoder-decoder families, constrained decoding, and in-context CODEX) with fine-grained analysis linking failures to design choices (model size, decoder type, entity linking).

Practical insights: larger models help; bottom-up decoders help with DB perturbations; top-down decoders help with NLQ paraphrases; entity linking improves some value-related robustness but can overfit string matching.
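To make the DBcontent-equivalence idea from the list above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, column names, and rows are hypothetical and are not taken from Dr.Spider; the sketch only illustrates how splitting a compound column keeps the stored information the same while invalidating SQL written against the old schema.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return rows sorted for order-insensitive comparison."""
    return sorted(conn.execute(sql).fetchall())

# Original schema (hypothetical): one compound column.
original = sqlite3.connect(":memory:")
original.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, fullname TEXT)")
original.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [(1, "Joe Sharp"), (2, "Rose White")],
)

# Perturbed schema: the same content split across two columns.
perturbed = sqlite3.connect(":memory:")
perturbed.execute(
    "CREATE TABLE singer (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)"
)
for sid, fullname in original.execute("SELECT id, fullname FROM singer").fetchall():
    first, last = fullname.split(" ", 1)
    perturbed.execute("INSERT INTO singer VALUES (?, ?, ?)", (sid, first, last))

# The gold SQL has to change along with the schema.
gold_original = "SELECT id FROM singer WHERE fullname = 'Joe Sharp'"
gold_perturbed = "SELECT id FROM singer WHERE firstname = 'Joe' AND lastname = 'Sharp'"

original_rows = run(original, gold_original)
perturbed_rows = run(perturbed, gold_perturbed)

# A model that keeps emitting the old column name fails on the perturbed database.
try:
    run(perturbed, gold_original)
    stale_query_ok = True
except sqlite3.OperationalError:
    stale_query_ok = False

print(original_rows, perturbed_rows, stale_query_ok)
```

Both gold queries return the same answer, while the stale query raises an error. That is the kind of failure this perturbation class is designed to expose.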

Key Findings

State-of-the-art text-to-SQL models suffer meaningful accuracy drops on Dr.Spider.

Numbers: Overall execution accuracy drop for best model (PICARD): 76.6% -> 65.9% (10.7 points absolute, 14.0% relative)
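The absolute and relative figures quoted for PICARD are easy to confuse, so the arithmetic is spelled out here. The helper name `accuracy_drop` is an illustrative assumption; the input numbers come from the line above.

```python
from typing import Tuple

def accuracy_drop(pre: float, post: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the pre value)."""
    absolute = pre - post
    relative = absolute / pre * 100.0
    return absolute, relative

abs_drop, rel_drop = accuracy_drop(76.6, 65.9)
print(f"{abs_drop:.1f} points absolute, {rel_drop:.1f}% relative")
# prints: 10.7 points absolute, 14.0% relative
```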

DBcontent-equivalence is the single most damaging perturbation class.

Numbers: PICARD relative robustness accuracy ~49.3% (≈50.7% regression vs. pre-perturbation)
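This summary does not define relative robustness accuracy. The sketch below assumes it means the share of examples answered correctly before the perturbation that are still answered correctly afterwards, so treat that definition, the function name, and the toy data as assumptions and not as the benchmark's official scorer.

```python
from typing import Dict, Iterable, Tuple

def robustness_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Each pair is (correct_before_perturbation, correct_after_perturbation)."""
    pairs = list(pairs)
    n = len(pairs)
    pre_correct = sum(1 for pre, _ in pairs if pre)
    post_correct = sum(1 for _, post in pairs if post)
    both_correct = sum(1 for pre, post in pairs if pre and post)
    return {
        "pre_acc": pre_correct / n,
        "post_acc": post_correct / n,
        # Assumed definition: correct on both / correct before perturbation.
        "relative_robustness": both_correct / pre_correct if pre_correct else 0.0,
    }

# Toy paired outcomes, not real benchmark results.
toy = [(True, True), (True, False), (True, False), (False, False), (True, True)]
metrics = robustness_metrics(toy)
print(metrics)
```

Under this reading, a relative robustness of 49.3% means roughly half of the questions a model answered correctly before the perturbation are answered incorrectly after it.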

Value-format changes in user questions break models that rely on string matching.

Numbers: Value-synonym EX drop for PICARD: 72.5% -> 53.0% (≈26.9% relative drop)

Entity linking (concatenating DB content into inputs) helps value robustness but can overfit.

Numbers: T5-3B vs. T5-3B+linking on EX: column-value 68.7% -> 78.1%; value-synonym 35.8% -> 46.1%
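A rough sketch of why concatenation-style linking depends on surface matching: database values are attached to the model input only when they literally appear in the question. The serialization format, function names, and example data are hypothetical and do not reproduce any specific model's preprocessing.

```python
from typing import Dict, List

def link_values(question: str, column_values: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per column, the DB values that occur verbatim (case-insensitive) in the question."""
    q = question.lower()
    matched = {}
    for column, values in column_values.items():
        hits = [v for v in values if v.lower() in q]
        if hits:
            matched[column] = hits
    return matched

def serialize(question: str, column_values: Dict[str, List[str]]) -> str:
    """Concatenate the question with column names, attaching any matched values."""
    links = link_values(question, column_values)
    parts = [question]
    for column in column_values:
        if column in links:
            parts.append(f"{column} ( {' , '.join(links[column])} )")
        else:
            parts.append(column)
    return " | ".join(parts)

# Hypothetical database content.
db = {
    "country.name": ["United States", "France"],
    "country.continent": ["Europe", "North America"],
}
original_q = "How many singers are from the United States?"
synonym_q = "How many singers are from the USA?"

print(serialize(original_q, db))
print(serialize(synonym_q, db))
```

With the original wording, the input carries the hint `country.name ( United States )`. With the synonym, no value matches and the hint disappears, which is consistent with the smaller gain reported on value-synonym than on column-value.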

Larger pretrained models improve pre- and post-perturbation accuracy.

Numbers: T5-BASE -> T5-LARGE -> T5-3B EX (All): pre 54.3% -> 64.2% -> 69.2%; post 40.6% -> 52.4% -> 57.1%

Decoder style affects failure modes differently.

Numbers: SMBOP (bottom-up) is more robust to DB perturbations; GRAPPA (top-down) is more robust to NLQ paraphrases (EM/EX trends)

Results

Accuracy

Value: 76.6% -> 65.9%

Baseline: 76.6% (pre-perturbation on Spider-dev)

Relative robustness (DBcontent-equivalence, PICARD)

Value: 49.3% (relative robustness accuracy)

Baseline: 88.7% pre-perturbation

Entity-linking effect (EX on value-synonym)

Value: 35.8% -> 46.1%

Baseline: T5-3B without linking

Scaling effect (T5 family EX, All)

Value: T5-BASE 54.3% -> T5-LARGE 64.2% -> T5-3B 69.2%

Baseline: T5-BASE (pre-perturbation)

What To Try In 7 Days

Run Dr.Spider on your current text-to-SQL model to identify immediate failure classes.

Add simple data augmentation examples for the top 1–2 failing perturbations (value format and DBcontent-equivalence).

Enable and validate an entity-linking or value-normalization step, then re-run targeted tests from Dr.Spider to measure gains/losses.
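One possible way to organize the first step is to log, for every paired example, whether the model was correct before and after the perturbation, then rank perturbation classes by accuracy drop. The record layout and toy outcomes below are assumptions for illustration. They are not the benchmark's file format or real results.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Record = Tuple[str, bool, bool]  # (perturbation name, correct before, correct after)

def rank_by_drop(records: Iterable[Record]) -> List[Tuple[str, float, float, float]]:
    """Return (name, pre_acc, post_acc, drop) rows, largest accuracy drop first."""
    totals = defaultdict(lambda: [0, 0, 0])  # [example count, pre correct, post correct]
    for name, pre_ok, post_ok in records:
        counts = totals[name]
        counts[0] += 1
        counts[1] += int(pre_ok)
        counts[2] += int(post_ok)
    rows = [
        (name, pre / n, post / n, (pre - post) / n)
        for name, (n, pre, post) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Toy outcomes for three perturbation classes (made-up data).
records = [
    ("DBcontent-equivalence", True, False),
    ("DBcontent-equivalence", True, False),
    ("DBcontent-equivalence", True, True),
    ("DBcontent-equivalence", False, False),
    ("value-synonym", True, False),
    ("value-synonym", True, True),
    ("value-synonym", True, True),
    ("value-synonym", False, False),
    ("column-value", True, True),
    ("column-value", True, True),
]
ranking = rank_by_drop(records)
for name, pre_acc, post_acc, drop in ranking:
    print(f"{name}: pre={pre_acc:.2f} post={post_acc:.2f} drop={drop:.2f}")
```

The classes at the top of the ranking are the natural first targets for augmentation or value normalization. Re-running the same script after a change gives a like-for-like comparison.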

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Built from Spider development set only; blind spots may exist for other domains or proprietary schemas.
  • NLQ perturbations depend on OPT-66B generation and filtering choices; different PLMs or prompts may yield different paraphrase distributions.
  • Some SQL edits were constrained to keep surface indicators common; not all realistic semantic changes are covered.
  • Entity linking evaluation focused on simple concatenation style; other linking strategies were not explored in depth.

When Not To Use

  • As the only robustness test for domain-specific databases not covered by Spider.
  • To claim full robustness guarantees — Dr.Spider diagnoses common failure modes but is not exhaustive.
  • As a training set replacement for domain-specific supervised fine-tuning without further validation.

Failure Modes

  • Models overfit to string matching between NLQ tokens and DB content; value-format changes break predictions.
  • Models can be brittle to alternate schema representations (e.g., compound ↔ split columns, booleans ↔ text).
  • Entity-linking by raw concatenation can improve value accuracy but may cause over-reliance on exact matches.
  • Decoder type yields trade-offs: bottom-up helps schema-local changes; top-down helps sentence-level paraphrase robustness.

Core Entities

Models

  • PICARD
  • T5-3B
  • T5-3B LK
  • T5-LARGE
  • T5-BASE
  • RATSQL
  • GRAPPA
  • SMBOP
  • CODEX

Metrics

  • Execution accuracy (EX)
  • Exact set match (EM)

Datasets

  • Spider
  • Dr.Spider

Benchmarks

  • Spider
  • Dr.Spider