DAIL-SQL: prompt+example selection that sets a new Spider Text-to-SQL high (86.6% EX)

August 29, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper runs controlled ablations across LLMs and provides leaderboard results; findings are actionable but rely on access to high-quality LLMs (GPT-4) for top performance.

Citations: 23

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, Jingren Zhou

Why It Matters For Business

DAIL-SQL gives a practical recipe to improve Text-to-SQL accuracy while cutting token cost; that reduces API spend and speeds up production query interfaces.

Summary TLDR

This paper runs a systematic benchmark of prompt engineering choices for Text-to-SQL on LLMs and proposes DAIL-SQL: use code-like schema prompts, select examples by both masked-question and predicted-query similarity, and show question–SQL pairs (no schema) in examples. DAIL-SQL reaches a new top score on the Spider leaderboard (86.6% execution accuracy with GPT-4 + self-consistency). The paper also studies open-source LLMs and supervised fine-tuning: fine-tuning boosts zero-shot performance but tends to hurt the model’s ability to learn from in-context examples. The authors measure token and dollar costs and recommend trade-offs for practical deployments.

Problem Statement

There is no systematic, cross-model benchmark for how to prompt LLMs to translate natural language questions into SQL. Teams lack clear guidance on representations, which examples to pick, how to arrange them, and how to balance accuracy vs. token (cost) efficiency. Open-source LLMs and supervised fine-tuning are also underexplored for Text-to-SQL.

Main Contribution

Systematic, controlled comparison of question representations, example selection, and example organization across several LLMs.

DAIL-SQL method: Code-style schema prompts + DAIL selection (question+query similarity) + DAIL organization (question–SQL pairs without schema).
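A code-style schema prompt can be sketched as follows. This is a minimal illustration, assuming a schema held as a plain dictionary; the function name `create_table_prompt`, the column types, and the exact wording of the instruction comment are hypothetical and not taken from the paper.

```python
def create_table_prompt(schema, question):
    """Render a schema as CREATE TABLE statements followed by the question."""
    parts = []
    for table, columns in schema.items():
        cols = ",\n".join(f"  {name} {ctype}" for name, ctype in columns)
        parts.append(f"CREATE TABLE {table} (\n{cols}\n);")
    # Short rule asking for SQL only (wording is an assumption).
    parts.append(f"/* Answer the following with SQL only: {question} */")
    # End on SELECT so the model continues the query.
    parts.append("SELECT")
    return "\n\n".join(parts)


# Hypothetical two-table schema used only for illustration.
schema = {
    "singer": [("singer_id", "INT"), ("name", "TEXT"), ("age", "INT")],
    "concert": [("concert_id", "INT"), ("singer_id", "INT"), ("year", "INT")],
}
prompt = create_table_prompt(schema, "How many singers are older than 30?")
print(prompt)
```

Ending the prompt on `SELECT` is one common way to nudge a model toward emitting a query rather than an explanation; whether it helps for a given model should be checked empirically.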

Key Findings

DAIL-SQL sets a new Spider top with GPT-4 and self-consistency.

Numbers: 86.6% execution accuracy (leaderboard, with self-consistency)

Practical Use: If you can afford GPT-4 and optional self-consistency, DAIL-SQL is a ready recipe to push accuracy on Spider to state-of-the-art.

Evidence Ref: Sections 3.3, F.1

DAIL selection (question + predicted-query similarity) improves few-shot selection versus prior heuristics.

Numbers: With GPT-4 in the 5-shot setting, DAIL selection reaches 82.4% EX vs. ~79.5% EX for random selection (Table 2)

Practical Use: When choosing few-shot examples, rank candidates by masked-question embedding and filter by predicted-query similarity to pick higher-impact examples.

Evidence Ref: Table 2, Sec. 4.3.1
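The rank-then-filter idea behind DAIL selection can be sketched as below. This is not the authors' implementation: a bag-of-words cosine stands in for a real sentence embedding, a keyword-set Jaccard score stands in for query similarity, and the names `dail_select`, `db_terms`, the keyword list, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative keyword list, not a complete SQL grammar.
SQL_KEYWORDS = {"select", "from", "where", "group", "order", "by", "having",
                "join", "on", "count", "avg", "sum", "min", "max", "limit"}


def mask_question(question, db_terms):
    """Replace database-specific words and numbers with a placeholder."""
    words = re.findall(r"\w+", question.lower())
    return ["<mask>" if w in db_terms or w.isdigit() else w for w in words]


def sql_skeleton(sql):
    """Keep only the SQL keywords, dropping table names, columns and values."""
    return {t for t in re.findall(r"\w+", sql.lower()) if t in SQL_KEYWORDS}


def cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for an embedding model)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def dail_select(question, predicted_sql, db_terms, candidates, k=2, threshold=0.5):
    """Rank by masked-question similarity, then filter by query similarity."""
    target_q = mask_question(question, db_terms)
    target_s = sql_skeleton(predicted_sql)
    ranked = sorted(
        candidates,
        key=lambda c: cosine(target_q, mask_question(c["question"], c["db_terms"])),
        reverse=True,
    )
    kept = [c for c in ranked if jaccard(target_s, sql_skeleton(c["sql"])) >= threshold]
    return kept[:k]


# Hypothetical candidate pool.
candidates = [
    {"question": "List the names of all teachers ordered by age.",
     "db_terms": {"teachers"}, "sql": "SELECT name FROM teacher ORDER BY age"},
    {"question": "How many cars are cheaper than 5000?",
     "db_terms": {"cars"}, "sql": "SELECT count(*) FROM car WHERE price < 5000"},
    {"question": "How many students are older than 20?",
     "db_terms": {"students"}, "sql": "SELECT count(*) FROM student WHERE age > 20"},
]
picked = dail_select(
    "How many singers are older than 30?",
    "SELECT count(*) FROM singer WHERE age > 30",  # preliminary model prediction
    {"singers"},
    candidates,
)
for c in picked:
    print(c["question"])
```

In this toy pool the "teachers" example is dropped because its query skeleton (`ORDER BY`) differs from the predicted count-with-filter query, even though it shares masked tokens with the target question.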

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
Accuracy | 86.6% (DAIL-SQL + GPT-4 + self-consistency) | 85.3% (previous SOTA DIN-SQL + GPT-4) | +1.3% | Spider test / leaderboard | Leaderboard submission and Section F.1 | F.1
Accuracy | 83.5% (DAIL-SQL with GPT-4, DAIL organization few-shot) | 72.3% (GPT-4 zero-shot baseline in paper) | +11.2% (few-shot gain) | Spider-dev | Sec. 4.3.2, Fig. 4, C.3 | Sec. 4.3
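The top result depends on self-consistency. This summary does not describe the voting procedure, so the following is only a sketch of one common variant: sample several candidate queries, execute each, and keep a query whose execution result is the most frequent. The function names and the toy `singer` table are hypothetical.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return an order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def vote(conn, candidates):
    """Return the first candidate whose execution result is the most common."""
    results = [execute(conn, sql) for sql in candidates]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None  # every candidate failed to execute
    winner, _ = counts.most_common(1)[0]
    for sql, result in zip(candidates, results):
        if result == winner:
            return sql


# Toy in-memory database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("A", 25), ("B", 35), ("C", 40)])

# Stand-ins for several sampled model outputs.
samples = [
    "SELECT count(*) FROM singer WHERE age > 30",
    "SELECT count(name) FROM singer WHERE age > 30",
    "SELECT count(*) FROM singer WHERE age >= 25",
    "SELECT count(*) FROM singers",  # invalid table name, gets no vote
]
chosen = vote(conn, samples)
print(chosen)
```

Each extra sample is another model call, which is why the voting budget shows up as a cost consideration.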

What To Try In 7 Days

Replace natural-language schema descriptions with code-style CREATE TABLE prompts and add a short rule such as 'SQL only' to improve output precision.

Implement DAIL selection: mask DB tokens, embed questions, and filter candidates by predicted-query similarity.

Test DAIL organization (question + SQL examples without schema) to save tokens while keeping the examples useful; measure API cost against accuracy.
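The token trade-off in the last step can be checked with a small comparison like the one below. It is a rough sketch under stated assumptions: the prompt template wording is hypothetical, and `rough_tokens` counts whitespace-separated words, which only approximates what a real model tokenizer would report.

```python
def full_example(schema_text, question, sql):
    """Example that repeats its schema: informative but token-heavy."""
    return f"{schema_text}\n/* Answer the following: {question} */\n{sql}"


def dail_example(question, sql):
    """Example with only the question and its SQL, no schema."""
    return f"/* Answer the following: {question} */\n{sql}"


def rough_tokens(text):
    """Crude whitespace token count; a real tokenizer will differ."""
    return len(text.split())


# Hypothetical few-shot examples: (schema, question, SQL).
examples = [
    ("CREATE TABLE student (id INT, name TEXT, age INT);",
     "How many students are older than 20?",
     "SELECT count(*) FROM student WHERE age > 20"),
    ("CREATE TABLE car (id INT, model TEXT, price INT);",
     "How many cars are cheaper than 5000?",
     "SELECT count(*) FROM car WHERE price < 5000"),
]
# The target question always keeps its own schema.
target = ("CREATE TABLE singer (singer_id INT, name TEXT, age INT);\n"
          "/* Answer the following: How many singers are older than 30? */\n"
          "SELECT")

full_prompt = "\n\n".join([full_example(*e) for e in examples] + [target])
dail_prompt = "\n\n".join([dail_example(q, s) for _, q, s in examples] + [target])

saved = rough_tokens(full_prompt) - rough_tokens(dail_prompt)
print(f"full: {rough_tokens(full_prompt)}, DAIL: {rough_tokens(dail_prompt)}, saved: {saved}")
```

With real Spider schemas the per-example saving would be much larger than in this toy case, since each dropped schema can span many tables.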

Optimization Features

Token Efficiency
compare prompt formats by average token count and EX
DAIL organization balances tokens vs. examples
Training Optimization
SFT
Inference Optimization
token-efficient prompt design (DAIL organization)
Accuracy

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Fine-tuning experiments use only Spider train split; broader SFT data may change results.

Only two instruction-rule variants were tested; other instruction designs might help more.

When Not To Use

If you cannot afford GPT-4-level API costs or large self-consistency voting budgets.

On workloads with extremely large or many-table schemas where prompt context will overflow tokens.

Failure Modes

Open-source LLMs can lag far behind API LLMs in raw in-context ability unless extensively fine-tuned.

Supervised fine-tuning may overfit to the chosen prompt format and make subsequent few-shot updates ineffective.

Core Entities

Models

GPT-4, GPT-3.5-TURBO, TEXT-DAVINCI-003, CODE-DAVINCI-002, LLaMA-7B, LLaMA-13B, LLaMA-33B, LLaMA-2-CHAT-7B, LLaMA-2-CHAT-13B, LLaMA-2-CHAT-70B, Vicuna-7B, Vicuna-13B, Vicuna-33B, CodeLLaMA-34B, Falcon-40B, GPT4ALL-7B

Metrics

Accuracy, token count, dollar API cost

Datasets

Spider, Spider-Realistic, BIRD

Benchmarks

Spider leaderboard, BIRD leaderboard