Overview
The paper runs controlled ablations across LLMs and provides leaderboard results; findings are actionable but rely on access to high-quality LLMs (GPT-4) for top performance.
Citations: 23
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
DAIL-SQL gives a practical recipe to improve Text-to-SQL accuracy while cutting token cost; that reduces API spend and speeds up production query interfaces.
Summary (TL;DR)
This paper runs a systematic benchmark of prompt engineering choices for Text-to-SQL on LLMs and proposes DAIL-SQL: use code-like schema prompts, select examples by both masked-question and predicted-query similarity, and show question–SQL pairs (no schema) in examples. DAIL-SQL reaches a new top score on the Spider leaderboard (86.6% execution accuracy with GPT-4 + self-consistency). The paper also studies open-source LLMs and supervised fine-tuning: fine-tuning boosts zero-shot performance but tends to hurt the model’s ability to learn from in-context examples. The authors measure token and dollar costs and recommend trade-offs for practical deployments.
Problem Statement
There is no systematic, cross-model benchmark for how to prompt LLMs to translate natural language questions into SQL. Teams lack clear guidance on representations, which examples to pick, how to arrange them, and how to balance accuracy vs. token (cost) efficiency. Open-source LLMs and supervised fine-tuning are also underexplored for Text-to-SQL.
Main Contribution
Systematic, controlled comparison of question representations, example selection, and example organization across several LLMs.
DAIL-SQL method: Code-style schema prompts + DAIL selection (question+query similarity) + DAIL organization (question–SQL pairs without schema).
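A code-style schema prompt can be sketched as follows. This is a minimal illustration, not the authors' exact template: the function name `create_table_prompt`, the comment wording, and the toy `singer` schema are all assumptions made for the example.

```python
def create_table_prompt(schema, question):
    """Render a schema as CREATE TABLE statements followed by the question.

    `schema` maps table names to lists of (column, type) pairs.
    The comment wording below is illustrative, not the paper's template.
    """
    parts = []
    for table, columns in schema.items():
        cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
        parts.append(f"CREATE TABLE {table} (\n{cols}\n);")
    # A short rule such as "SQL only" asks the model to return just the query.
    parts.append(f"/* Answer the following with SQL only, no explanation: {question} */")
    # Ending on SELECT nudges the model to continue with a query.
    parts.append("SELECT")
    return "\n\n".join(parts)


# Hypothetical toy schema, used only to show the output shape.
schema = {"singer": [("singer_id", "INT"), ("name", "TEXT"), ("age", "INT")]}
prompt = create_table_prompt(schema, "How many singers are older than 30?")
print(prompt)
```

Rendering the schema as DDL keeps column types visible to the model and matches the kind of text code-trained models have seen often.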
Key Findings
DAIL-SQL sets a new top score on the Spider leaderboard (86.6% execution accuracy) with GPT-4 and self-consistency.
DAIL selection (question + predicted-query similarity) improves few-shot selection versus prior heuristics.
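The selection idea (rank candidates by masked-question similarity, keep only those whose SQL resembles a preliminary predicted query) can be sketched roughly as below. Several parts are stand-ins chosen to keep the example dependency-free: bag-of-words cosine replaces a real sentence embedding, Jaccard overlap of SQL keywords replaces a proper query-skeleton comparison, and the names `dail_select`, `tau`, and the sample data are hypothetical.

```python
import math
import re
from collections import Counter

SQL_KEYWORDS = {"select", "from", "where", "group", "by", "order", "having",
                "join", "on", "count", "avg", "max", "min", "sum", "limit",
                "distinct", "and", "or", "desc", "asc"}

def mask_question(question, schema_terms):
    """Lowercase, tokenize, and replace schema-specific words with <mask>."""
    words = re.findall(r"\w+", question.lower())
    return ["<mask>" if w in schema_terms else w for w in words]

def cosine(a, b):
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def skeleton(sql):
    """Set of SQL keywords used in a query (a crude query skeleton)."""
    return {t for t in re.findall(r"\w+", sql.lower()) if t in SQL_KEYWORDS}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dail_select(target_q, target_terms, predicted_sql, candidates, k=2, tau=0.6):
    """Keep candidates with query similarity >= tau, ranked by question similarity."""
    tq = mask_question(target_q, target_terms)
    ts = skeleton(predicted_sql)
    scored = []
    for c in candidates:
        if jaccard(ts, skeleton(c["sql"])) >= tau:
            q_sim = cosine(tq, mask_question(c["question"], c["schema_terms"]))
            scored.append((q_sim, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Hypothetical candidate pool.
candidates = [
    {"question": "How many students are older than 20?", "schema_terms": {"students"},
     "sql": "SELECT count(*) FROM student WHERE age > 20"},
    {"question": "List the names of all teachers.", "schema_terms": {"teachers", "names"},
     "sql": "SELECT name FROM teacher"},
    {"question": "How many cars are heavier than 3000?", "schema_terms": {"cars"},
     "sql": "SELECT count(*) FROM cars_data WHERE weight > 3000"},
]
selected = dail_select("How many singers are older than 30?", {"singers"},
                       "SELECT count(*) FROM singer WHERE age > 30", candidates)
```

Masking removes domain words so that "singers" and "students" questions can match on structure, while the query-similarity filter discards examples whose SQL shape differs from what the model is likely to need.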
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 86.6% (DAIL-SQL + GPT-4 + self-consistency) | 85.3% (previous SOTA DIN-SQL + GPT-4) | +1.3 pp | Spider test / leaderboard | Leaderboard submission and Section F.1 | F.1 |
| Accuracy | 83.5% (DAIL-SQL with GPT-4, DAIL organization few-shot) | 72.3% (GPT-4 zero-shot baseline in paper) | +11.2 pp (few-shot gain) | Spider-dev | Sec. 4.3.2, Fig. 4, C.3 | Sec. 4.3 |
What To Try In 7 Days
Swap natural-language schema descriptions for code-style CREATE TABLE prompts and add a short rule like 'SQL only' to improve output precision.
Implement DAIL selection: mask database-specific tokens in the questions, embed the masked questions, rank candidates by question similarity, and filter them by similarity to a predicted query.
Test DAIL organization (question + SQL examples without schema) to save tokens while keeping the examples useful, and measure API cost against accuracy.
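To compare the token footprint of the two ways of presenting examples, a rough sketch like the following may help. The prompt wording, the helper names, and the whitespace-based `rough_tokens` counter are assumptions; a real measurement should use the tokenizer of the target model.

```python
def dail_organization(examples, target_prompt):
    """Examples shown as question + SQL pairs only (no schema)."""
    blocks = [f"/* Answer the following: {ex['question']} */\n{ex['sql']}"
              for ex in examples]
    return "\n\n".join([*blocks, target_prompt])

def full_organization(examples, target_prompt):
    """Examples shown with their schema as well, for comparison."""
    blocks = [f"{ex['schema']}\n/* Answer the following: {ex['question']} */\n{ex['sql']}"
              for ex in examples]
    return "\n\n".join([*blocks, target_prompt])

def rough_tokens(text):
    """Whitespace word count, used only as a crude proxy for token count."""
    return len(text.split())

# Hypothetical examples and target prompt.
examples = [
    {"schema": "CREATE TABLE student (student_id INT, name TEXT, age INT);",
     "question": "How many students are older than 20?",
     "sql": "SELECT count(*) FROM student WHERE age > 20"},
    {"schema": "CREATE TABLE cars_data (id INT, weight INT, year INT);",
     "question": "How many cars are heavier than 3000?",
     "sql": "SELECT count(*) FROM cars_data WHERE weight > 3000"},
]
target_prompt = ("CREATE TABLE singer (singer_id INT, name TEXT, age INT);\n"
                 "/* Answer the following: How many singers are older than 30? */\nSELECT")

dail_prompt = dail_organization(examples, target_prompt)
full_prompt = full_organization(examples, target_prompt)
print(rough_tokens(dail_prompt), rough_tokens(full_prompt))
```

The saving grows with the number of examples and the size of each example's schema, so it is worth measuring on real databases rather than on a toy case like this one.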
Risks & Boundaries
Limitations
Fine-tuning experiments use only Spider train split; broader SFT data may change results.
Only two instruction-rule variants were tested; other instruction designs might help more.
When Not To Use
If you cannot afford GPT-4-level API costs or large self-consistency voting budgets.
On workloads with extremely large or many-table schemas, where the schema prompt would exceed the model's context window.
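To see why self-consistency budgets matter, consider one common way to implement it: sample several candidate queries, execute each, and keep a query whose result is the most frequent. The sketch below is an assumption about how such voting can be done, not the authors' implementation; `self_consistent_sql` and the in-memory table are hypothetical. Every candidate in the list corresponds to one extra model call.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query; return an order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def self_consistent_sql(conn, candidates):
    """Return the first candidate whose execution result wins the majority vote."""
    results = {sql: execute(conn, sql) for sql in candidates}
    votes = Counter(results[sql] for sql in candidates if results[sql] is not None)
    if not votes:
        return None  # every candidate failed to execute
    winner, _ = votes.most_common(1)[0]
    return next(sql for sql in candidates if results[sql] == winner)

# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 25), ("Bo", 35), ("Cy", 40)])

# Stand-ins for several sampled model outputs.
candidates = [
    "SELECT count(*) FROM singer WHERE age > 30",
    "SELECT count(name) FROM singer WHERE age > 30",
    "SELECT count(*) FROM singer",
    "SELECT count(*) FROM singers",  # invalid table, ignored in the vote
]
chosen = self_consistent_sql(conn, candidates)
print(chosen)
```

With n sampled candidates the generation cost is roughly n times that of a single call, which is the budget concern raised above.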
Failure Modes
Open-source LLMs can lag far behind API LLMs in raw in-context ability unless extensively fine-tuned.
Supervised fine-tuning may overfit to the chosen prompt format and make subsequent few-shot updates ineffective.

