Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
2
Why It Matters For Business
RAG and especially Graph RAG let teams query unfamiliar databases without costly fine-tuning, but they increase system complexity and latency.
Summary TLDR
This 15-page survey traces text-to-SQL systems from rule-based parsers to modern LLMs and retrieval-augmented generation (RAG). It catalogs datasets and metrics, compares in-context prompting, fine-tuning, and RAG approaches, and argues that Graph RAG—indexing schema and documents as a graph—improves schema linking, multi-hop reasoning, and cross-domain generalization while trading off latency and system complexity.
Problem Statement
Non-technical users need database answers but writing SQL is hard. Existing text-to-SQL models struggle with schema linking, ambiguous or multi-turn queries, rare SQL constructs, and cross-domain generalization. The survey asks: how do LLMs plus retrieval systems (especially Graph RAG) reduce these gaps, and what are the practical trade-offs?
Main Contribution
A concise history of text-to-SQL methods from rule-based to LLM-era pipelines.
A taxonomy that compares in-context learning, fine-tuning, and RAG-based systems for text-to-SQL.
A focused review of benchmarks, datasets, and metrics used to evaluate text-to-SQL models.
A deep look at Graph RAG: how graph-structured retrieval helps schema linking and multi-hop reasoning.
A practical discussion of remaining challenges: compute cost, schema drift, robustness, privacy, and explainability.
Key Findings
RAG improves cross-domain generalization and zero-/few-shot performance by fetching schema-specific context.
Graph RAG organizes schema and documents into a knowledge graph to improve schema linking and multihop reasoning.
RAG systems add runtime cost and can be slower due to retrieval latency and iterative refinement loops.
Benchmarks use two complementary metric families: content-matching (EM, CM) and execution-based (EX, VES).
SQL knowledge is widely needed: 51.52% of professional developers report using SQL (Stack Overflow stat cited).
RAG performance strongly depends on retrieval quality; poor retrieval degrades final SQL accuracy.
Who Should Care
What To Try In 7 Days
Build a small prototype: wire a vector DB to an off-the-shelf LLM and evaluate execution accuracy on a sample schema.
Measure retriever recall for schema metadata and tune the retriever before changing prompts or models.
If queries need cross-table reasoning, test a simple graph index (table/column nodes + FK edges) to see schema-link improvements.
Optimization Features
System Optimization
- Index pre-partitioning for graph retrieval (modularity/Leiden)
- Cache retrieved schema snippets
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- High runtime cost and latency from retrieval and iterative loops.
- Most systems struggle to adapt to frequent schema changes without retraining or incremental updates.
- Performance depends heavily on retriever quality; bad retrieval leads to wrong SQL.
- Graph RAG can be resource intensive to build and maintain for large schemas.
- Privacy and explainability issues remain unresolved for sensitive domains.
When Not To Use
- Low-latency applications that cannot tolerate retrieval latency.
- Very small or static schemas where fine-tuning would be cheaper and faster.
- Highly sensitive data scenarios without strong access controls for retrieved context.
Failure Modes
- Incorrect schema linking causing wrong joins or wrong tables returned.
- Retriever returns irrelevant or stale context, leading to hallucinated SQL.
- High-latency or failed retrievals causing timeouts or degraded UX.
- Iterative refinement loops converge slowly or fail due to repeated execution errors.
Core Entities
Models
- GPT-4
- GPT-3.5
- Codex
- LLaMA
- LLaMA2
- DataGpt-SQL-7B
- Codex/Code models (general)
Metrics
- Accuracy
- Exact Match (EM)
- Component Matching (CM)
- Valid Efficiency Score (VES)
Datasets
- Spider
- WikiSQL
- CoSQL
- SParC
- BIRD
- Spider-DK
- KaggleDBA
- ADVETA
- GeoQuery
- ATIS
- mSpider
- DuSQL
Benchmarks
- Spider
- WikiSQL
- BIRD
- CoSQL
- SParC
- ADVETA
- Real-World/Industry benchmarks (Financial SQL, ClinicalDB)
Context Entities
Models
- GPT-family in-context prompting
- Fine-tuned LLMs (DELlM, SQL-GEN variants)
Metrics
- Recall/Precision of retriever
- Latency and efficiency measures
Datasets
- Spider-variants (e.g., Spider-Syn)
- Benchmarks used for cross-lingual/multi-lingual testing
Benchmarks
- Table retrieval benchmarks (TARGET)
- Domain-specific benchmarks (Financial SQL, Nibiru)

