Survey of LLM-based text-to-SQL with a focus on Retrieval-Augmented and Graph RAG solutions

October 1, 20247 min

Overview

Decision SnapshotNeeds Validation

The survey compiles many recent systems and datasets and makes a case for Graph RAG; the ideas are practical but often lack large-scale published numbers and incur latency and implementation complexity.

Citations2

Evidence Strength0.60

Confidence0.85

Risk Signals12

Trust Signals

Findings with numeric evidence: 1/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/0

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Ali Mohammadjafari, Anthony S. Maida, Raju Gottumukkala

Links

Abstract / PDF

Why It Matters For Business

RAG and especially Graph RAG let teams query unfamiliar databases without costly fine-tuning, but they increase system complexity and latency.

Who Should Care

Summary TLDR

This 15-page survey traces text-to-SQL systems from rule-based parsers to modern LLMs and retrieval-augmented generation (RAG). It catalogs datasets and metrics, compares in-context prompting, fine-tuning, and RAG approaches, and argues that Graph RAG—indexing schema and documents as a graph—improves schema linking, multi-hop reasoning, and cross-domain generalization while trading off latency and system complexity.

Problem Statement

Non-technical users need database answers but writing SQL is hard. Existing text-to-SQL models struggle with schema linking, ambiguous or multi-turn queries, rare SQL constructs, and cross-domain generalization. The survey asks: how do LLMs plus retrieval systems (especially Graph RAG) reduce these gaps, and what are the practical trade-offs?

Main Contribution

A concise history of text-to-SQL methods from rule-based to LLM-era pipelines.

A taxonomy that compares in-context learning, fine-tuning, and RAG-based systems for text-to-SQL.

Key Findings

RAG improves cross-domain generalization and zero-/few-shot performance by fetching schema-specific context.

Practical UsePrototype a simple vector DB+retriever for unseen schemas before building costly fine-tuning; expect better zero-shot results on new databases.

Evidence RefSec II-B, Sec III-A (cross-domain discussion)

Graph RAG organizes schema and documents into a knowledge graph to improve schema linking and multihop reasoning.

Practical UseIf your application needs reliable joins, nested queries, or cross-table reasoning, invest in a graph-indexed retrieval layer to reduce schema ambiguity.

Evidence RefSec IV (Graph RAG novelty and modularity discussions)

What To Try In 7 Days

Build a small prototype: wire a vector DB to an off-the-shelf LLM and evaluate execution accuracy on a sample schema.

Measure retriever recall for schema metadata and tune the retriever before changing prompts or models.

If queries need cross-table reasoning, test a simple graph index (table/column nodes + FK edges) to see schema-link improvements.

Optimization Features

System Optimization
Index pre-partitioning for graph retrieval (modularity/Leiden)Cache retrieved schema snippets

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Risks & Boundaries

Limitations

High runtime cost and latency from retrieval and iterative loops.

Most systems struggle to adapt to frequent schema changes without retraining or incremental updates.

When Not To Use

Low-latency applications that cannot tolerate retrieval latency.

Very small or static schemas where fine-tuning would be cheaper and faster.

Failure Modes

Incorrect schema linking causing wrong joins or wrong tables returned.

Retriever returns irrelevant or stale context, leading to hallucinated SQL.

Core Entities

Models

GPT-4GPT-3.5CodexLLaMALLaMA2DataGpt-SQL-7BCodex/Code models (general)

Metrics

AccuracyExact Match (EM)Component Matching (CM)Valid Efficiency Score (VES)

Datasets

SpiderWikiSQLCoSQLSParCBIRDSpider-DKKaggleDBAADVETAGeoQueryATISmSpiderDuSQL

Benchmarks

SpiderWikiSQLBIRDCoSQLSParCADVETAReal-World/Industry benchmarks (Financial SQL, ClinicalDB)

Context Entities

Models

GPT-family in-context promptingFine-tuned LLMs (DELlM, SQL-GEN variants)

Metrics

Recall/Precision of retrieverLatency and efficiency measures

Datasets

Spider-variants (e.g., Spider-Syn)Benchmarks used for cross-lingual/multi-lingual testing

Benchmarks

Table retrieval benchmarks (TARGET)Domain-specific benchmarks (Financial SQL, Nibiru)