Survey of LLM-based text-to-SQL with a focus on Retrieval-Augmented and Graph RAG solutions

October 1, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

2

Authors

Ali Mohammadjafari, Anthony S. Maida, Raju Gottumukkala

Links

Abstract / PDF

Why It Matters For Business

RAG and especially Graph RAG let teams query unfamiliar databases without costly fine-tuning, but they increase system complexity and latency.

Summary TLDR

This 15-page survey traces text-to-SQL systems from rule-based parsers to modern LLMs and retrieval-augmented generation (RAG). It catalogs datasets and metrics, compares in-context prompting, fine-tuning, and RAG approaches, and argues that Graph RAG—indexing schema and documents as a graph—improves schema linking, multi-hop reasoning, and cross-domain generalization while trading off latency and system complexity.

Problem Statement

Non-technical users need database answers but writing SQL is hard. Existing text-to-SQL models struggle with schema linking, ambiguous or multi-turn queries, rare SQL constructs, and cross-domain generalization. The survey asks: how do LLMs plus retrieval systems (especially Graph RAG) reduce these gaps, and what are the practical trade-offs?

Main Contribution

A concise history of text-to-SQL methods from rule-based to LLM-era pipelines.

A taxonomy that compares in-context learning, fine-tuning, and RAG-based systems for text-to-SQL.

A focused review of benchmarks, datasets, and metrics used to evaluate text-to-SQL models.

A deep look at Graph RAG: how graph-structured retrieval helps schema linking and multi-hop reasoning.

A practical discussion of remaining challenges: compute cost, schema drift, robustness, privacy, and explainability.

Key Findings

RAG improves cross-domain generalization and zero-/few-shot performance by fetching schema-specific context.

Graph RAG organizes schema and documents into a knowledge graph to improve schema linking and multihop reasoning.

RAG systems add runtime cost and can be slower due to retrieval latency and iterative refinement loops.

Benchmarks use two complementary metric families: content-matching (EM, CM) and execution-based (EX, VES).

SQL knowledge is widely needed: 51.52% of professional developers report using SQL (Stack Overflow stat cited).

Numbers51.52%

RAG performance strongly depends on retrieval quality; poor retrieval degrades final SQL accuracy.

Who Should Care

What To Try In 7 Days

Build a small prototype: wire a vector DB to an off-the-shelf LLM and evaluate execution accuracy on a sample schema.

Measure retriever recall for schema metadata and tune the retriever before changing prompts or models.

If queries need cross-table reasoning, test a simple graph index (table/column nodes + FK edges) to see schema-link improvements.

Optimization Features

System Optimization

  • Index pre-partitioning for graph retrieval (modularity/Leiden)
  • Cache retrieved schema snippets

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • High runtime cost and latency from retrieval and iterative loops.
  • Most systems struggle to adapt to frequent schema changes without retraining or incremental updates.
  • Performance depends heavily on retriever quality; bad retrieval leads to wrong SQL.
  • Graph RAG can be resource intensive to build and maintain for large schemas.
  • Privacy and explainability issues remain unresolved for sensitive domains.

When Not To Use

  • Low-latency applications that cannot tolerate retrieval latency.
  • Very small or static schemas where fine-tuning would be cheaper and faster.
  • Highly sensitive data scenarios without strong access controls for retrieved context.

Failure Modes

  • Incorrect schema linking causing wrong joins or wrong tables returned.
  • Retriever returns irrelevant or stale context, leading to hallucinated SQL.
  • High-latency or failed retrievals causing timeouts or degraded UX.
  • Iterative refinement loops converge slowly or fail due to repeated execution errors.

Core Entities

Models

  • GPT-4
  • GPT-3.5
  • Codex
  • LLaMA
  • LLaMA2
  • DataGpt-SQL-7B
  • Codex/Code models (general)

Metrics

  • Accuracy
  • Exact Match (EM)
  • Component Matching (CM)
  • Valid Efficiency Score (VES)

Datasets

  • Spider
  • WikiSQL
  • CoSQL
  • SParC
  • BIRD
  • Spider-DK
  • KaggleDBA
  • ADVETA
  • GeoQuery
  • ATIS
  • mSpider
  • DuSQL

Benchmarks

  • Spider
  • WikiSQL
  • BIRD
  • CoSQL
  • SParC
  • ADVETA
  • Real-World/Industry benchmarks (Financial SQL, ClinicalDB)

Context Entities

Models

  • GPT-family in-context prompting
  • Fine-tuned LLMs (DELlM, SQL-GEN variants)

Metrics

  • Recall/Precision of retriever
  • Latency and efficiency measures

Datasets

  • Spider-variants (e.g., Spider-Syn)
  • Benchmarks used for cross-lingual/multi-lingual testing

Benchmarks

  • Table retrieval benchmarks (TARGET)
  • Domain-specific benchmarks (Financial SQL, Nibiru)