Use LLM-hallucinated mini-schemas to retrieve small, high-recall DB schema subsets for Text-to-SQL

November 2, 20237 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.6

Citation Count

0

Authors

Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti

Links

Abstract / PDF

Why It Matters For Business

If your product queries very large databases, CRUSH reduces token costs and increases correct SQL generation by selecting a smaller, higher-quality schema subset to send to an LLM.

Summary TLDR

CRUSH4SQL uses an LLM to 'hallucinate' a tiny, general-purpose schema from a user question, then uses those hallucinated names as probes to retrieve a compact, high-recall subset of a very large database schema. A small combinatorial optimizer then selects a connected subset for downstream Text-to-SQL. On three new large-schema benchmarks (SpiderUnion 4.5k cols, BirdUnion 798 cols, SocialDB ~18k cols), CRUSH raises retrieval recall and improves SQL accuracy versus single-vector dense retrieval and token-level baselines.

Problem Statement

Modern Text-to-SQL models need a small high-quality subset of a client's schema to fit prompts and control cost. Dense nearest-neighbor retrieval over individual columns is brittle on very large schemas (thousands of columns). We need a way to retrieve sets of schema elements that collectively cover the question while staying small and affordable.

Main Contribution

CRUSH method: prompt an LLM to hallucinate a minimal schema for the question, use those hallucinated items as probes for dense retrieval, then run a collective subset-selection objective that maximizes coverage and schema connectivity.

Three large-schema benchmarks: SpiderUnion (4,502 columns), BirdUnion (798 columns), and SocialDB (≈18k columns) for schema-subsetting evaluation.

Empirical study showing higher recall and improved downstream Text-to-SQL accuracy compared to single embedding and token-level baselines, plus ablations of design choices and prompt variants.

Key Findings

CRUSH improves column recall at moderate budget on SpiderUnion.

Numbersr@10 = 0.83 (CRUSH) vs 0.77 (best baseline)

CRUSH improves table/column recall on a real large warehouse.

Numbersr@10 = 0.58 (CRUSH) vs 0.49 (best baseline) on SocialDB

Better retrieval leads to better generated SQL accuracy.

NumbersEM/EX at B=10: 0.52/0.60 (CRUSH) vs 0.45/0.53 (Single DPR)

Token-level (ColBERT) retrieval underperformed single-vector dense retrieval here.

Numbersr@10 SpiderUnion: ColBERT 0.72 vs SGPT single DPR 0.76

Results

column recall (r@10) on SpiderUnion

Value0.83 (CRUSH) vs 0.77 (best baseline)

BaselineSingle DPR (best)

table recall (r@10) on SocialDB

Value0.58 (CRUSH) vs 0.49 (best baseline)

BaselineSingle DPR (OpenAI)

Exact Match / Execution Match (EM/EX) at B=10

Value0.52 / 0.60 (CRUSH) vs 0.45 / 0.53 (Single DPR)

BaselineSingle DPR (OpenAI)

Who Should Care

What To Try In 7 Days

Run CRUSH probe pipeline: few-shot GPT-3 hallucination → SGPT embeddings → candidate retrieval → greedy subset selection.

Compare r@10 and EM/EX using current dense-retrieval baseline on a few production queries.

Tune budget B (10–30 columns) and check SQL accuracy; beware larger budgets can hurt accuracy.

Optimization Features

Token Efficiency

  • budgeted schema subset

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Hallucination is currently unguided by any compressed client schema and may miss client-specific vocabulary.
  • Edge/connection weights e(d,d') are hardwired constants rather than learned.
  • Relies on paid LLM embedding/ranking APIs and SGPT embeddings, with associated cost and latency.

When Not To Use

  • If the database schema is small enough to fit in-context, simple inclusion may be cheaper and simpler.
  • When policy forbids sending any question text to external LLM services.
  • If you cannot tolerate extra API latency or cost from an LLM in the retrieval loop.

Failure Modes

  • LLM hallucination can produce probes that miss domain-specific names, causing recall loss.
  • Hallucinated probes can push retrieval toward common generic columns (Name, id) unless entropy discounting is used.
  • Large selection budgets can degrade downstream SQL accuracy by introducing distracting schema elements.

Core Entities

Models

  • text-davinci-003
  • SGPT
  • text-embedding-ada-002
  • RESDSQL
  • ColBERT

Metrics

  • recall
  • Exact Match (EM)
  • Execution Match (EX)

Datasets

  • SpiderUnion
  • BirdUnion
  • SocialDB
  • SPIDER
  • BIRD

Benchmarks

  • SpiderUnion
  • BirdUnion
  • SocialDB