Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If your product queries very large databases, CRUSH reduces token costs and increases correct SQL generation by selecting a smaller, higher-quality schema subset to send to an LLM.
Summary TLDR
CRUSH4SQL uses an LLM to 'hallucinate' a tiny, general-purpose schema from a user question, then uses those hallucinated names as probes to retrieve a compact, high-recall subset of a very large database schema. A small combinatorial optimizer then selects a connected subset for downstream Text-to-SQL. On three new large-schema benchmarks (SpiderUnion 4.5k cols, BirdUnion 798 cols, SocialDB ~18k cols), CRUSH raises retrieval recall and improves SQL accuracy versus single-vector dense retrieval and token-level baselines.
Problem Statement
Modern Text-to-SQL models need a small high-quality subset of a client's schema to fit prompts and control cost. Dense nearest-neighbor retrieval over individual columns is brittle on very large schemas (thousands of columns). We need a way to retrieve sets of schema elements that collectively cover the question while staying small and affordable.
Main Contribution
CRUSH method: prompt an LLM to hallucinate a minimal schema for the question, use those hallucinated items as probes for dense retrieval, then run a collective subset-selection objective that maximizes coverage and schema connectivity.
Three large-schema benchmarks: SpiderUnion (4,502 columns), BirdUnion (798 columns), and SocialDB (≈18k columns) for schema-subsetting evaluation.
Empirical study showing higher recall and improved downstream Text-to-SQL accuracy compared to single embedding and token-level baselines, plus ablations of design choices and prompt variants.
Key Findings
CRUSH improves column recall at moderate budget on SpiderUnion.
CRUSH improves table/column recall on a real large warehouse.
Better retrieval leads to better generated SQL accuracy.
Token-level (ColBERT) retrieval underperformed single-vector dense retrieval here.
Results
column recall (r@10) on SpiderUnion
table recall (r@10) on SocialDB
Exact Match / Execution Match (EM/EX) at B=10
Who Should Care
What To Try In 7 Days
Run CRUSH probe pipeline: few-shot GPT-3 hallucination → SGPT embeddings → candidate retrieval → greedy subset selection.
Compare r@10 and EM/EX using current dense-retrieval baseline on a few production queries.
Tune budget B (10–30 columns) and check SQL accuracy; beware larger budgets can hurt accuracy.
Optimization Features
Token Efficiency
- budgeted schema subset
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Hallucination is currently unguided by any compressed client schema and may miss client-specific vocabulary.
- Edge/connection weights e(d,d') are hardwired constants rather than learned.
- Relies on paid LLM embedding/ranking APIs and SGPT embeddings, with associated cost and latency.
When Not To Use
- If the database schema is small enough to fit in-context, simple inclusion may be cheaper and simpler.
- When policy forbids sending any question text to external LLM services.
- If you cannot tolerate extra API latency or cost from an LLM in the retrieval loop.
Failure Modes
- LLM hallucination can produce probes that miss domain-specific names, causing recall loss.
- Hallucinated probes can push retrieval toward common generic columns (Name, id) unless entropy discounting is used.
- Large selection budgets can degrade downstream SQL accuracy by introducing distracting schema elements.
Core Entities
Models
- text-davinci-003
- SGPT
- text-embedding-ada-002
- RESDSQL
- ColBERT
Metrics
- recall
- Exact Match (EM)
- Execution Match (EX)
Datasets
- SpiderUnion
- BirdUnion
- SocialDB
- SPIDER
- BIRD
Benchmarks
- SpiderUnion
- BirdUnion
- SocialDB

