Overview
Production Readiness
0.35
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.
Summary TLDR
This paper introduces SWAN, a 120-question benchmark that tests queries needing both database rows and world knowledge. It presents HQDL (schema expansion + LLMs) and evaluates BlendSQL-style UDFs. On SWAN, GPT-4 Turbo (5-shot) reaches 40.0% execution accuracy and 48.2% data factuality (F1). The work shows hybrid queries are promising but currently unreliable and costly; it lists optimization paths like caching, predicate pushdown, and RAG.
Problem Statement
Relational databases use a closed-world assumption and cannot answer questions that need knowledge outside stored rows. There is no cross-domain benchmark or clear baselines for combining SQL and LLMs to answer such "beyond-database" questions.
Main Contribution
SWAN benchmark: 120 beyond-database questions across 4 real-world databases (European Football, Formula One, California Schools, Superhero).
HQDL: a baseline that expands schema, uses LLMs to materialize missing columns, then runs normal SQL.
Hybrid Query UDFs: provided BlendSQL-style hybrid queries to call LLMs from SQL directly.
Evaluation of GPT-3.5 and GPT-4 Turbo on SWAN with metrics for execution accuracy, data factuality (F1), and token costs.
Analysis of optimization opportunities: caching, predicate pushdown, batching, parallel LLM calls, and RAG/fine-tuning suggestions.
Key Findings
SWAN created 120 beyond-database questions across 4 curated databases.
HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.
Data generated by GPT-4 Turbo (5-shot) scores 48.2% average F1 against ground truth.
Hybrid UDFs used substantially more tokens than HQDL in zero-shot runs.
Few-shot examples improve performance: GPT-4 Turbo rose from 31.6% (0-shot) to 40.0% (5-shot).
Results
Accuracy
Data factuality (average F1)
Token usage (zero-shot)
Who Should Care
What To Try In 7 Days
Run SWAN on your pipeline to measure hybrid-query gaps.
Add a few-shot prompt template and test execution accuracy improvements.
Materialize a small schema expansion (HQDL) for one table and measure token/cost savings vs per-query LLM calls.
Optimization Features
Token Efficiency
- prompt few-shot engineering
- cache/reuse generated rows instead of re-querying
Infra Optimization
- asynchronous hybrid query execution
- improved caching layer mapping keys to table rows
System Optimization
- materialize LLM outputs as tables (schema expansion)
- query rewriting to reuse cached LLM outputs
Inference Optimization
- batching LLM calls
- parallelized LLM calls
- predicate pushdown to reduce LLM outputs
Reproducibility
Code Urls
Data Urls
- https://github.com/ZhaoFuheng/SWAN/
- Bird benchmark (referenced as data source)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Moderate accuracy: best EX 40% and F1 ~48% on evaluated models.
- Benchmarks limited to 4 curated domains and 120 queries.
- Prompts and missing-column lists are provided manually; full automation left to future work.
- LLM outputs suffer formatting and hallucination errors that complicate extraction.
- Hybrid UDF caching is coarse: prompt-to-output mapping limits reuse.
When Not To Use
- For mission-critical answers that must be correct with high confidence.
- In low-latency systems without parallel LLM execution and caching.
- Without verification or human review for sensitive data.
Failure Modes
- LLM hallucinations produce wrong or inconsistent cell values.
- Format errors break parser and data extraction steps.
- Batching multiple rows in one LLM call increases per-row errors.
- Cached LLM outputs miss semantically similar prompts, causing redundant calls.
Core Entities
Models
- GPT-3.5 Turbo
- GPT-4 Turbo
- ChatGPT
Metrics
- Accuracy
- F1 (data factuality)
- Input tokens
- Output tokens
Datasets
- SWAN
- Bird
- Spider
Benchmarks
- SWAN
- Bird
- Spider

