Overview
The benchmark and baselines are useful for research and prototyping, but execution accuracy and data factuality are low (at best 40.0% EX and 48.2% F1 on SWAN) and token costs are nontrivial, so plan for extra verification and caching before deploying.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 35%
Novelty: 60%
Why It Matters For Business
Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.
Summary TLDR
This paper introduces SWAN, a 120-question benchmark that tests queries needing both database rows and world knowledge. It presents HQDL (schema expansion + LLMs) and evaluates BlendSQL-style UDFs. On SWAN, GPT-4 Turbo (5-shot) reaches 40.0% execution accuracy and 48.2% data factuality (F1). The work shows hybrid queries are promising but currently unreliable and costly; it lists optimization paths like caching, predicate pushdown, and RAG.
Problem Statement
Relational databases use a closed-world assumption and cannot answer questions that need knowledge outside stored rows. There is no cross-domain benchmark or clear baselines for combining SQL and LLMs to answer such "beyond-database" questions.
Main Contribution
SWAN benchmark: 120 beyond-database questions across 4 real-world databases (European Football, Formula One, California Schools, Superhero).
HQDL: a baseline that expands schema, uses LLMs to materialize missing columns, then runs normal SQL.
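The HQDL flow described above (expand the schema, let an LLM fill the new column, then run ordinary SQL) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `fake_llm`, `materialize_column`, the prompt wording, and the toy `superhero` table are all assumptions made for the example, and the LLM is replaced by a canned lookup so the snippet runs offline.

```python
import sqlite3

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned answers."""
    known = {"Superman": "DC Comics", "Iron Man": "Marvel Comics"}
    for name, publisher in known.items():
        if name in prompt:
            return publisher
    return "unknown"

def materialize_column(conn, table, key_col, new_col, llm=fake_llm):
    """Add new_col to the table and fill it with one LLM call per row."""
    # Identifiers are interpolated directly; acceptable for a sketch with
    # trusted names, not for untrusted input.
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {new_col} TEXT")
    keys = [row[0] for row in conn.execute(f"SELECT {key_col} FROM {table}")]
    for key in keys:
        value = llm(f"What is the {new_col} of {key}? Answer with the value only.")
        conn.execute(
            f"UPDATE {table} SET {new_col} = ? WHERE {key_col} = ?", (value, key)
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superhero (name TEXT PRIMARY KEY, height_cm INTEGER)")
conn.executemany(
    "INSERT INTO superhero VALUES (?, ?)", [("Superman", 191), ("Iron Man", 198)]
)

# Step 1 and 2: schema expansion plus LLM materialization of the missing column.
materialize_column(conn, "superhero", "name", "publisher")

# Step 3: a normal SQL query over the expanded table.
rows = conn.execute(
    "SELECT name FROM superhero WHERE publisher = 'Marvel Comics'"
).fetchall()
print(rows)
```

Because the generated column is stored in the database, later queries reuse it without further LLM calls, but any hallucinated cell value also persists until it is verified or regenerated.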
Key Findings
SWAN comprises 120 beyond-database questions across 4 curated databases.
HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.
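Execution accuracy (EX) is commonly computed as the fraction of questions whose predicted query returns the same result as the gold query. The sketch below assumes an order-insensitive, duplicate-sensitive comparison and counts failing queries as misses; the exact comparison rules used for SWAN are not stated in this summary, and the `driver` table and helper names are hypothetical.

```python
import sqlite3
from collections import Counter

def same_result(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs with matching results."""
    if not pairs:
        return 0.0
    hits = sum(same_result(conn, p, g) for p, g in pairs)
    return hits / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE driver (name TEXT, wins INTEGER)")
conn.executemany("INSERT INTO driver VALUES (?, ?)", [("A", 3), ("B", 0)])

pairs = [
    # Different SQL, same rows: counted as correct.
    ("SELECT name FROM driver WHERE wins > 0", "SELECT name FROM driver WHERE wins >= 1"),
    # Extra row in the prediction: incorrect.
    ("SELECT name FROM driver", "SELECT name FROM driver WHERE wins > 0"),
    # Invalid column: execution error, incorrect.
    ("SELECT nope FROM driver", "SELECT name FROM driver"),
]
ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.3f}")
```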
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Execution accuracy (EX) | GPT-4 Turbo 5-shot: 40.0% | GPT-4 Turbo 0-shot: 31.6% | +8.4 pp | SWAN (overall) | Table 2 reports overall EX by model and shots | Table 2 |
| Data factuality (average F1) | GPT-4 Turbo 5-shot: 48.2% | GPT-3.5 Turbo 5-shot: 42.7% | +5.5 pp | SWAN (average cells) | Table 4 shows F1 for generated data under few-shot settings | Table 4 |
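To make the data factuality metric concrete, here is one way an F1 score over generated cells could be computed. This summary does not give the paper's exact definition, so the sketch assumes exact-match comparison of (key, value) cells, with precision taken over generated cells and recall over gold cells; `cell_f1` and the sample values are illustrative only.

```python
def cell_f1(generated: dict, gold: dict) -> float:
    """F1 between generated and gold cells, treated as (key, value) pairs."""
    gen_cells = set(generated.items())
    gold_cells = set(gold.items())
    true_positives = len(gen_cells & gold_cells)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(gen_cells)
    recall = true_positives / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

gold = {
    "Superman": "DC Comics",
    "Iron Man": "Marvel Comics",
    "Hellboy": "Dark Horse Comics",
}
# One correct cell, one wrong cell, one missing cell.
generated = {"Superman": "DC Comics", "Iron Man": "DC Comics"}

score = cell_f1(generated, gold)
print(f"F1 = {score:.2f}")
```

In this toy case precision is 1/2 and recall is 1/3, so a wrong value and a missing value both lower the score.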
What To Try In 7 Days
Run SWAN on your pipeline to measure hybrid-query gaps.
Add a few-shot prompt template and test execution accuracy improvements.
Materialize a small schema expansion (HQDL) for one table and measure token/cost savings vs per-query LLM calls.
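For the last experiment, a quick way to see the effect of reuse is to count LLM calls with and without a cache. The sketch below uses call counts as a rough proxy for token cost and a stub in place of a real model; `fake_llm`, `cached_llm`, and the query workload are assumptions for illustration.

```python
from functools import lru_cache

call_count = 0

def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub that records how often it is invoked."""
    global call_count
    call_count += 1
    return f"answer to: {prompt}"

@lru_cache(maxsize=None)
def cached_llm(prompt: str) -> str:
    """Same stub behind an in-memory cache keyed on the exact prompt."""
    return fake_llm(prompt)

# Three queries that each need values for two entities, with overlap.
queries = [["Superman", "Iron Man"], ["Iron Man", "Hellboy"], ["Superman", "Hellboy"]]

for keys in queries:
    for key in keys:
        fake_llm(f"publisher of {key}")
uncached_calls = call_count

call_count = 0
for keys in queries:
    for key in keys:
        cached_llm(f"publisher of {key}")
cached_calls = call_count

print(uncached_calls, cached_calls)
```

An exact-match cache only helps when prompts repeat verbatim, so keeping the prompt template stable per cell matters as much as the cache itself.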
Risks & Boundaries
Limitations
Moderate accuracy: best EX 40% and F1 ~48% on evaluated models.
Benchmark limited to 4 curated domains and 120 questions.
When Not To Use
For mission-critical answers that must be correct with high confidence.
In low-latency systems without parallel LLM execution and caching.
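Where latency matters, independent per-row LLM calls can be issued concurrently. The following is a minimal sketch using a thread pool, assuming the calls are I/O-bound and independent; `slow_fake_llm` simulates network latency with a sleep and is not a real client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fake_llm(prompt: str) -> str:
    """Hypothetical LLM stub that simulates request latency."""
    time.sleep(0.05)
    return prompt.upper()

prompts = [f"publisher of hero {i}" for i in range(8)]

# pool.map returns results in input order, so answers line up with prompts.
with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(slow_fake_llm, prompts))

print(answers[0])
```

Real deployments would also need to respect provider rate limits, which this sketch ignores.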
Failure Modes
LLM hallucinations produce wrong or inconsistent cell values.
Format errors in LLM output break the parser and data extraction steps.
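One common mitigation for format errors is to validate each LLM response before it reaches the database and to retry on failure. The sketch below assumes the model is asked for a JSON object with a fixed set of keys; that output format, and the helpers `parse_cells` and `generate_row`, are assumptions for illustration rather than the format used in the paper.

```python
import json

def parse_cells(raw: str, expected_columns):
    """Return a column-to-value dict, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if set(data) != set(expected_columns):
        return None  # missing or unexpected keys
    return {col: str(data[col]) for col in expected_columns}

def generate_row(llm, prompt, expected_columns, max_attempts=3):
    """Call the LLM until its output parses, up to max_attempts times."""
    for _ in range(max_attempts):
        parsed = parse_cells(llm(prompt), expected_columns)
        if parsed is not None:
            return parsed
    return None  # caller decides whether to store NULL or flag the row

# Stub LLM: the first reply is malformed, the second is valid JSON.
responses = iter(['publisher = Marvel Comics', '{"publisher": "Marvel Comics"}'])
row = generate_row(
    lambda prompt: next(responses), "publisher of Iron Man as JSON", ["publisher"]
)
print(row)
```

Validation of this kind catches structural errors only; a well-formed but hallucinated value still passes and needs separate verification.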

