SWAN: the first benchmark and baselines for mixing SQL databases with LLMs

August 1, 20247 min

Overview

Decision SnapshotNeeds Validation

The benchmark and baselines are useful for research and prototyping, but execution accuracy and data factuality are low and token costs are nontrivial, so plan extra verification and caching before deploying.

Citations0

Evidence Strength0.80

Confidence0.80

Risk Signals12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 60%

Authors

Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.

Who Should Care

Summary TLDR

This paper introduces SWAN, a 120-question benchmark that tests queries needing both database rows and world knowledge. It presents HQDL (schema expansion + LLMs) and evaluates BlendSQL-style UDFs. On SWAN, GPT-4 Turbo (5-shot) reaches 40.0% execution accuracy and 48.2% data factuality (F1). The work shows hybrid queries are promising but currently unreliable and costly; it lists optimization paths like caching, predicate pushdown, and RAG.

Problem Statement

Relational databases use a closed-world assumption and cannot answer questions that need knowledge outside stored rows. There is no cross-domain benchmark or clear baselines for combining SQL and LLMs to answer such "beyond-database" questions.

Main Contribution

SWAN benchmark: 120 beyond-database questions across 4 real-world databases (European Football, Formula One, California Schools, Superhero).

HQDL: a baseline that expands schema, uses LLMs to materialize missing columns, then runs normal SQL.

Key Findings

SWAN created 120 beyond-database questions across 4 curated databases.

Numbers120 questions; 4 databases

Practical UseUse SWAN to benchmark hybrid DB+LLM systems and compare approaches on the same tasks.

Evidence RefSection 3.5

HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.

NumbersEX = 40.0% (GPT-4 Turbo, 5-shot)

Practical UseExpect many incorrect query results today; do not rely on out-of-the-box LLMs for full correctness.

Evidence RefTable 2

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
AccuracyGPT-4 Turbo 5-shot: 40.0%GPT-4 Turbo 0-shot: 31.6%+8.4 pptSWAN (overall)Table 2 reports overall EX by model and shotsTable 2
Data factuality (average F1)GPT-4 Turbo 5-shot: 48.2%GPT-3.5 Turbo 5-shot: 42.7%+5.5 pptSWAN (average cells)Table 4 shows F1 for generated data under few-shot settingsTable 4

What To Try In 7 Days

Run SWAN on your pipeline to measure hybrid-query gaps.

Add a few-shot prompt template and test execution accuracy improvements.

Materialize a small schema expansion (HQDL) for one table and measure token/cost savings vs per-query LLM calls.

Optimization Features

Token Efficiency
prompt few-shot engineeringcache/reuse generated rows instead of re-querying
Infra Optimization
asynchronous hybrid query executionimproved caching layer mapping keys to table rows
System Optimization
materialize LLM outputs as tables (schema expansion)query rewriting to reuse cached LLM outputs
Inference Optimization
batching LLM callsparallelized LLM callspredicate pushdown to reduce LLM outputs

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

https://github.com/ZhaoFuheng/SWAN/Bird benchmark (referenced as data source)

Risks & Boundaries

Limitations

Moderate accuracy: best EX 40% and F1 ~48% on evaluated models.

Benchmarks limited to 4 curated domains and 120 queries.

When Not To Use

For mission-critical answers that must be correct with high confidence.

In low-latency systems without parallel LLM execution and caching.

Failure Modes

LLM hallucinations produce wrong or inconsistent cell values.

Format errors break parser and data extraction steps.

Core Entities

Models

GPT-3.5 TurboGPT-4 TurboChatGPT

Metrics

AccuracyF1 (data factuality)Input tokensOutput tokens

Datasets

SWANBirdSpider

Benchmarks

SWANBirdSpider