SWAN: the first benchmark and baselines for mixing SQL databases with LLMs

Overview

Decision SnapshotNeeds Validation

The benchmark and baselines are useful for research and prototyping, but execution accuracy and data factuality are low and token costs are nontrivial, so plan extra verification and caching before deploying.

Citations0

Evidence Strength0.80

Confidence0.80

Risk Signals12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 35%

Novelty: 60%

Authors

Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.

Who Should Care

CTO Product Manager ML Engineer Data Scientist Engineering Lead Founder

Summary TLDR

This paper introduces SWAN, a 120-question benchmark that tests queries needing both database rows and world knowledge. It presents HQDL (schema expansion + LLMs) and evaluates BlendSQL-style UDFs. On SWAN, GPT-4 Turbo (5-shot) reaches 40.0% execution accuracy and 48.2% data factuality (F1). The work shows hybrid queries are promising but currently unreliable and costly; it lists optimization paths like caching, predicate pushdown, and RAG.

Problem Statement

Relational databases use a closed-world assumption and cannot answer questions that need knowledge outside stored rows. There is no cross-domain benchmark or clear baselines for combining SQL and LLMs to answer such "beyond-database" questions.

Main Contribution

SWAN benchmark: 120 beyond-database questions across 4 real-world databases (European Football, Formula One, California Schools, Superhero).

HQDL: a baseline that expands schema, uses LLMs to materialize missing columns, then runs normal SQL.

Key Findings

SWAN created 120 beyond-database questions across 4 curated databases.

Numbers120 questions; 4 databases

Practical UseUse SWAN to benchmark hybrid DB+LLM systems and compare approaches on the same tasks.

Evidence RefSection 3.5

HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.

NumbersEX = 40.0% (GPT-4 Turbo, 5-shot)

Practical UseExpect many incorrect query results today; do not rely on out-of-the-box LLMs for full correctness.

Evidence RefTable 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4 Turbo 5-shot: 40.0%	GPT-4 Turbo 0-shot: 31.6%	+8.4 ppt	SWAN (overall)	Table 2 reports overall EX by model and shots	Table 2
Data factuality (average F1)	GPT-4 Turbo 5-shot: 48.2%	GPT-3.5 Turbo 5-shot: 42.7%	+5.5 ppt	SWAN (average cells)	Table 4 shows F1 for generated data under few-shot settings	Table 4

What To Try In 7 Days

Run SWAN on your pipeline to measure hybrid-query gaps.

Add a few-shot prompt template and test execution accuracy improvements.

Materialize a small schema expansion (HQDL) for one table and measure token/cost savings vs per-query LLM calls.

Optimization Features

Token Efficiency

prompt few-shot engineeringcache/reuse generated rows instead of re-querying

Infra Optimization

asynchronous hybrid query executionimproved caching layer mapping keys to table rows

System Optimization

materialize LLM outputs as tables (schema expansion)query rewriting to reuse cached LLM outputs

Inference Optimization

batching LLM callsparallelized LLM callspredicate pushdown to reduce LLM outputs

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/ZhaoFuheng/SWAN/

Data URLs

https://github.com/ZhaoFuheng/SWAN/Bird benchmark (referenced as data source)

Risks & Boundaries

Limitations

Moderate accuracy: best EX 40% and F1 ~48% on evaluated models.

Benchmarks limited to 4 curated domains and 120 queries.

When Not To Use

For mission-critical answers that must be correct with high confidence.

In low-latency systems without parallel LLM execution and caching.

Failure Modes

LLM hallucinations produce wrong or inconsistent cell values.

Format errors break parser and data extraction steps.

Core Entities

Models

GPT-3.5 TurboGPT-4 TurboChatGPT

Metrics

AccuracyF1 (data factuality)Input tokensOutput tokens

Datasets

SWANBirdSpider

Benchmarks

SWANBirdSpider

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

SWAN created 120 beyond-database questions across 4 curated databases.

HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

A realistic benchmark and frozen-web environment for testing web research agents

Key finding

GeneAgent: an LLM agent that queries biology databases to verify and improve gene‑set function explanations

Key finding

Route simple queries straight to fast tools; use memory + planner only for complex job-career requests to cut latency and improve accuracy.

Key finding

DQABench: a 200k QA benchmark and modular testbed to measure LLMs on real database questions

Key finding

FinS-Pilot: a 316-query, user-driven benchmark that tests real-time financial RAG with live API data

Key finding