SWAN: the first benchmark and baselines for mixing SQL databases with LLMs

August 1, 20247 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Fuheng Zhao, Divyakant Agrawal, Amr El Abbadi

Links

Abstract / PDF

Why It Matters For Business

Combining SQL and LLMs can answer questions that databases alone cannot, but current methods are error-prone and costly; invest in verification, caching, and prompt design before production use.

Summary TLDR

This paper introduces SWAN, a 120-question benchmark that tests queries needing both database rows and world knowledge. It presents HQDL (schema expansion + LLMs) and evaluates BlendSQL-style UDFs. On SWAN, GPT-4 Turbo (5-shot) reaches 40.0% execution accuracy and 48.2% data factuality (F1). The work shows hybrid queries are promising but currently unreliable and costly; it lists optimization paths like caching, predicate pushdown, and RAG.

Problem Statement

Relational databases use a closed-world assumption and cannot answer questions that need knowledge outside stored rows. There is no cross-domain benchmark or clear baselines for combining SQL and LLMs to answer such "beyond-database" questions.

Main Contribution

SWAN benchmark: 120 beyond-database questions across 4 real-world databases (European Football, Formula One, California Schools, Superhero).

HQDL: a baseline that expands schema, uses LLMs to materialize missing columns, then runs normal SQL.

Hybrid Query UDFs: provided BlendSQL-style hybrid queries to call LLMs from SQL directly.

Evaluation of GPT-3.5 and GPT-4 Turbo on SWAN with metrics for execution accuracy, data factuality (F1), and token costs.

Analysis of optimization opportunities: caching, predicate pushdown, batching, parallel LLM calls, and RAG/fine-tuning suggestions.

Key Findings

SWAN created 120 beyond-database questions across 4 curated databases.

Numbers120 questions; 4 databases

HQDL with GPT-4 Turbo (5-shot) achieves 40.0% execution accuracy on SWAN.

NumbersEX = 40.0% (GPT-4 Turbo, 5-shot)

Data generated by GPT-4 Turbo (5-shot) scores 48.2% average F1 against ground truth.

NumbersF1 = 48.2% (GPT-4 Turbo, 5-shot)

Hybrid UDFs used substantially more tokens than HQDL in zero-shot runs.

NumbersHQDL input 6.3M / output 1.5M vs UDFs input 23M / output 2M

Few-shot examples improve performance: GPT-4 Turbo rose from 31.6% (0-shot) to 40.0% (5-shot).

NumbersEX 0-shot 31.6% -> 5-shot 40.0% (+8.4%)

Results

Accuracy

ValueGPT-4 Turbo 5-shot: 40.0%

BaselineGPT-4 Turbo 0-shot: 31.6%

Data factuality (average F1)

ValueGPT-4 Turbo 5-shot: 48.2%

BaselineGPT-3.5 Turbo 5-shot: 42.7%

Token usage (zero-shot)

ValueHQDL input 6.3M / output 1.5M tokens

BaselineHQ UDFs input 23M / output 2M tokens

Who Should Care

What To Try In 7 Days

Run SWAN on your pipeline to measure hybrid-query gaps.

Add a few-shot prompt template and test execution accuracy improvements.

Materialize a small schema expansion (HQDL) for one table and measure token/cost savings vs per-query LLM calls.

Optimization Features

Token Efficiency

  • prompt few-shot engineering
  • cache/reuse generated rows instead of re-querying

Infra Optimization

  • asynchronous hybrid query execution
  • improved caching layer mapping keys to table rows

System Optimization

  • materialize LLM outputs as tables (schema expansion)
  • query rewriting to reuse cached LLM outputs

Inference Optimization

  • batching LLM calls
  • parallelized LLM calls
  • predicate pushdown to reduce LLM outputs

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Moderate accuracy: best EX 40% and F1 ~48% on evaluated models.
  • Benchmarks limited to 4 curated domains and 120 queries.
  • Prompts and missing-column lists are provided manually; full automation left to future work.
  • LLM outputs suffer formatting and hallucination errors that complicate extraction.
  • Hybrid UDF caching is coarse: prompt-to-output mapping limits reuse.

When Not To Use

  • For mission-critical answers that must be correct with high confidence.
  • In low-latency systems without parallel LLM execution and caching.
  • Without verification or human review for sensitive data.

Failure Modes

  • LLM hallucinations produce wrong or inconsistent cell values.
  • Format errors break parser and data extraction steps.
  • Batching multiple rows in one LLM call increases per-row errors.
  • Cached LLM outputs miss semantically similar prompts, causing redundant calls.

Core Entities

Models

  • GPT-3.5 Turbo
  • GPT-4 Turbo
  • ChatGPT

Metrics

  • Accuracy
  • F1 (data factuality)
  • Input tokens
  • Output tokens

Datasets

  • SWAN
  • Bird
  • Spider

Benchmarks

  • SWAN
  • Bird
  • Spider