AP-SQL: combine a small fine-tuned schema filter, example retrieval, and thought-style prompts to run Text-to-SQL at lower cost

Overview

Decision Snapshot: Needs Validation

The paper shows practical improvements on Spider using a clear pipeline, but evidence is limited to benchmark runs and a single ablation-like configuration.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.72

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 8/8

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Zetong Tang, Qian Ma, Di Wu

Links

Abstract / PDF / Data

Why It Matters For Business

AP-SQL offers a practical way to run reliable Text-to-SQL with smaller models and lower inference cost by pruning schema context, reusing examples, and using structured prompts.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

AP-SQL is a modular Text-to-SQL pipeline that targets low-resource settings. It fine-tunes a small Qwen model to filter schemas, retrieves Top-K example NL–SQL pairs (K=3) for in-context help, then uses prompt-driven schema linking and two prompt styles (Chain-of-Thought for simple queries, Graph-of-Thought for complex queries) to generate SQL. On the Spider benchmark, AP-SQL yields small but consistent gains in Execution Accuracy (EX) and Test Suite (TS) accuracy over prior prompt-based systems across several LLMs. The system reduces prompt size by keeping only the top-3 tables and the top-3 columns per table, and it decouples schema linking from generation to lower inference cost.
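The choice between the two prompt styles can be pictured with a minimal Python sketch. The summary does not say how AP-SQL decides that a query is simple or complex, so the `is_complex` heuristic, the cue words, and both templates below are assumptions made for illustration, not the paper's prompts.

```python
# Hypothetical prompt templates; the paper's actual prompt wording is not reproduced here.
COT_TEMPLATE = (
    "Schema:\n{schema}\n\nExamples:\n{examples}\n\n"
    "Question: {question}\n"
    "Think step by step, then write the SQL query."
)
GOT_TEMPLATE = (
    "Schema:\n{schema}\n\nExamples:\n{examples}\n\n"
    "Question: {question}\n"
    "Break the question into sub-questions, note how they depend on each other, "
    "solve each one, then merge the partial results into one SQL query."
)


def is_complex(question: str, linked_tables: list[str]) -> bool:
    """Assumed heuristic: multi-table questions or certain cue words count as complex."""
    cues = ("for each", "more than", "at least", "except", "both", "average", "most")
    q = question.lower()
    return len(linked_tables) > 1 or any(cue in q for cue in cues)


def build_prompt(question: str, schema: str, examples: list[str],
                 linked_tables: list[str]) -> tuple[str, str]:
    """Return (style, prompt), where style is 'cot' or 'got'."""
    style = "got" if is_complex(question, linked_tables) else "cot"
    template = GOT_TEMPLATE if style == "got" else COT_TEMPLATE
    prompt = template.format(
        schema=schema, examples="\n".join(examples), question=question
    )
    return style, prompt
```

In practice the routing signal could just as well come from the schema-linking step; the point of the sketch is only that the style decision is made before generation.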

Problem Statement

Text-to-SQL needs accurate schema grounding and reasoning, but deploying high-performing systems in constrained environments is hard: large closed models are costly and opaque, and small open models lack robust schema linking and multi-step reasoning.

Main Contribution

A modular pipeline that separates schema filtering, retrieval-augmented example prompting, schema linking, and final SQL generation.

A supervised fine-tuned Qwen model (reported as Qwen3B / Qwen-7B variants) used as a fast schema filter to select top-3 tables and top-3 columns per table, reducing prompt length.
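The top-3 pruning step can be sketched as follows, assuming the fine-tuned filter yields a relevance score per table and per column. The score dictionaries and the `prune_schema` name are hypothetical; how the filter model actually emits its selections is not described here.

```python
def prune_schema(
    table_scores: dict[str, float],
    column_scores: dict[str, dict[str, float]],
    k_tables: int = 3,
    k_columns: int = 3,
) -> dict[str, list[str]]:
    """Keep the k_tables best tables and, within each, the k_columns best columns."""
    kept_tables = sorted(table_scores, key=table_scores.get, reverse=True)[:k_tables]
    pruned = {}
    for table in kept_tables:
        cols = column_scores.get(table, {})
        pruned[table] = sorted(cols, key=cols.get, reverse=True)[:k_columns]
    return pruned


# Illustrative scores only (not taken from the paper).
tables = {"singer": 0.9, "concert": 0.8, "stadium": 0.4, "album": 0.1}
columns = {
    "singer": {"name": 0.9, "age": 0.2, "country": 0.5, "id": 0.7},
    "concert": {"year": 0.6},
    "album": {"title": 0.9},
}
pruned = prune_schema(tables, columns)
```

With these scores the `album` table is dropped and `singer` keeps only `name`, `id`, and `country`, which is what shortens the downstream prompt.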

Key Findings

AP-SQL gives consistent EX and TS gains on Spider across evaluated LLMs.

Numbers: GPT-4o, EX 89.7% vs E-SQL 88.6% (+1.1 points); TS 82.6% vs 79.4% (+3.2 points)

Practical Use: If you must run Text-to-SQL with an LLM, AP-SQL can raise correctness slightly, with the larger gain in robustness as measured by TS.

Evidence Ref: Table 1 (Results section)

Smaller models also benefit, though gains are smaller.

Numbers: Qwen-7B, EX 68.3% vs E-SQL 67.8% (+0.5 points); TS 60.8% vs 60.4% (+0.4 points)

Practical Use: Teams using mid-size open models can get modest accuracy gains without moving to much larger, costlier models.

Evidence Ref: Table 1 (Results section)

Results

Metric	Value	Baseline	Delta (points)	Split / Dataset	Evidence	Evidence Ref
Execution Accuracy (EX)	68.3%	E-SQL Qwen-7B 67.8%	+0.5	Spider (eval set)	Table 1, Qwen-7B row	—
Test Suite (TS)	60.8%	E-SQL Qwen-7B 60.4%	+0.4	Spider (eval set)	Table 1, Qwen-7B row	—
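For readers unfamiliar with the EX metric, the idea is to execute the predicted and the gold SQL and compare the returned rows. The sketch below is a simplified approximation using Python's built-in `sqlite3`; it is not the official Spider evaluation script, and it does not model TS, which checks queries against more than one database instance. The function names are illustrative.

```python
import sqlite3
from collections import Counter


def run_query(conn: sqlite3.Connection, sql: str):
    """Execute a query; return rows as an order-insensitive multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection, pairs) -> float:
    """pairs: iterable of (predicted_sql, gold_sql). Returns the fraction that match."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        # A prediction counts only if the gold query runs and both result sets agree.
        if gold_rows is not None and predicted_rows == gold_rows:
            hits += 1
    return hits / len(pairs)
```

Comparing multisets ignores row order, which is a deliberate simplification; queries with `ORDER BY` would need an order-sensitive comparison.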

What To Try In 7 Days

Build a simple schema filter: fine-tune a small Qwen model on a few thousand annotated question-schema pairs and test top-3 table/column pruning.

Create a small NL–SQL example library and implement Top-K=3 retrieval to prepend to prompts.

Prototype CoT prompts for simple queries and a graph-style prompt for multi-table examples and compare EX/TS on a validation subset of Spider-like queries.
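As a starting point for the example-retrieval step, here is a minimal Python sketch of Top-K=3 retrieval over a small NL–SQL library. It uses Jaccard token overlap as a stand-in similarity; the summary does not say which retriever AP-SQL uses, so the scoring function, the helper names, and the sample library are all assumptions. An embedding-based retriever could replace `jaccard` without changing the rest.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_top_k(question: str, library: list[tuple[str, str]], k: int = 3):
    """Return the k (question, sql) pairs whose questions best overlap the input."""
    q = tokens(question)
    ranked = sorted(library, key=lambda pair: jaccard(q, tokens(pair[0])), reverse=True)
    return ranked[:k]


def format_examples(pairs: list[tuple[str, str]]) -> str:
    """Render retrieved pairs as a block to prepend to the prompt."""
    return "\n\n".join(f"Question: {nl}\nSQL: {sql}" for nl, sql in pairs)


# Illustrative library; real entries would come from your own annotated data.
library = [
    ("How many singers do we have?", "SELECT count(*) FROM singer"),
    ("List the names of all stadiums", "SELECT name FROM stadium"),
    ("What is the average age of singers?", "SELECT avg(age) FROM singer"),
    ("Show concert years", "SELECT year FROM concert"),
]
top = retrieve_top_k("How many singers are there?", library)
prompt_prefix = format_examples(top)
```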

Optimization Features

Token Efficiency

Context Compression, Token Budgeting

Infra Optimization

fits on 2x4090 GPUs

Model Optimization

efficient_finetuning

System Optimization

RAG with Top-K examples

Training Optimization

supervised_finetuning_small_model

Inference Optimization

decoupled_schema_linking, prompt_pruning

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

Spider dataset (public benchmark)

Risks & Boundaries

Limitations

Evaluation only on Spider; no cross-dataset generalization shown.

No public code or configuration links provided to reproduce results exactly.

When Not To Use

If you can run very large closed models directly and cost is not a concern.

If you need an end-to-end learned parser trained on paired NL–SQL data for a specific production schema, without prompt engineering.

Failure Modes

The schema filter can miss relevant tables or columns, which breaks the final SQL.

Retrieved examples can mislead the model if the example library is low quality.
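One way to watch for the first failure mode is to measure how often the pruned schema still covers the tables the gold SQL needs. The sketch below assumes you already have, for each validation question, the set of tables kept by the filter and the set used by the gold query; both helper names are hypothetical.

```python
def filter_recall(kept: set[str], gold: set[str]) -> float:
    """Share of gold tables that survived pruning (1.0 if the gold query uses none)."""
    return len(kept & gold) / len(gold) if gold else 1.0


def coverage_report(samples) -> tuple[float, float]:
    """samples: list of (kept_tables, gold_tables).

    Returns (mean recall, share of questions whose gold tables were all kept).
    """
    recalls = [filter_recall(set(kept), set(gold)) for kept, gold in samples]
    if not recalls:
        return 0.0, 0.0
    fully_covered = sum(1 for r in recalls if r == 1.0)
    return sum(recalls) / len(recalls), fully_covered / len(recalls)
```

The second number matters most here: a question with any missing gold table is unlikely to yield correct SQL, no matter how good the generation prompt is.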

Core Entities

Models

Qwen-7B, Qwen3B (fine-tuned filter), Llama-8B, GPT-4o-mini, GPT-4o

Metrics

Execution Accuracy (EX), Test Suite (TS) accuracy

Datasets

Spider

Benchmarks

Spider

Context Entities

Models

E-SQL, ACT-SQL, C3-SQL, DIN-SQL
