AP-SQL: combine a small fine-tuned schema filter, example retrieval, and thought-style prompts to run Text-to-SQL with lower cost

June 4, 20256 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Zetong Tang, Qian Ma, Di Wu

Links

Abstract / PDF

Why It Matters For Business

AP-SQL offers a practical way to run reliable Text-to-SQL with smaller models and lower inference cost by pruning schema context, reusing examples, and using structured prompts.

Summary TLDR

AP-SQL is a modular Text-to-SQL pipeline that targets low-resource settings. It fine-tunes a small Qwen model to filter schemas, retrieves Top-K example NL–SQL pairs (K=3) for in-context help, then uses prompt-driven schema linking and two prompt styles — Chain-of-Thought for simple queries and Graph-of-Thought for complex queries — to generate SQL. On the Spider benchmark AP-SQL yields small but consistent gains in Execution Accuracy (EX) and Test Suite (TS) over prior prompt-based systems across several LLMs. The system reduces prompt size by keeping only top-3 tables and top-3 columns per table and decouples schema linking from generation to lower inference cost.

Problem Statement

Text-to-SQL needs accurate schema grounding and reasoning, but deploying high-performing systems in constrained environments is hard: large closed models are costly and opaque, and small open models lack robust schema linking and multi-step reasoning.

Main Contribution

A modular pipeline that separates schema filtering, retrieval-augmented example prompting, schema linking, and final SQL generation.

A supervised fine-tuned Qwen model (reported as Qwen3B / Qwen-7B variants) used as a fast schema filter to select top-3 tables and top-3 columns per table, reducing prompt length.

A retrieval module that supplies Top-K = 3 NL–SQL examples from an example library to the large model for few-shot reasoning.

Custom prompt templates: Chain-of-Thought (CoT) for single-table/simple queries and Graph-of-Thought (GoT) for complex multi-table queries.

A schema-link scoring step (scale 1–10) with thresholding (reported threshold = 6) and a voting step for column selection.

Key Findings

AP-SQL gives consistent EX and TS gains on Spider across evaluated LLMs.

NumbersGPT-4o: EX 89.7% vs E-SQL 88.6% (+1.1); TS 82.6% vs 79.4% (+3.2)

Smaller models also benefit, though gains are smaller.

NumbersQwen-7B: EX 68.3% vs E-SQL 67.8% (+0.5); TS 60.8% vs 60.4% (+0.4)

Schema filtering was fine-tuned on ≈80,000 question-schema pairs using 2×4090 GPUs.

NumbersDataset ≈80k pairs; training: 10 epochs, lr=5e-5, batch=16

Prompt and retrieval design choices are part of the performance gains.

NumbersRAG uses K=3; schema link threshold set to 6; top-3 tables and top-3 columns kept

Results

Accuracy

Value68.3%

BaselineE-SQL Qwen-7B 67.8%

Accuracy

Value60.8%

BaselineE-SQL Qwen-7B 60.4%

Accuracy

Value72.4%

BaselineE-SQL Llama-8B 70.2%

Accuracy

Value64.1%

BaselineE-SQL Llama-8B 63.3%

Accuracy

Value83.2%

BaselineE-SQL GPT-4o-mini 82.6%

Accuracy

Value75.8%

BaselineE-SQL GPT-4o-mini 72.4%

Accuracy

Value89.7%

BaselineE-SQL GPT-4o 88.6%

Accuracy

Value82.6%

BaselineE-SQL GPT-4o 79.4%

Who Should Care

What To Try In 7 Days

Build a simple schema filter: fine-tune a small Qwen model on a few thousand annotated question-schema pairs and test top-3 table/column pruning.

Create a small NL–SQL example library and implement Top-K=3 retrieval to prepend to prompts.

Prototype CoT prompts for simple queries and a graph-style prompt for multi-table examples and compare EX/TS on a validation subset of Spider-like queries.

Optimization Features

Token Efficiency

  • Context Compression
  • Token Budgeting

Infra Optimization

  • fits on 2x4090 GPUs

Model Optimization

  • efficient_finetuning

System Optimization

  • RAG with Top-K examples

Training Optimization

  • supervised_finetuning_small_model

Inference Optimization

  • decoupled_schema_linking
  • prompt_pruning

Reproducibility

Data Urls

  • Spider dataset (public benchmark)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation only on Spider; no cross-dataset generalization shown.
  • No public code or configuration links provided to reproduce results exactly.
  • Paper lacks detailed ablation results for each module (e.g., filter vs. CoT vs. GoT).

When Not To Use

  • If you can run very large closed models directly and cost is not a concern.
  • If you need end-to-end learned parsers trained on paired SQL for specific production schema without prompt engineering.

Failure Modes

  • Schema filter misses relevant tables/columns and breaks final SQL.
  • Retrieved examples can mislead the model if example library is low quality.
  • Prompt-based generation can still hallucinate or produce semantically incorrect joins.

Core Entities

Models

  • Qwen-7B
  • Qwen3B (fine-tuned filter)
  • Llama-8B
  • GPT-4o-mini
  • GPT-4o

Metrics

  • Accuracy

Datasets

  • Spider

Benchmarks

  • Spider

Context Entities

Models

  • E-SQL
  • ACT-SQL
  • C3-SQL
  • DIN-SQL