Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.75
Citation Count
5
Why It Matters For Business
CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.
Summary TLDR
The authors build CodeS, an open-source family of code-focused language models (1B, 3B, 7B, 15B) pre-trained with a 21.5GB SQL-centric corpus and tuned for text-to-SQL. They combine incremental pre-training, a schema-filter + BM25-assisted value retriever prompt, and a bi-directional data-augmentation pipeline to adapt to new databases. CodeS matches or exceeds many closed-source LLM baselines on Spider, BIRD and robustness variants while being 10–100x smaller, runs in ~1.1s for the 7B model, and is released with code and data.
Problem Statement
Closed-source LLMs (GPT-4, ChatGPT) lead in text-to-SQL accuracy but pose privacy, cost, and customization limits. Smaller open models lack SQL-focused data and struggle with schema linking and cross-domain adaptation. The paper asks: can a much smaller open model reach SOTA text-to-SQL performance and stay practical to deploy?
Main Contribution
CodeS: an open-source family of models (1B/3B/7B/15B) pre-trained from StarCoder with SQL-focused data.
Incremental pre-training on a curated 21.5GB corpus (11GB SQL, 6GB NL-to-code, 4.5GB NL) to boost SQL generation.
A practical database-prompt pipeline: schema filter, BM25 value retriever, and metadata inclusion.
A bi-directional data augmentation method (question→SQL and SQL→question) for rapid new-domain adaptation with few annotations.
Extensive evaluation showing CodeS achieves new SOTA among open models and strong results versus closed LLMs on multiple benchmarks.
Key Findings
Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.
Fine-tuned CodeS achieves top performance on standard text-to-SQL benchmarks with much smaller models.
CodeS shows major gains on the harder BIRD benchmark and beats prior prompting methods.
Prompt construction (schema filter + BM25 + LCS) improves schema linking speed and accuracy for large databases.
CodeS runs far faster than API-based GPT-4 prompting and fits local GPUs at reasonable sizes.
Results
Spider dev (few-shot, 5-shot)
SFT
SFT
Inference latency
Real-world adaptation (Bank-Financials)
Who Should Care
What To Try In 7 Days
Run CodeS-7B locally on a sample DB to compare latency and accuracy vs your current API-based pipeline.
Apply the schema filter + BM25 value retriever to your prompt pipeline to reduce input size and speed up queries.
Generate a small set (20–30) of real user questions and use the bi-directional augmentation to produce training pairs for quick fine-tuning.
Optimization Features
Token Efficiency
- schema filtering to reduce prompt tokens
Infra Optimization
- practical VRAM targets: 7B≈20GB, 15B≈35GB float16
Model Optimization
- incremental pre-training on domain-specific data
System Optimization
- FlashAttention-2 for long contexts
Training Optimization
- mixed-data epochs: more SQL data, fewer NL/code epochs
- AdamW, cosine lr decay, DeepSpeed ZeRO-3
Inference Optimization
- smaller model sizes (1B–15B) for faster latency
- use float16 for deployment
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- CodeS-15B shows signs of slight overfitting to Spider dev versus 7B (Section 9.3).
- Running the largest model needs 35GB GPU in float16; multi-model fine-tuning per database is costly.
- BIRD performance still depends on external knowledge in some cases; realistic external-knowledge supply can be hard.
When Not To Use
- When you lack GPU memory to host at least the 7B float16 model (≈20GB).
- When you require solutions that rely on extremely large context windows beyond current model limits.
- When you need a zero-shot solution without any domain examples and cannot tolerate prompt engineering.
Failure Modes
- Schema-linking mistakes when schema is highly ambiguous despite comments.
- DBcontent-equivalence perturbations can reduce accuracy (noted in Dr.Spider DB perturbations).
- Value matching failures if representative values are missing or retrieval index is incomplete.
- Overfitting during fine-tuning on small domain-specific datasets if augmentation is low quality.
Core Entities
Models
- CodeS-1B
- CodeS-3B
- CodeS-7B
- CodeS-15B
- StarCoder
- StarCoderBase
Metrics
- Accuracy
- Valid efficiency score (VES)
- Human evaluation (HE)
Datasets
- Spider
- BIRD
- Spider-DK
- Spider-Syn
- Spider-Realistic
- Dr.Spider
- Bank-Financials
- Aminer-Simplified
- NL-SQL-458K
Benchmarks
- Spider
- BIRD
- Dr.Spider

