Overview
The paper supplies concrete evaluation across multiple public benchmarks, ablations, and a released codebase, which supports deployment decisions; caveats include GPU memory needs and some overfitting for the largest model.
Citations: 5
Evidence Strength: 0.90
Confidence: 0.88
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 75%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.
Summary TLDR
The authors build CodeS, an open-source family of code-focused language models (1B, 3B, 7B, 15B) incrementally pre-trained on a 21.5GB SQL-centric corpus and tuned for text-to-SQL. They combine this pre-training with a prompt built from a schema filter and a BM25-assisted value retriever, plus a bi-directional data-augmentation pipeline for adapting to new databases. CodeS matches or exceeds many closed-source LLM baselines on Spider, BIRD, and robustness variants while being 10–100x smaller. The 7B model has an inference latency of about 1.1s, and the code and data are released.
Problem Statement
Closed-source LLMs (GPT-4, ChatGPT) lead in text-to-SQL accuracy but raise privacy and cost concerns and are hard to customize. Smaller open models lack SQL-focused training data and struggle with schema linking and cross-domain adaptation. The paper asks: can a much smaller open model reach SOTA text-to-SQL performance and stay practical to deploy?
Main Contribution
CodeS: an open-source family of models (1B/3B/7B/15B) pre-trained from StarCoder with SQL-focused data.
Incremental pre-training on a curated 21.5GB corpus (11GB SQL, 6GB NL-to-code, 4.5GB NL) to boost SQL generation.
Key Findings
Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.
Fine-tuned CodeS achieves top performance on standard text-to-SQL benchmarks with much smaller models.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Test-suite accuracy (TS), 5-shot | CodeS-15B: 73.4% | StarCoder-15B: 70.0% | +3.4 pts TS | Spider dev | Few-shot results | Table 4 |
| Execution accuracy (EX), supervised fine-tuning (SFT) | SFT CodeS-7B: EX 85.4% (TS 80.3%) | Fine-tuned SQL-PaLM: EX 82.8% | +2.6 pts EX | Spider dev | Supervised fine-tuning results | Table 5 |
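As a rough illustration of what an execution-accuracy (EX) number measures, the sketch below runs a predicted query and a gold query against the same database and counts a hit when both return the same rows. It is a simplified stand-in, not the official Spider evaluation script: the toy schema, the query pairs, and the function names are all hypothetical, and row order is ignored by comparing results as multisets.

```python
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)

def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries execute and return the same multiset of rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return hits / len(pairs)

# Toy in-memory database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 45), (3, 'Cy', 27);
""")

pairs = [
    # Different SQL text, same result rows: counts as a match.
    ("SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 31"),
    # Both execute, but the results differ: no match.
    ("SELECT count(*) FROM singer",
     "SELECT count(*) FROM singer WHERE age < 40"),
    # The predicted query fails to execute: no match.
    ("SELECT nme FROM singer", "SELECT name FROM singer"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"EX = {accuracy:.3f}")
```

Comparing on a single database instance can credit a wrong query that happens to return the right rows, which is why a stricter test-suite style check is usually reported alongside EX.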
What To Try In 7 Days
Run CodeS-7B locally on a sample DB to compare latency and accuracy vs your current API-based pipeline.
Apply the schema filter + BM25 value retriever to your prompt pipeline to reduce input size and speed up queries.
Generate a small set (20–30) of real user questions and use the bi-directional augmentation to produce training pairs for quick fine-tuning.
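To get a feel for the value-retrieval step before wiring in the released code, a minimal BM25 ranker over database values can be sketched in plain Python. This is not the authors' implementation: the `BM25Index` class, the `table.column: value` string format, the tokenizer, and the `k1`/`b` defaults are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25Index:
    """Tiny Okapi BM25 index over short strings (hypothetical helper)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in documents]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        total_len = sum(len(t) for t in self.doc_tokens)
        self.avg_len = total_len / max(len(documents), 1)
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        n = len(documents)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5))
            for term, f in df.items()
        }

    def score(self, query_tokens, i):
        """BM25 score of document i for the given query tokens."""
        freqs = self.doc_freqs[i]
        length = len(self.doc_tokens[i])
        total = 0.0
        for term in query_tokens:
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        """Return up to k documents with a positive score, best first."""
        q = tokenize(query)
        scored = [(self.score(q, i), doc) for i, doc in enumerate(self.documents)]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [doc for _, doc in scored[:k]]

# Hypothetical database values, flattened to "table.column: value" strings.
values = [
    "singer.country: United States",
    "singer.country: France",
    "singer.name: Joe Sharp",
    "concert.venue: Stadium of Light",
]
index = BM25Index(values)
matches = index.top_k("How many singers are from France?", k=2)
print(matches)
```

Only the retrieved values, rather than whole column contents, would then be placed in the prompt, which is where the input-size saving comes from.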
Risks & Boundaries
Limitations
CodeS-15B appears to overfit Spider dev slightly relative to the 7B model (Section 9.3).
Running the largest model needs about 35GB of GPU memory in float16; fine-tuning a separate model per database is costly.
When Not To Use
When you lack the GPU memory (≈20GB) to host at least the 7B model in float16.
When you require solutions that rely on extremely large context windows beyond current model limits.
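For a quick sanity check on the memory figures above, float16 stores each parameter in 2 bytes, so the weights alone give a lower bound on GPU memory. The sketch below compares that bound with the requirements quoted in this summary. The nominal parameter counts (7B, 15B) are rounded, and attributing the remaining gap to activations, the KV cache, and runtime overhead is an assumption rather than something the source states.

```python
BYTES_PER_PARAM_FP16 = 2

def fp16_weight_gb(params_billion):
    """Lower bound: memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM_FP16

def can_host(gpu_memory_gb, required_gb):
    """True when the GPU has at least the required memory."""
    return gpu_memory_gb >= required_gb

# GPU memory requirements quoted in this summary (GB, float16),
# keyed by nominal model size in billions of parameters.
reported_gb = {7: 20, 15: 35}

for params, reported in reported_gb.items():
    weights = fp16_weight_gb(params)
    print(
        f"{params}B: weights ~{weights} GB, "
        f"reported ~{reported} GB, gap ~{reported - weights} GB"
    )

# Example: a 24GB card clears the quoted 7B figure but not the 15B one.
print(can_host(24, reported_gb[7]), can_host(24, reported_gb[15]))
```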
Failure Modes
Schema-linking mistakes when the schema is highly ambiguous, even with comments.
DBcontent-equivalence perturbations can reduce accuracy (noted in Dr.Spider DB perturbations).

