CodeS: open-source 1B–15B models that match or beat much larger LLMs on text-to-SQL benchmarks

February 26, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper supplies concrete evaluation across multiple public benchmarks, ablations, and a released codebase, which supports deployment decisions; caveats include GPU memory needs and some overfitting for the largest model.

Citations: 5

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.

Who Should Care

Summary TLDR

The authors build CodeS, an open-source family of code-focused language models (1B, 3B, 7B, 15B) pre-trained on a 21.5GB SQL-centric corpus and tuned for text-to-SQL. They combine incremental pre-training, a prompt built with a schema filter and a BM25-assisted value retriever, and a bi-directional data-augmentation pipeline for adapting to new databases. CodeS matches or exceeds many closed-source LLM baselines on Spider, BIRD, and robustness variants while being 10–100x smaller; the 7B model runs in about 1.1s, and the code and data are released.

Problem Statement

Closed-source LLMs (GPT-4, ChatGPT) lead in text-to-SQL accuracy but pose privacy, cost, and customization limits. Smaller open models lack SQL-focused data and struggle with schema linking and cross-domain adaptation. The paper asks: can a much smaller open model reach SOTA text-to-SQL performance and stay practical to deploy?

Main Contribution

CodeS: an open-source family of models (1B/3B/7B/15B) pre-trained from StarCoder with SQL-focused data.

Incremental pre-training on a curated 21.5GB corpus (11GB SQL, 6GB NL-to-code, 4.5GB NL) to boost SQL generation.
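As a quick sanity check on the corpus figures above, the three components sum to the stated 21.5GB, and SQL makes up roughly half of the total. The short Python sketch below only restates that arithmetic; the dictionary keys are illustrative labels, not names from the paper.

```python
# Corpus component sizes in GB, as listed in the contribution above.
corpus_gb = {"sql": 11.0, "nl_to_code": 6.0, "nl": 4.5}

total_gb = sum(corpus_gb.values())

# Share of each component as a percentage of the whole corpus.
shares = {name: round(100 * size / total_gb, 1) for name, size in corpus_gb.items()}

print(total_gb)  # 21.5
print(shares)    # {'sql': 51.2, 'nl_to_code': 27.9, 'nl': 20.9}
```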

Key Findings

Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.

Numbers: CodeS-15B 5-shot Spider TS 73.4% vs StarCoder-15B 70.0% (Table 4)

Practical Use: If you build a code LLM for text-to-SQL, add targeted SQL and NL-to-code data; small models get bigger gains than large ones.

Evidence Ref: Table 4 few-shot results

Fine-tuned CodeS achieves top performance on standard text-to-SQL benchmarks with much smaller models.

Numbers: SFT CodeS-7B EX 85.4% on Spider dev (SOTA among compared methods) (Table 5)

Practical Use: Fine-tuning a 7B open model on benchmark data can match or beat larger closed LLM pipelines while costing less to run.

Evidence Ref: Table 5 supervised fine-tuning

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Spider dev (few-shot, 5-shot) | CodeS-15B TS 73.4% | StarCoder-15B TS 70.0% | +3.4 points | Spider dev | Table 4 few-shot results | Table 4
SFT | SFT CodeS-7B EX 85.4%, TS 80.3% | Fine-tuned SQL-PaLM EX 82.8% | +2.6 points EX vs SQL-PaLM | Spider dev | Table 5 supervised fine-tuning | Table 5
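The deltas in the results above are absolute differences in accuracy (percentage points), not relative improvements. The Python sketch below recomputes both views from the reported scores; the function names are illustrative only.

```python
def delta_points(value_pct, baseline_pct):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)

def relative_gain_pct(value_pct, baseline_pct):
    """Improvement relative to the baseline score, in percent."""
    return round(100 * (value_pct - baseline_pct) / baseline_pct, 1)

# Few-shot: CodeS-15B TS vs StarCoder-15B TS on Spider dev (Table 4).
print(delta_points(73.4, 70.0), relative_gain_pct(73.4, 70.0))  # 3.4 4.9

# SFT: CodeS-7B EX vs fine-tuned SQL-PaLM EX on Spider dev (Table 5).
print(delta_points(85.4, 82.8), relative_gain_pct(85.4, 82.8))  # 2.6 3.1
```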

What To Try In 7 Days

Run CodeS-7B locally on a sample DB to compare latency and accuracy vs your current API-based pipeline.

Apply the schema filter + BM25 value retriever to your prompt pipeline to reduce input size and speed up queries.

Collect a small set (20–30) of real user questions and use the bi-directional augmentation to produce training pairs for quick fine-tuning.
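For the value-retriever step in the list above, a rough idea of BM25 matching between a question and stored database values can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CodeS implementation: the class name `BM25ValueIndex`, the `(column, value)` entry format, the tokenizer, and the `k1`/`b` defaults are all hypothetical choices, and this summary does not say which retrieval steps the released pipeline applies beyond BM25.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase alphanumeric tokens; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25ValueIndex:
    """Tiny BM25 index over (column, value) pairs (hypothetical helper)."""

    def __init__(self, entries, k1=1.5, b=0.75):
        self.entries = entries
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(value) for _, value in entries]
        total_len = sum(len(toks) for toks in self.doc_tokens)
        self.avg_len = (total_len / max(len(self.doc_tokens), 1)) or 1.0
        # Document frequency: number of values containing each token.
        doc_freq = Counter()
        for toks in self.doc_tokens:
            doc_freq.update(set(toks))
        n = len(entries)
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query_tokens, i):
        tf = Counter(self.doc_tokens[i])
        # Length normalization term of the BM25 formula.
        norm = self.k1 * (1 - self.b + self.b * len(self.doc_tokens[i]) / self.avg_len)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in query_tokens if t in tf)

    def top_k(self, question, k=3):
        q = tokenize(question)
        scored = [(self.score(q, i), entry) for i, entry in enumerate(self.entries)]
        hits = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
        return [entry for _, entry in hits[:k]]

# Made-up sample values for illustration.
index = BM25ValueIndex([
    ("singer.country", "France"),
    ("singer.country", "United States"),
    ("singer.name", "Joe Sharp"),
    ("concert.theme", "Free choice"),
])
matches = index.top_k("How many singers are from France?")
print(matches)  # [('singer.country', 'France')]
```

Only values that share at least one token with the question are returned, so the prompt carries a handful of candidate values instead of whole column contents.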

Optimization Features

Token Efficiency
schema filtering to reduce prompt tokens
Infra Optimization
practical VRAM targets: 7B≈20GB, 15B≈35GB float16
Model Optimization
incremental pre-training on domain-specific data
System Optimization
FlashAttention-2 for long contexts
Training Optimization
mixed-data epochs: more SQL data, fewer NL/code epochs
AdamW, cosine lr decay, DeepSpeed ZeRO-3
Inference Optimization
smaller model sizes (1B–15B) for faster latency
use float16 for deployment
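To make the schema-filtering item under Token Efficiency concrete, here is a small Python sketch that keeps only the tables and columns most related to a question before building a prompt. This summary does not describe how CodeS's filter scores relevance, so the sketch substitutes plain token overlap as a stand-in; `filter_schema`, its parameters, and the sample schema are hypothetical.

```python
import re

def tokens(text):
    # Lowercase alphanumeric tokens; underscores act as separators.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def filter_schema(question, schema, top_tables=1, top_columns=3):
    """Keep the tables and columns that share the most tokens with the question.

    Token overlap is an assumed stand-in for a real relevance scorer.
    """
    q = tokens(question)

    def overlap(name):
        return len(q & tokens(name))

    def table_score(table):
        return overlap(table) + sum(overlap(col) for col in schema[table])

    # sorted() is stable, so ties keep their original schema order.
    ranked = sorted(schema, key=table_score, reverse=True)[:top_tables]
    return {
        table: sorted(schema[table], key=overlap, reverse=True)[:top_columns]
        for table in ranked
    }

# Made-up sample schema for illustration.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "theme", "year"],
    "stadium": ["stadium_id", "name", "capacity"],
}
question = "What is the name and country of each singer?"
filtered = filter_schema(question, schema)
print(filtered)  # {'singer': ['singer_id', 'name', 'country']}
```

Dropping two of three tables here shrinks the schema portion of the prompt accordingly; a real filter would need a stronger relevance signal than exact token matches.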

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

CodeS-15B shows signs of slight overfitting to Spider dev versus 7B (Section 9.3).

Running the largest model needs about 35GB of GPU memory in float16; fine-tuning a separate model for each database is costly.
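The float16 memory figures quoted in this summary can be put in context with a back-of-the-envelope estimate: float16 stores each parameter in 2 bytes, so the weights alone account for most of the footprint. The Python sketch below assumes nominal parameter counts of 7 and 15 billion and decimal gigabytes; attributing the remainder to activations, KV cache, and runtime overhead is an assumption, not something the summary states.

```python
BYTES_PER_FP16_PARAM = 2  # float16 uses 16 bits per value

def fp16_weight_gb(params_billion):
    """Approximate weight memory in decimal GB for a float16 model."""
    return params_billion * 1e9 * BYTES_PER_FP16_PARAM / 1e9

# (nominal parameters in billions, VRAM target in GB quoted in this summary)
reported = {"CodeS-7B": (7, 20), "CodeS-15B": (15, 35)}

for name, (params_b, vram_gb) in reported.items():
    weights = fp16_weight_gb(params_b)
    # Remainder is assumed to cover activations, KV cache, and runtime overhead.
    print(f"{name}: weights ~{weights:.0f} GB, headroom ~{vram_gb - weights:.0f} GB")
# CodeS-7B: weights ~14 GB, headroom ~6 GB
# CodeS-15B: weights ~30 GB, headroom ~5 GB
```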

When Not To Use

When you lack GPU memory to host at least the 7B float16 model (≈20GB).

When you require solutions that rely on extremely large context windows beyond current model limits.

Failure Modes

Schema-linking mistakes when schema is highly ambiguous despite comments.

DBcontent-equivalence perturbations can reduce accuracy (noted in Dr.Spider DB perturbations).

Core Entities

Models

CodeS-1B, CodeS-3B, CodeS-7B, CodeS-15B, StarCoder, StarCoderBase

Metrics

Accuracy, Valid efficiency score (VES), Human evaluation (HE)

Datasets

Spider, BIRD, Spider-DK, Spider-Syn, Spider-Realistic, Dr.Spider, Bank-Financials, Aminer-Simplified, NL-SQL-458K

Benchmarks

Spider, BIRD, Dr.Spider