CodeS: open-source 1B–15B models that match or beat much larger LLMs on text-to-SQL benchmarks

Overview

Decision Snapshot: Ready For Pilot

The paper supplies concrete evaluation across multiple public benchmarks, ablations, and a released codebase, which supports deployment decisions; caveats include GPU memory needs and some overfitting for the largest model.

Citations: 5

Evidence Strength: 0.90

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 75%

Production readiness: 80%

Novelty: 60%

Authors

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.

Who Should Care

Product Manager, ML Engineer, Founder, Data Scientist

Summary TLDR

The authors build CodeS, an open-source family of code-focused language models (1B, 3B, 7B, 15B) incrementally pre-trained on a 21.5GB SQL-centric corpus and tuned for text-to-SQL. They combine incremental pre-training, a prompt built with a schema filter and a BM25-assisted value retriever, and a bi-directional data-augmentation pipeline for adapting to new databases. CodeS matches or exceeds many closed-source LLM baselines on Spider, BIRD and robustness variants while being 10–100x smaller, has a reported latency of ~1.1s for the 7B model, and is released with code and data.

Problem Statement

Closed-source LLMs (GPT-4, ChatGPT) lead in text-to-SQL accuracy but pose privacy, cost, and customization limits. Smaller open models lack SQL-focused data and struggle with schema linking and cross-domain adaptation. The paper asks: can a much smaller open model reach SOTA text-to-SQL performance and stay practical to deploy?

Main Contribution

CodeS: an open-source family of models (1B/3B/7B/15B) pre-trained from StarCoder with SQL-focused data.

Incremental pre-training on a curated 21.5GB corpus (11GB SQL, 6GB NL-to-code, 4.5GB NL) to boost SQL generation.
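As a quick check on the corpus figures above, the three slices sum to the stated 21.5GB, with SQL data making up roughly half of it. The sketch below only restates the reported sizes; the dictionary keys are illustrative names, and the shares are by raw size, not by how often each slice is seen during training.

```python
# Reported sizes (GB) of the three incremental pre-training corpus slices.
corpus_gb = {"sql": 11.0, "nl_to_code": 6.0, "nl": 4.5}

total_gb = sum(corpus_gb.values())
# Share of each slice by raw size (not by training epochs).
shares = {name: size / total_gb for name, size in corpus_gb.items()}

print(f"total: {total_gb} GB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```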

Key Findings

Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.

Numbers: CodeS-15B 5-shot Spider TS 73.4% vs StarCoder-15B 70.0% (Table 4)

Practical Use: If you build a code LLM for text-to-SQL, add targeted SQL and NL-to-code data; small models get bigger gains than large ones.

Evidence Ref: Table 4 few-shot results

Fine-tuned CodeS achieves top performance on standard text-to-SQL benchmarks with much smaller models.

Numbers: SFT CodeS-7B EX 85.4% on Spider dev (SOTA among compared methods) (Table 5)

Practical Use: Fine-tuning a 7B open model on benchmark data can match or beat larger closed LLM pipelines while costing less to run.

Evidence Ref: Table 5 supervised fine-tuning

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Spider dev (few-shot, 5-shot)	CodeS-15B TS 73.4%	StarCoder-15B TS 70.0%	+3.4 pts TS	Spider dev	Table 4 few-shot results	Table 4
Spider dev (supervised fine-tuning)	SFT CodeS-7B EX 85.4%, TS 80.3%	Fine-tuned SQL-PaLM EX 82.8%	+2.6 pts EX vs SQL-PaLM	Spider dev	Table 5 supervised fine-tuning	Table 5
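The EX figures above are execution accuracy: a prediction counts as correct when running it returns the same result as the gold query. A minimal sketch of that idea using Python's built-in sqlite3 follows. It is a simplified illustration, not the official Spider evaluator; the toy `singer` table and the query pairs are made up for the example.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Order-insensitive comparison; real evaluators handle ORDER BY more carefully.
    return sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows))

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'Ann', 'France'), (2, 'Bo', 'Japan'), (3, 'Cy', 'France');
""")

pairs = [
    ("SELECT name FROM singer WHERE country = 'France'",
     "SELECT name FROM singer WHERE country = 'France' ORDER BY id"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(*) FROM singer"),
    ("SELECT name FROM singr", "SELECT name FROM singer"),  # misspelled table: fails
    ("SELECT name FROM singer WHERE country = 'Japan'",
     "SELECT name FROM singer WHERE country = 'France'"),  # runs, wrong rows
]
ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.1%}")
```

Because EX compares results and not SQL strings, differently written queries can both count as correct, which is also why the deltas in the table are differences in percentage points.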

What To Try In 7 Days

Run CodeS-7B locally on a sample DB to compare latency and accuracy vs your current API-based pipeline.

Apply the schema filter + BM25 value retriever to your prompt pipeline to reduce input size and speed up queries.

Generate a small set (20–30) of real user questions and use the bi-directional augmentation to produce training pairs for quick fine-tuning.
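To make the schema filter and BM25 value retriever step concrete, here is a minimal pure-Python sketch. The BM25 scoring follows the standard Okapi formula; the keyword-overlap schema filter is a deliberately simple stand-in, since this summary does not describe how CodeS scores schema relevance. All function names, the toy schema, and the cell values are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (underscores act as separators)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Rank documents against the query with Okapi BM25; returns (score, doc) pairs."""
    if not documents:
        return []
    doc_tokens = [tokenize(doc) for doc in documents]
    avg_len = sum(len(tokens) for tokens in doc_tokens) / len(doc_tokens)
    doc_freq = Counter(term for tokens in doc_tokens for term in set(tokens))
    query_terms = set(tokenize(query))
    scored = []
    for doc, tokens in zip(documents, doc_tokens):
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (len(documents) - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def filter_schema(question, schema, keep_tables=2):
    """Stand-in filter: keep the tables whose names/columns overlap most with the question."""
    question_terms = set(tokenize(question))

    def overlap(item):
        table, columns = item
        return len(question_terms & set(tokenize(table + " " + " ".join(columns))))

    ranked = sorted(schema.items(), key=overlap, reverse=True)
    return dict(ranked[:keep_tables])

# Hypothetical schema and database values for illustration.
schema = {
    "singer": ["singer_id", "name", "country"],
    "concert": ["concert_id", "venue", "year"],
    "ticket": ["ticket_id", "price", "seat"],
}
cell_values = ["France", "Japan", "United States", "French Polynesia"]
question = "Which singer from France performed at a concert in 2014?"

kept = filter_schema(question, schema)
top_values = bm25_rank(question, cell_values)[:2]
print("tables kept:", list(kept))
print("best value match:", top_values[0][1])
```

Only the kept tables and the top-ranked values would then go into the prompt, which is where the token savings come from.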

Optimization Features

Token Efficiency

schema filtering to reduce prompt tokens

Infra Optimization

practical VRAM targets in float16: 7B ≈ 20GB, 15B ≈ 35GB
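A rough way to sanity-check these targets: float16 stores each parameter in 2 bytes, so the weights alone take about 14GB for 7B parameters and 30GB for 15B. The rest of each target is headroom for activations, the attention cache, and framework overhead. The sketch assumes the nominal sizes are exact parameter counts and that GB means 10^9 bytes; neither is confirmed by the source.

```python
BYTES_PER_FLOAT16 = 2

def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_FLOAT16):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

# (assumed parameter count, VRAM target in GB quoted above)
targets = {"7B": (7e9, 20), "15B": (15e9, 35)}

for name, (params, target_gb) in targets.items():
    weights_gb = weight_memory_gb(params)
    headroom_gb = target_gb - weights_gb
    print(f"{name}: weights ~{weights_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB of {target_gb} GB")
```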

Model Optimization

incremental pre-training on domain-specific data

System Optimization

FlashAttention-2 for long contexts

Training Optimization

mixed-data epochs: more SQL data, fewer NL/code epochs

AdamW, cosine lr decay, DeepSpeed ZeRO-3
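For readers unfamiliar with cosine learning-rate decay, the schedule lowers the rate from its peak to a floor along half a cosine wave. The sketch below shows the standard formula only; the peak rate, step count, and floor are hypothetical values, since this summary does not report the training hyperparameters.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step` under cosine decay from peak_lr down to min_lr."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical settings for illustration only.
PEAK_LR = 1e-4
TOTAL_STEPS = 1000

for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS, PEAK_LR):.2e}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the end of training.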

Inference Optimization

smaller model sizes (1B–15B) for faster latency

use float16 for deployment
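Latency figures depend heavily on hardware, so it is worth measuring them on your own setup. A minimal timing harness might look like the sketch below. `generate` is any callable that turns a question into SQL; `fake_generate` is a placeholder that sleeps instead of calling a real model, and both names are assumptions made for the example.

```python
import statistics
import time

def measure_latency(generate, questions, warmup=1):
    """Time generate(question) per question and return summary statistics in seconds."""
    for question in questions[:warmup]:
        generate(question)  # warm-up calls are not timed
    timings = []
    for question in questions:
        start = time.perf_counter()
        generate(question)
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }

def fake_generate(question):
    """Placeholder for a real model call; swap in your local model or API client."""
    time.sleep(0.01)
    return "SELECT 1"

stats = measure_latency(fake_generate, ["q1", "q2", "q3"])
print(stats)
```

Running the same harness against a local model and an API-based pipeline with identical questions gives a like-for-like comparison.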

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/RUCKBReasoning/codes

Data URLs

https://github.com/RUCKBReasoning/codes

Risks & Boundaries

Limitations

CodeS-15B shows signs of slight overfitting to Spider dev versus 7B (Section 9.3).

Running the largest model (15B) needs about 35GB of GPU memory in float16; fine-tuning a separate model per database is costly.

When Not To Use

When you lack GPU memory to host at least the 7B float16 model (≈20GB).

When you require solutions that rely on extremely large context windows beyond current model limits.

Failure Modes

Schema-linking mistakes when schema is highly ambiguous despite comments.

DBcontent-equivalence perturbations can reduce accuracy (noted in Dr.Spider DB perturbations).

Core Entities

Models

CodeS-1B, CodeS-3B, CodeS-7B, CodeS-15B, StarCoder, StarCoderBase

Metrics

Accuracy, Valid efficiency score (VES), Human evaluation (HE)

Datasets

Spider, BIRD, Spider-DK, Spider-Syn, Spider-Realistic, Dr.Spider, Bank-Financials, Aminer-Simplified, NL-SQL-458K

Benchmarks

Spider, BIRD, Dr.Spider

