CodeS: open-source 1B–15B models that match or beat much larger LLMs on text-to-SQL benchmarks

February 26, 20248 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.75

Citation Count

5

Authors

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen

Links

Abstract / PDF

Why It Matters For Business

CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.

Summary TLDR

The authors build CodeS, an open-source family of code-focused language models (1B, 3B, 7B, 15B) pre-trained with a 21.5GB SQL-centric corpus and tuned for text-to-SQL. They combine incremental pre-training, a schema-filter + BM25-assisted value retriever prompt, and a bi-directional data-augmentation pipeline to adapt to new databases. CodeS matches or exceeds many closed-source LLM baselines on Spider, BIRD and robustness variants while being 10–100x smaller, runs in ~1.1s for the 7B model, and is released with code and data.

Problem Statement

Closed-source LLMs (GPT-4, ChatGPT) lead in text-to-SQL accuracy but pose privacy, cost, and customization limits. Smaller open models lack SQL-focused data and struggle with schema linking and cross-domain adaptation. The paper asks: can a much smaller open model reach SOTA text-to-SQL performance and stay practical to deploy?

Main Contribution

CodeS: an open-source family of models (1B/3B/7B/15B) pre-trained from StarCoder with SQL-focused data.

Incremental pre-training on a curated 21.5GB corpus (11GB SQL, 6GB NL-to-code, 4.5GB NL) to boost SQL generation.

A practical database-prompt pipeline: schema filter, BM25 value retriever, and metadata inclusion.

A bi-directional data augmentation method (question→SQL and SQL→question) for rapid new-domain adaptation with few annotations.

Extensive evaluation showing CodeS achieves new SOTA among open models and strong results versus closed LLMs on multiple benchmarks.

Key Findings

Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.

NumbersCodeS-15B 5-shot Spider TS 73.4% vs StarCoder-15B 70.0% (Table 4)

Fine-tuned CodeS achieves top performance on standard text-to-SQL benchmarks with much smaller models.

NumbersSFT CodeS-7B EX 85.4% on Spider dev (SOTA among compared methods) (Table 5)

CodeS shows major gains on the harder BIRD benchmark and beats prior prompting methods.

NumbersSFT CodeS-15B EX 52.15% on BIRD test vs ChatGPT+CoT 28.95% (+23.2%) (Table 6)

Prompt construction (schema filter + BM25 + LCS) improves schema linking speed and accuracy for large databases.

NumbersValue retriever reduces LCS calls from millions to hundreds for big DBs (conceptual speedup described in Section 6.2)

CodeS runs far faster than API-based GPT-4 prompting and fits local GPUs at reasonable sizes.

NumbersInference latency: 7B = 1.1s, 15B = 1.5s per sample; float16 VRAM: 7B=20GB, 15B=35GB (Section 9.7)

Results

Spider dev (few-shot, 5-shot)

ValueCodeS-15B TS 73.4%

BaselineStarCoder-15B TS 70.0%

SFT

ValueSFT CodeS-7B EX 85.4%, TS 80.3%

BaselineFine-tuned SQL-PaLM EX 82.8%

SFT

ValueSFT CodeS-15B EX 52.15%, VES 56.99%

BaselineChatGPT+CoT EX 28.95%

Inference latency

ValueCodeS-7B 1.1s / sample; CodeS-15B 1.5s / sample

BaselineDIN-SQL+GPT-4 ~60s / sample (reported prior work)

Real-world adaptation (Bank-Financials)

ValueSFT CodeS-7B using augmented data EX 71.4%, HE 85.7%

Baseline3-shot GPT-3.5 EX 52.7%, HE 72.5%

Who Should Care

What To Try In 7 Days

Run CodeS-7B locally on a sample DB to compare latency and accuracy vs your current API-based pipeline.

Apply the schema filter + BM25 value retriever to your prompt pipeline to reduce input size and speed up queries.

Generate a small set (20–30) of real user questions and use the bi-directional augmentation to produce training pairs for quick fine-tuning.

Optimization Features

Token Efficiency

  • schema filtering to reduce prompt tokens

Infra Optimization

  • practical VRAM targets: 7B≈20GB, 15B≈35GB float16

Model Optimization

  • incremental pre-training on domain-specific data

System Optimization

  • FlashAttention-2 for long contexts

Training Optimization

  • mixed-data epochs: more SQL data, fewer NL/code epochs
  • AdamW, cosine lr decay, DeepSpeed ZeRO-3

Inference Optimization

  • smaller model sizes (1B–15B) for faster latency
  • use float16 for deployment

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • CodeS-15B shows signs of slight overfitting to Spider dev versus 7B (Section 9.3).
  • Running the largest model needs 35GB GPU in float16; multi-model fine-tuning per database is costly.
  • BIRD performance still depends on external knowledge in some cases; realistic external-knowledge supply can be hard.

When Not To Use

  • When you lack GPU memory to host at least the 7B float16 model (≈20GB).
  • When you require solutions that rely on extremely large context windows beyond current model limits.
  • When you need a zero-shot solution without any domain examples and cannot tolerate prompt engineering.

Failure Modes

  • Schema-linking mistakes when schema is highly ambiguous despite comments.
  • DBcontent-equivalence perturbations can reduce accuracy (noted in Dr.Spider DB perturbations).
  • Value matching failures if representative values are missing or retrieval index is incomplete.
  • Overfitting during fine-tuning on small domain-specific datasets if augmentation is low quality.

Core Entities

Models

  • CodeS-1B
  • CodeS-3B
  • CodeS-7B
  • CodeS-15B
  • StarCoder
  • StarCoderBase

Metrics

  • Accuracy
  • Valid efficiency score (VES)
  • Human evaluation (HE)

Datasets

  • Spider
  • BIRD
  • Spider-DK
  • Spider-Syn
  • Spider-Realistic
  • Dr.Spider
  • Bank-Financials
  • Aminer-Simplified
  • NL-SQL-458K

Benchmarks

  • Spider
  • BIRD
  • Dr.Spider