Small amounts of code in pre-training measurably boost general LLM abilities across many tasks

August 20, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

3

Authors

Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Why It Matters For Business

Adding a modest fraction of code to pretraining boosts reasoning and generation in these experiments, and a small high-quality code set delivers large returns. Invest in curated code sources and keep code in the final cooldown (data up-weighting) stage.

Summary TLDR

This paper runs large controlled pretraining experiments (470M–2.8B models) to measure how adding code to pretraining data affects non-code tasks. Key findings: a balanced recipe (some code during pretraining, lower code during continual training, plus code in the final cooldown) gives the best overall natural-language results. Compared to a text-only baseline, their best variant yields +8.2% natural-language reasoning, +4.2% world-knowledge, +6.6% generative win-rate, and ~12× code performance on evaluated benchmarks. Small amounts (10%) of high-quality synthetic code give outsized gains (≈9% NL, ≈45% code). Including code in the final cooldown stage gives additional lifts (≈3.6% NL, ≈10.1%

Problem Statement

Practitioners often include code in pretraining mixes, but we lack a systematic, large-scale study of how code affects non-code tasks. The paper asks: how much code, what kind, and at which training stage improves generalization beyond code generation?

Main Contribution

Large controlled ablation suite (64 pretraining runs) studying where and how code helps across NL reasoning, world knowledge, code, and generative quality.

Quantified optimal code proportion for non-code tasks (≈25% code) and documented failure when code dominates.

Showed small amounts of high-quality synthetic code and including code in cooldown deliver outsized gains.

Provided practical pretraining recipes (balanced→text + cooldown with code) for best overall natural-language performance.
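The recipe in the last item can be sketched as a three-stage data schedule. This is a minimal illustration, not the authors' configuration: the `Stage` class, every token budget, and the code fractions for the continual and cooldown stages are placeholder assumptions. Only the 50/50 balanced initialization mirrors the model description in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: int           # token budget for the stage (hypothetical value)
    code_fraction: float  # share of the stage's tokens drawn from code sources

    def split(self) -> dict:
        """Return the code/text token split for this stage."""
        code = round(self.tokens * self.code_fraction)
        return {"code": code, "text": self.tokens - code}


# Budgets and the last two code fractions are illustrative placeholders,
# not values reported in the paper summary.
SCHEDULE = [
    Stage("balanced_init", tokens=200_000_000_000, code_fraction=0.50),
    Stage("text_continual", tokens=200_000_000_000, code_fraction=0.10),
    Stage("cooldown_with_code", tokens=40_000_000_000, code_fraction=0.20),
]


def overall_code_fraction(schedule) -> float:
    """Token-weighted share of code across all stages."""
    total = sum(stage.tokens for stage in schedule)
    code = sum(stage.split()["code"] for stage in schedule)
    return code / total


for stage in SCHEDULE:
    print(stage.name, stage.split())
print(f"overall code share: {overall_code_fraction(SCHEDULE):.3f}")
```

Keeping the schedule as data makes it easy to compare variants (for example, a cooldown without code) by changing a single field.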

Key Findings

Adding code to pretraining improves non-code tasks versus text-only.

Numbers: balanced→text gives +8.2% NL reasoning; +4.2% world knowledge; +6.6% win-rate; ~12× code

A small share of high-quality synthetic code has outsized impact.

Numbers: code+synth (10% synth) → +9% NL; +44.9% code vs web-code-only

Including code in cooldown (final up-weighting) further improves results.

Numbers: cooldown w/ code → +3.6% NL; +10.1% world knowledge; +20% code vs pre-cooldown

Too much code in pretraining hurts world knowledge and can degrade NL performance.

Numbers: 100% code → up to −86.1% world knowledge; NL drops at high code shares

Scale preserves trends but increases trade-offs for code vs NL tasks.

Numbers: 2.8B models triple many gains vs 470M; code generation trade-off grows with scale

Results

Accuracy (natural-language reasoning)

Value: +8.2% (relative) for balanced→text vs text-only

Baseline: text-only pretraining

Accuracy (world knowledge)

Value: +4.2% (relative) for balanced→text vs text-only

Baseline: text-only pretraining

Generative quality (LLM-as-a-judge win-rate)

Value: +6.6% (relative) better vs text-only for code-including variants; best cooldown win-rate 52.3%

Baseline: text-only pretraining (no cooldown)

Code generation (pass@1 avg on HumanEval/MBPP)

Value: ~12× increase for best code-including full recipe vs text-only

Baseline: text-only pretraining

Synthetic code effect

Value: code+synth (10% synth) → +9% NL; +44.9% code vs web-code-only

Baseline: web-code-only pretraining

Code proportion sensitivity

Value: 25% code is optimal for combined NL+world benchmarks; performance collapses at 100% code

Baseline: 0% and other code shares
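Most figures above are relative changes against a baseline, while the code result is a fold change. The sketch below shows how the two differ. The function names and the example scores are hypothetical and chosen only to illustrate the arithmetic; they are not the paper's raw scores.

```python
def relative_change(new: float, baseline: float) -> float:
    """Percent change of `new` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


def fold_change(new: float, baseline: float) -> float:
    """Multiplicative change, e.g. 12.0 means a 12x increase."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return new / baseline


# Hypothetical scores: a baseline accuracy of 50.0 rising to 54.1
# corresponds to a relative gain of about 8.2%, not 4.1 points.
print(f"{relative_change(54.1, 50.0):.1f}%")

# Hypothetical pass@1 values: 1.0 rising to 12.0 is a 12x fold change.
print(f"{fold_change(12.0, 1.0):.0f}x")
```

A relative gain depends on how low the baseline is, so small absolute baselines (as with code scores for a text-only model) can produce very large multipliers.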

Who Should Care

What To Try In 7 Days

If you train models: add ~20–25% code tokens to pretraining mixes and evaluate NL tasks vs text-only.

Run a short cooldown that includes code (even 10–20%) and measure win-rate and reasoning gains.

Create a small high-quality synthetic code corpus (e.g., verified Python problems) and test its transfer in continual pretraining.
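The first experiment above amounts to sweeping the code share of the sampling mix. A minimal sketch follows, assuming a two-source mix and per-draw sampling; `mix_weights`, `sample_sources`, and the sweep values are hypothetical, and a real pipeline would typically weight by tokens rather than by draws.

```python
import random


def mix_weights(code_fraction: float) -> dict:
    """Sampling weights for a two-source text/code mix."""
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be in [0, 1]")
    return {"text": 1.0 - code_fraction, "code": code_fraction}


def sample_sources(weights: dict, n: int, seed: int = 0) -> list:
    """Draw n source names according to the mix weights (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


# Text-only baseline plus the suggested 20-25% code range.
SWEEP = [0.0, 0.20, 0.25]

for fraction in SWEEP:
    draws = sample_sources(mix_weights(fraction), n=10_000)
    observed = draws.count("code") / len(draws)
    print(f"target code share {fraction:.2f}, observed {observed:.3f}")
```

Each sweep point would then be trained and evaluated on the same NL benchmarks as the text-only baseline, so that the only variable is the code share.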

Optimization Features

Infra Optimization

  • TPU v5e used for training (infrastructure detail)

Reproducibility

Data Urls

  • SlimPajama (public)
  • The Stack (public)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Safety impacts of adding code to pretraining are not studied.
  • Synthetic high-quality code dataset is proprietary, limiting exact reproducibility.
  • Largest model scale studied is 2.8B; behavior at much larger sizes is inferred but not proven.

When Not To Use

  • If your goal is pure world-knowledge memorization, avoid code-heavy pretraining (>75% code), since it crowds out the text sources that carry that knowledge.
  • If you cannot verify licensing/quality of code sources, adding noisy code may harm downstream code and NL tasks.

Failure Modes

  • Too much code (>75%) collapses world knowledge and harms non-code tasks.
  • Higher model scale amplifies trade-offs: models may favor code tasks at expense of some NL tasks.
  • Proprietary or low-quality code can introduce noise and degrade performance.

Core Entities

Models

  • decoder-only Transformer (470M parameters)
  • decoder-only Transformer (2.8B parameters)
  • balanced-only (50% code / 50% text)
  • balanced→text (balanced init, then text continual)
  • code→text (code init, then text continual)

Metrics

  • Accuracy
  • exact match
  • pass@1
  • generative win-rate
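Two of the metrics above can be sketched in a few lines. `pass_at_k` follows the widely used unbiased estimator introduced alongside HumanEval; whether the paper computes pass@1 this way is an assumption. `win_rate` counts ties as half a win, which is also an assumption, since this summary does not say how ties are handled.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def win_rate(verdicts) -> float:
    """Share of judge verdicts won; ties count as half (assumed convention)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("no verdicts given")
    score = sum(1.0 if v == "win" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


# For k=1 the estimator reduces to the fraction of correct samples.
print(pass_at_k(n=10, c=3, k=1))
print(win_rate(["win", "loss", "tie", "win"]))
```

A benchmark-level pass@1 is then the mean of `pass_at_k` over all problems.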

Datasets

  • SlimPajama (text pretraining)
  • The Stack (web-based code)
  • Synthetic verified Python code (proprietary, 3.2B tokens)
  • Code-adjacent data (commits, notebooks, StackExchange)
  • Dolly-200 (judging prompts)
  • HumanEval (Python)
  • MBPP
  • NaturalQuestionsOpen
  • TriviaQA

Benchmarks

  • Natural language reasoning suite (11 benchmarks)
  • World knowledge (NaturalQuestionsOpen, TriviaQA)
  • Code generation (HumanEval, MBPP)
  • Generative win-rates (LLM-as-a-judge on Dolly-200)