Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
Adding a modest fraction of code to pretraining reliably boosts reasoning and generation, and small high-quality code sets provide outsized returns, so invest in curated code sources and include code in the final cooldown (data up-weighting) stage.
Summary TLDR
This paper runs large controlled pretraining experiments (470M–2.8B models) to measure how adding code to pretraining data affects non-code tasks. Key findings: a balanced recipe (some code during pretraining, lower code during continual training, plus code in the final cooldown) gives the best overall natural-language results. Compared to a text-only baseline, their best variant yields +8.2% natural-language reasoning, +4.2% world-knowledge, +6.6% generative win-rate, and ~12× code performance on evaluated benchmarks. Small amounts (10%) of high-quality synthetic code give outsized gains (≈9% NL, ≈45% code). Including code in the final cooldown stage gives additional lifts (≈3.6% NL, ≈10.1%
Problem Statement
Practitioners often include code in pretraining mixes, but we lack a systematic, large-scale study of how code affects non-code tasks. The paper asks: how much code, what kind, and at which training stage improves generalization beyond code generation?
Main Contribution
Large controlled ablation suite (64 pretraining runs) studying where and how code helps across NL reasoning, world knowledge, code, and generative quality.
Quantified optimal code proportion for non-code tasks (≈25% code) and documented failure when code dominates.
Showed that small amounts of high-quality synthetic code, and including code in the cooldown stage, each deliver outsized gains.
Provided practical pretraining recipes (balanced→text + cooldown with code) for best overall natural-language performance.
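The staged recipe named above (balanced pretraining, then text-focused continual training, then a cooldown that includes code) can be written down as a small schedule of data mixtures. The following is a minimal sketch, not the authors' configuration: the `Stage` class, the stage names, the token budgets, and the 10% and 20% code fractions are illustrative assumptions. Only the 50/50 balanced split is taken from this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage with a code/text mixture (hypothetical structure)."""
    name: str
    code_fraction: float  # share of the stage's tokens drawn from code sources
    tokens: int           # token budget for the stage

    def token_split(self) -> dict[str, int]:
        # Round the code share to whole tokens; text takes the remainder.
        code = round(self.tokens * self.code_fraction)
        return {"code": code, "text": self.tokens - code}


# ASSUMPTION: token budgets and the 0.10 / 0.20 fractions are placeholders
# chosen for illustration; they are not values reported by the paper.
RECIPE = [
    Stage("balanced_pretrain", 0.50, 200_000_000_000),
    Stage("text_continual", 0.10, 200_000_000_000),
    Stage("cooldown_with_code", 0.20, 40_000_000_000),
]


def validate(recipe: list[Stage]) -> None:
    """Reject mixtures with impossible fractions or empty budgets."""
    for stage in recipe:
        if not 0.0 <= stage.code_fraction <= 1.0:
            raise ValueError(f"{stage.name}: code_fraction must be in [0, 1]")
        if stage.tokens <= 0:
            raise ValueError(f"{stage.name}: tokens must be positive")


def overall_code_fraction(recipe: list[Stage]) -> float:
    """Share of code tokens across all stages combined."""
    total = sum(stage.tokens for stage in recipe)
    code = sum(stage.token_split()["code"] for stage in recipe)
    return code / total


validate(RECIPE)
for stage in RECIPE:
    print(stage.name, stage.token_split())
print(f"overall code share: {overall_code_fraction(RECIPE):.3f}")
```

Keeping the schedule as data makes it easy to sweep the code fraction per stage, which is the kind of ablation the paper reports.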
Key Findings
Adding code to pretraining improves non-code tasks versus text-only.
A small share of high-quality synthetic code has outsized impact.
Including code in cooldown (final up-weighting) further improves results.
Too much code in pretraining hurts world knowledge and can sharply degrade NL performance.
Trends hold at the larger model scale, but the trade-off between code and NL performance grows.
Results
Accuracy
Generative quality (LLM-as-a-judge win-rate)
Code generation (pass@1 avg on HumanEval/MBPP)
Synthetic code effect
Code proportion sensitivity
Who Should Care
What To Try In 7 Days
If you train models: add ~20–25% code tokens to pretraining mixes and evaluate NL tasks vs text-only.
Run a short cooldown that includes code (even 10–20%) and measure win-rate and reasoning gains.
Create a small, high-quality synthetic code corpus (e.g., verified Python problems) and test its transfer in continual pretraining.
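To compare a code-augmented run against a text-only baseline, it helps to report per-task relative gains. The sketch below assumes the percentages quoted in this summary are relative improvements over the baseline. The function names and the scores in the usage example are hypothetical placeholders, not results from the paper.

```python
def relative_gain(variant: float, baseline: float) -> float:
    """Relative improvement of variant over baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (variant - baseline) / baseline


def compare(variant_scores: dict[str, float],
            baseline_scores: dict[str, float]) -> dict[str, float]:
    """Per-task relative gain for every task present in both runs."""
    return {
        task: relative_gain(variant_scores[task], baseline_scores[task])
        for task in baseline_scores
        if task in variant_scores
    }


# ASSUMPTION: placeholder scores for illustration only.
baseline = {"nl_reasoning": 0.50, "world_knowledge": 0.20}
with_code = {"nl_reasoning": 0.55, "world_knowledge": 0.21}

for task, gain in compare(with_code, baseline).items():
    print(f"{task}: {gain:+.1f}%")
```

Reporting relative rather than absolute differences makes tasks with very different score ranges easier to read side by side, though small baselines can inflate the percentage.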
Optimization Features
Infra Optimization
- TPU v5e used for training (infrastructure detail)
Reproducibility
Data Urls
- SlimPajama (public)
- The Stack (public)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Safety impacts of adding code to pretraining are not studied.
- Synthetic high-quality code dataset is proprietary, limiting exact reproducibility.
- Largest model scale studied is 2.8B; behavior at much larger sizes is inferred but not proven.
When Not To Use
- If your goal is pure world-knowledge memorization and you cannot include text sources, avoid code-heavy pretraining (>75% code).
- If you cannot verify licensing/quality of code sources, adding noisy code may harm downstream code and NL tasks.
Failure Modes
- Too much code (>75%) collapses world knowledge and harms non-code tasks.
- Higher model scale amplifies trade-offs: models may favor code tasks at the expense of some NL tasks.
- Proprietary or low-quality code can introduce noise and degrade performance.
Core Entities
Models
- decoder-only Transformer (470M parameters)
- decoder-only Transformer (2.8B parameters)
- balanced-only (50% code / 50% text)
- balanced→text (balanced init, then text continual)
- code→text (code init, then text continual)
Metrics
- Accuracy
- exact match
- pass@1
- generative win-rate
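Two of the metrics listed above are simple to compute once per-example outcomes are available. The following is a minimal sketch under stated assumptions: pass@1 is estimated as the per-problem fraction of samples that pass all unit tests, averaged over problems, and win-rate is computed over decided comparisons only, with ties excluded. The tie handling and the function names are assumptions made here; this summary does not state how the paper treats ties.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """Average over problems of the fraction of samples that passed.

    results[i][j] is True if sample j for problem i passed all unit tests.
    """
    if not results:
        raise ValueError("no problems given")
    per_problem = []
    for samples in results:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        per_problem.append(sum(samples) / len(samples))
    return sum(per_problem) / len(per_problem)


def win_rate(verdicts: list[str]) -> float:
    """Wins divided by wins plus losses.

    ASSUMPTION: ties are dropped from the denominator.
    """
    wins = verdicts.count("win")
    losses = verdicts.count("loss")
    if wins + losses == 0:
        raise ValueError("no decided comparisons")
    return wins / (wins + losses)


# Hypothetical outcomes for illustration only.
print(pass_at_1([[True, False], [True, True]]))
print(win_rate(["win", "win", "loss", "tie"]))
```

With a single sample per problem, `pass_at_1` reduces to the fraction of problems solved on the first attempt.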
Datasets
- SlimPajama (text pretraining)
- The Stack (web-based code)
- Synthetic verified Python code (proprietary, 3.2B tokens)
- Code-adjacent data (commits, notebooks, StackExchange)
- Dolly-200 (judging prompts)
- HumanEval (Python)
- MBPP
- NaturalQuestionsOpen
- TriviaQA
Benchmarks
- Natural language reasoning suite (11 benchmarks)
- World knowledge (NaturalQuestionsOpen, TriviaQA)
- Code generation (HumanEval, MBPP)
- Generative win-rates (LLM-as-a-judge on Dolly-200)

