Overview
The paper runs many large-scale, controlled ablations across model sizes and datasets, so the experimental evidence is strong for the regimes it reports. Two caveats apply: the synthetic code data is proprietary, and limits at the largest scale are noted.
Citations: 3
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Adding a modest fraction of code to pretraining reliably boosts reasoning and generation, and small high-quality code sets provide large returns. Invest in curated code sources, and keep code in the data that is up-weighted during the final cooldown.
Summary TLDR
This paper runs large controlled pretraining experiments (470M–2.8B models) to measure how adding code to pretraining data affects non-code tasks. Key findings: a balanced recipe (some code during pretraining, lower code during continual training, plus code in the final cooldown) gives the best overall natural-language results. Compared to a text-only baseline, their best variant yields +8.2% natural-language reasoning, +4.2% world-knowledge, +6.6% generative win-rate, and ~12× code performance on evaluated benchmarks. Small amounts (10%) of high-quality synthetic code give outsized gains (≈9% NL, ≈45% code). Including code in the final cooldown stage gives additional lifts (≈3.6% NL, ≈10.1%
Problem Statement
Practitioners often include code in pretraining mixes, but we lack a systematic, large-scale study of how code affects non-code tasks. The paper asks: how much code, what kind, and at which training stage improves generalization beyond code generation?
Main Contribution
Large controlled ablation suite (64 pretraining runs) studying where and how code helps across NL reasoning, world knowledge, code, and generative quality.
Quantified optimal code proportion for non-code tasks (≈25% code) and documented failure when code dominates.
Key Findings
Adding code to pretraining improves non-code tasks versus text-only.
A small share of high-quality synthetic code has outsized impact.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | +8.2% (relative) for balanced→text vs text-only | text-only pretraining | +8.2% (relative) | NL reasoning suite (11 tasks) | Section 3.6; Abstract | Table 2; Sections 3.1, 3.6 |
| Accuracy | +4.2% (relative) for balanced→text vs text-only | text-only pretraining | +4.2% (relative) | TriviaQA, NaturalQuestionsOpen | Section 3.6; Abstract | Table 2; Section 3.1 |
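The deltas in the table are relative, not absolute, improvements over the text-only baseline. A minimal sketch of that arithmetic follows; the function name `relative_delta` and the example scores are hypothetical and are not taken from the paper.

```python
def relative_delta(baseline: float, variant: float) -> float:
    """Return the relative change of `variant` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative delta")
    return (variant - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the formula:
# a baseline of 50.0 and a variant of 54.1 differ by 4.1 points absolute,
# which is a different number from the relative change printed below.
print(f"{relative_delta(50.0, 54.1):.1f}% relative")
```

When comparing your own runs against the reported numbers, it helps to state explicitly which of the two conventions is in use, since absolute points and relative percentages can differ substantially at low baseline scores.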
What To Try In 7 Days
If you train models: add ~20–25% code tokens to pretraining mixes and evaluate NL tasks vs text-only.
Run a short cooldown that includes code (even 10–20%) and measure win-rate and reasoning gains.
Create a small high-quality synthetic code corpus (e.g., verified Python problems) and test its transfer in continual pretraining.
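The suggestions above can be sketched as a per-stage data mix. This is an illustration under stated assumptions, not the paper's recipe: the stage names, the `STAGE_CODE_FRACTION` table, and both helper functions are hypothetical, and the sampler works per document, whereas code proportions are normally measured in tokens.

```python
import random

# Hypothetical per-stage code fractions, picked from the ranges suggested above.
STAGE_CODE_FRACTION = {
    "pretrain": 0.25,  # roughly 20-25% code in the pretraining mix
    "cooldown": 0.20,  # keep some code (10-20%) in the cooldown
}


def sample_sources(stage: str, n_docs: int, seed: int = 0) -> list[str]:
    """Draw a source label ("code" or "text") for each of n_docs documents."""
    rng = random.Random(seed)
    code_fraction = STAGE_CODE_FRACTION[stage]
    return ["code" if rng.random() < code_fraction else "text" for _ in range(n_docs)]


def code_share(sources: list[str]) -> float:
    """Fraction of sampled documents that came from the code source."""
    return sources.count("code") / len(sources)


if __name__ == "__main__":
    for stage in STAGE_CODE_FRACTION:
        share = code_share(sample_sources(stage, n_docs=10_000))
        print(f"{stage}: {share:.1%} of sampled documents are code")
```

Running the same evaluation suite on a text-only mix (a code fraction of 0.0) gives the baseline needed to judge whether the added code helps on your own tasks.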
Risks & Boundaries
Limitations
Safety impacts of adding code to pretraining are not studied.
Synthetic high-quality code dataset is proprietary, limiting exact reproducibility.
When Not To Use
If your goal is pure world-knowledge memorization and you cannot include text sources, avoid code-heavy pretraining (>75% code).
If you cannot verify licensing/quality of code sources, adding noisy code may harm downstream code and NL tasks.
Failure Modes
Too much code (>75%) collapses world knowledge and harms non-code tasks.
Higher model scale amplifies trade-offs: models may favor code tasks at the expense of some NL tasks.
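A simple guard can flag mixes that fall into the code-dominant regime described above before a run is launched. The sketch below is hypothetical: the function name and labels are illustrative, and only the 75% threshold comes from this summary.

```python
CODE_DOMINANT_THRESHOLD = 0.75  # from the ">75% code" failure mode noted above


def classify_code_fraction(code_fraction: float) -> str:
    """Label a pretraining mix by its code fraction (0.0 to 1.0)."""
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be between 0.0 and 1.0")
    if code_fraction > CODE_DOMINANT_THRESHOLD:
        return "code-dominant"  # regime reported to hurt world knowledge
    if code_fraction == 0.0:
        return "text-only"  # the baseline the gains are measured against
    return "mixed"


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.9):
        print(fraction, classify_code_fraction(fraction))
```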

