Small amounts of code in pre-training measurably boost general LLM abilities across many tasks

August 20, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper runs many large-scale, controlled ablations across model sizes and datasets, so the experimental evidence is strong for the regimes it reports. Two caveats apply: the synthetic code dataset is proprietary, and the authors note limits at the largest scale tested.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Links

Abstract / PDF / Data

Why It Matters For Business

Adding a modest fraction of code to pretraining reliably boosts reasoning and generation, and small high-quality code sets provide large returns. Invest in curated code sources, and keep code in the data that is up-weighted during the final cooldown.

Who Should Care

Summary TLDR

This paper runs large controlled pretraining experiments (470M–2.8B models) to measure how adding code to pretraining data affects non-code tasks. Key findings: a balanced recipe (some code during pretraining, lower code during continual training, plus code in the final cooldown) gives the best overall natural-language results. Compared to a text-only baseline, their best variant yields +8.2% natural-language reasoning, +4.2% world-knowledge, +6.6% generative win-rate, and ~12× code performance on evaluated benchmarks. Small amounts (10%) of high-quality synthetic code give outsized gains (≈9% NL, ≈45% code). Including code in the final cooldown stage gives additional lifts (≈3.6% NL, ≈10.1%

Problem Statement

Practitioners often include code in pretraining mixes, but we lack a systematic, large-scale study of how code affects non-code tasks. The paper asks: how much code, what kind, and at which training stage improves generalization beyond code generation?

Main Contribution

Large controlled ablation suite (64 pretraining runs) studying where and how code helps across NL reasoning, world knowledge, code, and generative quality.

Quantified optimal code proportion for non-code tasks (≈25% code) and documented failure when code dominates.

Key Findings

Adding code to pretraining improves non-code tasks versus text-only.

Numbers: balanced→text vs text-only: +8.2% NL reasoning; +4.2% world knowledge; +6.6% win-rate; ~12× code

Practical Use: Include code in pretraining mixes to lift reasoning and generation; expect large gains in code ability too.

Evidence Ref: Abstract; Section 3.6; Table 2

A small share of high-quality synthetic code has outsized impact.

Numbers: code+synth (10% synth) vs web-code-only: +9% NL; +44.9% code

Practical Use: Invest in a small set of verified, high-quality synthetic code samples; they are a cost-efficient route to large gains.

Evidence Ref: Section 3.4; Figure 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | +8.2% (relative) for balanced→text vs text-only | text-only pretraining | +8.2% (relative) | NL reasoning suite (11 tasks) | Section 3.6; Abstract | Table 2; Sections 3.1, 3.6
Accuracy | +4.2% (relative) for balanced→text vs text-only | text-only pretraining | +4.2% (relative) | TriviaQA, NaturalQuestionsOpen | Section 3.6; Abstract | Table 2; Section 3.1
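The deltas in this table are relative, not absolute percentage points, and the two are easy to confuse when comparing against your own runs. The sketch below shows the difference. The function names and the example scores are hypothetical and are not taken from the paper.

```python
def relative_gain_pct(baseline: float, variant: float) -> float:
    """Relative change of `variant` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (variant - baseline) / baseline * 100.0


def absolute_gain_points(baseline: float, variant: float) -> float:
    """Absolute change in score points (same units as the inputs)."""
    return variant - baseline


# Hypothetical accuracies (in %), chosen only to illustrate the arithmetic.
baseline_acc = 50.0   # text-only model
variant_acc = 54.1    # model trained with code

print("relative gain (%):", round(relative_gain_pct(baseline_acc, variant_acc), 1))
print("absolute gain (points):", round(absolute_gain_points(baseline_acc, variant_acc), 1))
```

With these made-up scores, a gain of about 4.1 points corresponds to a relative gain of about 8.2%, which is why the baseline score matters when reading the Delta column.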

What To Try In 7 Days

If you train models: add ~20–25% code tokens to pretraining mixes and evaluate NL tasks vs text-only.

Run a short cooldown that includes code (even 10–20%) and measure win-rate and reasoning gains.

Create a small, high-quality synthetic code corpus (e.g., verified Python problems) and test its transfer in continual pretraining.
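One way to plan the experiments above is to turn each training stage into an explicit token budget. The following is a minimal sketch, not the paper's pipeline: the `Stage` class, the token totals, and the continual-pretraining code fraction are all assumptions made for illustration. The 25% and 20% values come from the ranges suggested above, and the 75% guard mirrors the failure mode reported under Risks & Boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    total_tokens: int      # hypothetical budget for the stage
    code_fraction: float   # share of tokens drawn from code sources


def token_budget(stage: Stage, max_code_fraction: float = 0.75) -> dict:
    """Split a stage's token budget into code and text tokens."""
    if not 0.0 <= stage.code_fraction <= 1.0:
        raise ValueError("code_fraction must be in [0, 1]")
    if stage.code_fraction > max_code_fraction:
        # The summary reports that >75% code harms non-code tasks.
        raise ValueError(f"{stage.name}: code fraction exceeds {max_code_fraction}")
    code = round(stage.total_tokens * stage.code_fraction)
    return {"code": code, "text": stage.total_tokens - code}


# Hypothetical three-stage plan; only the 0.25 and 0.20 fractions are
# motivated by the text, everything else is a placeholder.
plan = [
    Stage("pretrain", 1_000_000, 0.25),
    Stage("continual", 500_000, 0.10),   # assumed "lower code" value
    Stage("cooldown", 100_000, 0.20),
]

for stage in plan:
    print(stage.name, token_budget(stage))
```

Running the same plan with `code_fraction=0.0` in every stage gives the text-only baseline to compare against.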

Optimization Features

Infra Optimization
TPU v5e used for training (infrastructure detail)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

SlimPajama (public)
The Stack (public)

Risks & Boundaries

Limitations

Safety impacts of adding code to pretraining are not studied.

Synthetic high-quality code dataset is proprietary, limiting exact reproducibility.

When Not To Use

If your goal is pure world-knowledge memorization and you cannot include text sources, avoid code-heavy pretraining (>75% code).

If you cannot verify licensing/quality of code sources, adding noisy code may harm downstream code and NL tasks.

Failure Modes

Too much code (>75%) collapses world knowledge and harms non-code tasks.

Higher model scale amplifies trade-offs: models may favor code tasks at expense of some NL tasks.

Core Entities

Models

decoder-only Transformer (470M parameters)
decoder-only Transformer (2.8B parameters)
balanced-only (50% code / 50% text)
balanced→text (balanced init, then text continual)
code→text (code init, then text continual)

Metrics

Accuracy
exact match
pass@1
generative win-rate

Datasets

SlimPajama (text pretraining)
The Stack (web-based code)
Synthetic verified Python code (proprietary, 3.2B tokens)
Code-adjacent data (commits, notebooks, StackExchange)
Dolly-200 (judging prompts)
HumanEval (Python)
MBPP
NaturalQuestionsOpen
TriviaQA

Benchmarks

Natural language reasoning suite (11 benchmarks)
World knowledge (NaturalQuestionsOpen, TriviaQA)
Code generation (HumanEval, MBPP)
Generative win-rates (LLM-as-a-judge on Dolly-200)
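A generative win-rate from LLM-as-a-judge comparisons can be tallied as in the sketch below. This is an assumption-laden illustration: the label names, the treatment of ties as their own outcome, and the `win_rate` helper are hypothetical, and this summary does not describe the paper's exact judging protocol.

```python
from collections import Counter

VALID_LABELS = ("win", "loss", "tie")  # assumed judge outcomes per prompt


def win_rate(judgments: list) -> dict:
    """Return the share of wins, losses, and ties across judged prompts."""
    counts = Counter(judgments)
    unknown = set(counts) - set(VALID_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments provided")
    return {label: counts[label] / total for label in VALID_LABELS}


# Hypothetical judge outputs for four prompts (candidate vs baseline).
print(win_rate(["win", "win", "loss", "tie"]))
```

Reporting ties separately, rather than dropping them, keeps the denominator equal to the number of prompts, which makes results comparable across model pairs.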