Small amounts of code in pre-training measurably boost general LLM abilities across many tasks

Overview

Decision Snapshot: Ready For Pilot

The paper runs many large-scale, controlled ablations across model sizes and datasets, so the experimental evidence is strong for the reported regimes. Two caveats apply: the synthetic code data is proprietary, and limits at the largest scale are noted.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

Links

Abstract / PDF / Data

Why It Matters For Business

Adding a modest fraction of code to pretraining reliably boosts reasoning and generation, and small high-quality code sets give large returns. Invest in curated code sources, and keep code in the data that is up-weighted during the final cooldown.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper runs large controlled pretraining experiments (470M–2.8B models) to measure how adding code to pretraining data affects non-code tasks. Key findings: a balanced recipe (some code during pretraining, lower code during continual training, plus code in the final cooldown) gives the best overall natural-language results. Compared to a text-only baseline, their best variant yields +8.2% natural-language reasoning, +4.2% world-knowledge, +6.6% generative win-rate, and ~12× code performance on evaluated benchmarks. Small amounts (10%) of high-quality synthetic code give outsized gains (≈9% NL, ≈45% code). Including code in the final cooldown stage gives additional lifts (≈3.6% NL, ≈10.1%

Problem Statement

Practitioners often include code in pretraining mixes, but we lack a systematic, large-scale study of how code affects non-code tasks. The paper asks: how much code, what kind, and at which training stage improves generalization beyond code generation?

Main Contribution

Large controlled ablation suite (64 pretraining runs) studying where and how code helps across NL reasoning, world knowledge, code, and generative quality.

Quantified optimal code proportion for non-code tasks (≈25% code) and documented failure when code dominates.

Key Findings

Adding code to pretraining improves non-code tasks versus text-only.

Numbers: balanced→text vs. text-only: +8.2% NL reasoning; +4.2% world knowledge; +6.6% win-rate; ~12× code

Practical Use: Include code in pretraining mixes to lift reasoning and generation; expect large gains in code ability too.

Evidence Ref: Abstract; Section 3.6; Table 2

A small share of high-quality synthetic code has outsized impact.

Numbers: code+synth (10% synth): +9% NL; +44.9% code vs. web-code-only

Practical Use: Invest in a small set of verified, high-quality synthetic code samples; they are a cost-efficient way to get large gains.

Evidence Ref: Section 3.4, Figure 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	+8.2% (relative) for balanced→text vs text-only	text-only pretraining	+8.2% (relative)	NL reasoning suite (11 tasks)	Section 3.6; Abstract	Table 2; Sections 3.1, 3.6
Accuracy	+4.2% (relative) for balanced→text vs text-only	text-only pretraining	+4.2% (relative)	TriviaQA, NaturalQuestionsOpen	Section 3.6; Abstract	Table 2; Section 3.1
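The deltas in the table are relative, not absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the two scores are hypothetical and are not taken from the paper.

```python
def relative_delta(variant: float, baseline: float) -> float:
    """Return the change of `variant` over `baseline` as a percentage of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (variant - baseline) / baseline


# Hypothetical absolute accuracies, chosen only to illustrate the calculation.
baseline_acc = 40.0   # text-only model (assumed value)
variant_acc = 42.0    # code-augmented model (assumed value)

# A 2-point absolute gain on a 40-point baseline is a 5% relative gain.
print(f"{relative_delta(variant_acc, baseline_acc):+.1f}%")
```

When comparing against your own baseline, report which convention you use, since a relative gain always looks larger than the same gain in absolute points when the baseline is below 100.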

What To Try In 7 Days

If you train models: add ~20–25% code tokens to pretraining mixes and evaluate NL tasks vs text-only.

Run a short cooldown that includes code (even 10–20%) and measure win-rate and reasoning gains.

Create a small high-quality synthetic code corpus (e.g., verified Python problems) and test its transfer in continual pretraining.
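To plan the first two experiments above, it can help to turn a code fraction into per-source token budgets for each stage. The sketch below is one way to do that, not the paper's tooling; the `Stage` class, the stage names, and the token totals are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    total_tokens: int      # token budget for the stage (assumed value)
    code_fraction: float   # share of tokens drawn from code sources

    def budget(self) -> dict[str, int]:
        """Split the stage's token budget into code and text tokens."""
        if not 0.0 <= self.code_fraction <= 1.0:
            raise ValueError("code_fraction must be in [0, 1]")
        code = round(self.total_tokens * self.code_fraction)
        return {"code": code, "text": self.total_tokens - code}


# Hypothetical plan: ~25% code in pretraining, 20% code in a short cooldown.
stages = [
    Stage("pretrain", 200_000_000_000, 0.25),
    Stage("cooldown", 40_000_000_000, 0.20),
]

for stage in stages:
    print(stage.name, stage.budget())
```

Keeping the fraction explicit per stage makes it easy to rerun the same plan with a text-only control (`code_fraction=0.0`) for comparison.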

Optimization Features

Infra Optimization

TPU v5e used for training (infrastructure detail)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

SlimPajama (public)

The Stack (public)

Risks & Boundaries

Limitations

Safety impacts of adding code to pretraining are not studied.

Synthetic high-quality code dataset is proprietary, limiting exact reproducibility.

When Not To Use

If your goal is pure world-knowledge memorization and you cannot include text sources, avoid code-heavy pretraining (>75% code).

If you cannot verify licensing/quality of code sources, adding noisy code may harm downstream code and NL tasks.

Failure Modes

Too much code (>75%) collapses world knowledge and harms non-code tasks.

Higher model scale amplifies trade-offs: models may favor code tasks at the expense of some NL tasks.

Core Entities

Models

decoder-only Transformer (470M parameters)

decoder-only Transformer (2.8B parameters)

balanced-only (50% code / 50% text)

balanced→text (balanced init, then text continual)

code→text (code init, then text continual)

Metrics

Accuracy

exact match

pass@1

generative win-rate
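Two of these metrics are simple ratios and can be sketched in a few lines. This assumes one generated sample per problem for pass@1, and it assumes ties count in the win-rate denominator; the paper's exact conventions may differ. Function names and verdict labels are hypothetical.

```python
from collections import Counter


def pass_at_1(passed: list[bool]) -> float:
    """Fraction of problems whose single generated sample passes its unit tests."""
    if not passed:
        raise ValueError("need at least one problem")
    return sum(passed) / len(passed)


def win_rate(verdicts: list[str]) -> float:
    """Fraction of pairwise judgments won; ties stay in the denominator (assumption)."""
    counts = Counter(verdicts)
    total = counts["win"] + counts["loss"] + counts["tie"]
    if total == 0 or total != len(verdicts):
        raise ValueError("verdicts must be non-empty and use only win/loss/tie")
    return counts["win"] / total


# Toy inputs for illustration only.
print(pass_at_1([True, False, True, True]))
print(win_rate(["win", "loss", "tie", "win"]))
```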

Datasets

SlimPajama (text pretraining)

The Stack (web-based code)

Synthetic verified Python code (proprietary, 3.2B tokens)

Code-adjacent data (commits, notebooks, StackExchange)

Dolly-200 (judging prompts)

HumanEval (Python)

MBPP

NaturalQuestionsOpen

TriviaQA

Benchmarks

Natural language reasoning suite (11 benchmarks)

World knowledge (NaturalQuestionsOpen, TriviaQA)

Code generation (HumanEval, MBPP)

Generative win-rates (LLM-as-a-judge on Dolly-200)

