Find better pretraining data mixes cheaply by merging component models instead of training many proxies

January 31, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

DeMix demonstrates reliable merged proxies and clear cost savings in the authors' controlled experiments; results are strongest for late-stage pretraining and when components share a common base.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

DeMix lets teams find better pretraining mixtures with far less compute, lowering infrastructure cost and speeding iteration on model capability trade-offs.

Who Should Care

Summary TLDR

DeMix trains a small set of large "component" models (one per candidate dataset) and then builds unlimited cheap proxies by weighted linear merging of their weights. These merged proxies let you search many data-mixture ratios without retraining. In experiments, DeMix matches or beats training-based proxy search with far less compute (e.g., macro Spearman ρ ≈ 0.81 using 30B-token components vs much larger budgets for training-based proxies) and yields a final pretraining mixture that ranks best on a multi-domain benchmark suite. The authors also release DeMix Corpora (≈22T tokens) and code.
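
The weighted linear merging described above can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes every component was trained from the same base checkpoint, that the mixture weights sum to 1, and it uses a hypothetical toy representation in which a model is a dict of flat float lists. The names `merge_components`, `code_model`, and `math_model` are illustrative only.

```python
from typing import Dict, List, Sequence

# Hypothetical toy representation: a "model" maps a parameter name to a
# flat list of floats. Real checkpoints would hold tensors instead.
Params = Dict[str, List[float]]


def merge_components(base: Params,
                     components: Sequence[Params],
                     weights: Sequence[float]) -> Params:
    """Return base + sum_i weights[i] * (components[i] - base)."""
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component model")
    if abs(sum(weights) - 1.0) > 1e-6:
        # Assumption of this sketch: weights are mixture ratios summing to 1.
        raise ValueError("mixture weights are assumed to sum to 1")
    merged: Params = {}
    for name, base_values in base.items():
        out = list(base_values)
        for comp, w in zip(components, weights):
            comp_values = comp[name]
            if len(comp_values) != len(base_values):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            # Add the weighted delta of this component relative to the base.
            for k, (c, b) in enumerate(zip(comp_values, base_values)):
                out[k] += w * (c - b)
        merged[name] = out
    return merged


# Made-up numbers, for illustration only.
base = {"layer.w": [1.0, 2.0]}
code_model = {"layer.w": [1.2, 2.0]}   # component trained on a code dataset
math_model = {"layer.w": [1.0, 2.4]}   # component trained on a math dataset
proxy = merge_components(base, [code_model, math_model], [0.75, 0.25])
```

Because merging is only arithmetic over stored weights, each additional proxy costs no training tokens; the only training cost is the one-time component runs.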

Problem Statement

Choosing the right mix of pretraining data is crucial but costly: training many proxy models (one per candidate mix) is computationally expensive, and small, cheap proxies can mislead the mixture search. The paper asks: can we predict how a model trained on any continuous mixture will behave without training that model?

Main Contribution

DeMix: a workflow that trains per-dataset component models and uses weighted model merging to synthesize unlimited training-free proxy models for data-mixture search.

Empirical evidence that merged proxies produce higher ranking consistency with large reference models than conventional small-scale training proxies, enabling cheaper and more reliable mixture search.

Key Findings

Merged proxies give strong ranking agreement with large reference models.

Numbers: macro Spearman ρ = 0.81 (DeMix with 30B-token components) vs 0.53 (training-based proxy at similar budget)

Practical Use: You can rank candidate mixtures reliably without re-training each mix; use merged proxies to prioritize top mixes and save compute.

Evidence Ref: Table 2; Sec. 4.1
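
To make the ranking-agreement metric concrete, here is a dependency-free sketch of Spearman's ρ between proxy scores and reference-model scores over the same candidate mixtures. It assumes that "macro" means the unweighted mean of per-domain correlations (general, code, math); that reading is not confirmed by this summary. All scores below are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one side is constant")
    return cov / (vx * vy) ** 0.5


def macro_spearman(
    per_domain: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Assumed definition: unweighted mean of per-domain rho values."""
    rhos = [spearman_rho(p, r) for p, r in per_domain.values()]
    return sum(rhos) / len(rhos)


# Four candidate mixtures; made-up proxy and reference scores per domain.
scores = {
    "general": ([0.41, 0.47, 0.39, 0.52], [0.58, 0.61, 0.55, 0.66]),
    "code":    ([0.10, 0.20, 0.30, 0.40], [0.15, 0.35, 0.25, 0.45]),
}
macro = macro_spearman(scores)
```

A value near 1 means the proxy orders mixtures the same way the reference model does, which is what matters when the proxy is used only to pick the top candidates.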

DeMix cuts the training cost of mixture search by about 6.4× at similar proxy fidelity.

Numbers: requires ≈211B total tokens vs ≈1344B for similar accuracy (≈6.4× reduction)

Practical Use: Expect large compute savings when tuning late-stage pretraining mixes; reinvest saved budget into more search trials or larger component models.

Evidence Ref: Intro; Table 2; Sec. 4.1
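
The reduction factor follows directly from the two token budgets quoted above. A quick check of the arithmetic, using only those two figures (the variable names are illustrative):

```python
demix_tokens_b = 211       # approx. total training tokens (billions), DeMix
baseline_tokens_b = 1344   # approx. tokens (billions), training-based proxies

reduction = baseline_tokens_b / demix_tokens_b
saved_fraction = 1 - demix_tokens_b / baseline_tokens_b
print(f"{reduction:.1f}x fewer tokens, {saved_fraction:.0%} of the budget saved")
```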

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Proxy ranking consistency (macro Spearman's ρ) | 0.81 (DeMix with 30B-token components) | 0.53 (training-based proxies at comparable budget) | +0.28 | 96 reference mixtures; general+code+math macro | Table 2; Sec. 4.1 | Table 2
Capability Recovery Rate (merged vs reference) | 0.845 (linear merging) | 0.77-0.85 range for other merging methods | best in ablation | 96 mixtures; mean macro | Table 4; Sec. 4.3.1 | Table 4

What To Try In 7 Days

Train a small set of component models (one per candidate dataset) from a shared base and merge them using linear weighted averaging to build cheap proxies.

Run the LightGBM predictor on merged-proxy evaluations to shortlist top mixture ratios, then train one final model on the selected mixture.

Test the effect of including ~50% general data in each component to preserve broad capabilities while specializing on domain data.
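
The shortlist step in the list above can be sketched as a small search loop: evaluate merged proxies on a few sampled mixtures, fit a predictor on those (ratio, score) pairs, then rank many more candidates with the predictor. This sketch is built on assumptions. A k-nearest-neighbour average stands in for the LightGBM predictor so that the example needs no third-party packages, and `toy_evaluate` is a hypothetical placeholder for "merge the components at this ratio and run the benchmarks".

```python
import random
from typing import Callable, List, Sequence, Tuple

Mixture = Tuple[float, ...]


def sample_mixture(n_datasets: int, rng: random.Random) -> Mixture:
    """Draw one point uniformly from the probability simplex."""
    draws = [rng.expovariate(1.0) for _ in range(n_datasets)]
    total = sum(draws)
    return tuple(d / total for d in draws)


def knn_predict(train_x: Sequence[Mixture], train_y: Sequence[float],
                query: Mixture, k: int = 3) -> float:
    """Stand-in regressor (NOT LightGBM): mean score of k nearest mixtures."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def shortlist(evaluate_proxy: Callable[[Mixture], float], n_datasets: int,
              n_evaluated: int = 32, n_candidates: int = 500,
              top_k: int = 5, seed: int = 0) -> List[Mixture]:
    rng = random.Random(seed)
    # 1) Score a small set of mixtures with real merged-proxy evaluations.
    evaluated = [sample_mixture(n_datasets, rng) for _ in range(n_evaluated)]
    scores = [evaluate_proxy(m) for m in evaluated]
    # 2) Rank a much larger candidate pool using the cheap predictor.
    candidates = [sample_mixture(n_datasets, rng) for _ in range(n_candidates)]
    ranked = sorted(candidates,
                    key=lambda m: knn_predict(evaluated, scores, m),
                    reverse=True)
    return ranked[:top_k]


def toy_evaluate(mixture: Mixture) -> float:
    """Placeholder score: higher when closer to an arbitrary made-up target."""
    target = (0.5, 0.3, 0.2)
    return -sum((a - b) ** 2 for a, b in zip(mixture, target))


top = shortlist(toy_evaluate, n_datasets=3)
```

Swapping `knn_predict` for a gradient-boosted regressor and `toy_evaluate` for real merged-proxy benchmark runs would bring the sketch closer to the workflow described, but the sample sizes and pool sizes here are arbitrary.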

Optimization Features

Token Efficiency: search without extra training token cost
Infra Optimization: fewer large-scale training runs; less GPU-hour consumption
Model Optimization: weighted model merging to synthesize proxies; linear averaging of weight deltas
Training Optimization: decouple mixture search from proxy training; train component models once and reuse

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Model merging relies on small parameter updates; it breaks down if component updates are large.

All component models must share the same architecture and base initialization.
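
One way to screen for both limitations before merging is to compare each component against the shared base. The sketch below is an assumption-laden illustration: it checks that parameter names and sizes match, then reports the relative L2 norm of the update. What counts as "small" is not specified in this summary, so any cutoff applied to the ratio is your own choice. The toy dict-of-lists model format is hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def relative_update_norm(base: Params, component: Params) -> float:
    """Return ||component - base||_2 / ||base||_2 over all parameters."""
    if base.keys() != component.keys():
        raise ValueError("component and base have different parameter names")
    delta_sq = 0.0
    base_sq = 0.0
    for name, base_values in base.items():
        comp_values = component[name]
        if len(comp_values) != len(base_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        delta_sq += sum((c - b) ** 2 for c, b in zip(comp_values, base_values))
        base_sq += sum(b ** 2 for b in base_values)
    if base_sq == 0.0:
        raise ValueError("base model has zero norm")
    return math.sqrt(delta_sq / base_sq)


# Made-up numbers: the update has norm 0.5 and the base has norm 5.0.
base = {"w": [3.0, 4.0]}
component = {"w": [3.0, 4.5]}
ratio = relative_update_norm(base, component)
```

Tracking this ratio across components gives an early warning when one component has drifted much further from the base than the others.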

When Not To Use

You cannot afford to train the component models at a meaningful scale (e.g., no budget for one training run per candidate dataset).

Candidate datasets are extremely heterogeneous or require different architectures.

Failure Modes

Merged proxies can misrank rare-domain mixtures, or mixtures whose component updates are not small.

Insufficient general-data mixing in components leads to degraded proxy fidelity.

Core Entities

Models

Qwen3-1.7B, Qwen3-0.6B, Qwen3-235B-A22B

Metrics

Spearman's ρ, Capability Recovery Rate, Benchmark score, Average rank

Datasets

DeMix Corpora, FineWeb-Edu, DCLM-baseline, DOLMA-v1.7, Nemotron-Pretrain, SmolLM-Corpus

Benchmarks

ARC-E, HellaSwag, WinoGrande, PIQA, SIQA, HumanEval, MBPP, GSM8K, MATH, OpenCompass

Context Entities

Models

Merge methods: Multi-SLERP, Breadcrumbs, DARE, DELLA, TIES

Metrics

Top-25% Spearman's ρ

Datasets

FineMath, MegaMath, OpenCoder

Benchmarks

OpenCompass evaluation suite