Find better pretraining data mixes cheaply by merging component models instead of training many proxies

Overview

Decision Snapshot: Ready For Pilot

DeMix demonstrates reliable merged proxies and clear cost savings in their controlled experiments; results are strongest for late-stage pretraining and when components share a common base.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Why It Matters For Business

DeMix lets teams find better pretraining mixtures with far less compute, lowering infrastructure cost and speeding iteration on model capability trade-offs.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist, Founder

Summary TLDR

DeMix trains a small set of large "component" models (one per candidate dataset) and then builds unlimited cheap proxies by weighted linear merging of their weights. These merged proxies let you search many data-mixture ratios without retraining. In experiments, DeMix matches or beats training-based proxy search with far less compute (e.g., macro Spearman ρ ≈ 0.81 using 30B-token components vs much larger budgets for training-based proxies) and yields a final pretraining mixture that ranks best on a multi-domain benchmark suite. The authors also release DeMix Corpora (≈22T tokens) and code.
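The merging step can be sketched as follows. This is a minimal illustration, assuming checkpoints are plain dictionaries of NumPy arrays and that the mixture weights sum to 1; the function name `merge_components` and the toy `web`/`code` checkpoints are hypothetical and not taken from the released code.

```python
import numpy as np

def merge_components(base, components, weights):
    """Linearly merge component checkpoints around a shared base.

    base:        dict of parameter name -> np.ndarray (shared initialization)
    components:  list of dicts with the same keys and shapes as `base`
    weights:     mixture ratios, one per component (assumed to sum to 1)
    """
    weights = np.asarray(weights, dtype=np.float64)
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixture weights are assumed to sum to 1")

    merged = {}
    for name, theta0 in base.items():
        # Weighted sum of weight deltas (component minus base), added back to the base.
        delta = sum(w * (comp[name] - theta0) for w, comp in zip(weights, components))
        merged[name] = theta0 + delta
    return merged

# Toy example: two hypothetical components with a single 2x2 parameter each.
base = {"w": np.zeros((2, 2))}
web = {"w": np.full((2, 2), 1.0)}
code = {"w": np.full((2, 2), 3.0)}
proxy = merge_components(base, [web, code], [0.75, 0.25])
print(proxy["w"][0, 0])  # 0.75 * 1.0 + 0.25 * 3.0 = 1.5
```

Because merging is only arithmetic on stored weights, each new mixture ratio costs one pass over the parameters rather than a training run.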

Problem Statement

Choosing the right mix of pretraining data is crucial but costly: training many proxy models (one per candidate mix) is computationally expensive, and small, cheap proxies can mislead the mixture search. The paper asks: can we predict how a model trained on any continuous mixture will behave without training that model?

Main Contribution

DeMix: a workflow that trains per-dataset component models and uses weighted model merging to synthesize unlimited training-free proxy models for data-mixture search.

Empirical evidence that merged proxies produce higher ranking consistency with large reference models than conventional small-scale training proxies, enabling cheaper and more reliable mixture search.

Key Findings

Merged proxies give strong ranking agreement with large reference models.

Numbers: macro Spearman ρ = 0.81 (DeMix with 30B-token components) vs 0.53 (training-based proxy at similar budget)

Practical Use: You can rank candidate mixtures reliably without re-training each mix; use merged proxies to prioritize top mixes and save compute.

Evidence Ref: Table 2; Sec.4.1
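The ranking-consistency metric can be illustrated with a small pure-Python sketch. It assumes "macro" means the unweighted mean of per-domain Spearman ρ values, which this summary does not spell out; the scores below are made up for illustration only.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def macro_spearman(proxy_scores, reference_scores):
    """Unweighted mean of per-domain rho (an assumed reading of 'macro')."""
    rhos = [spearman(proxy_scores[d], reference_scores[d]) for d in proxy_scores]
    return sum(rhos) / len(rhos)

# Hypothetical benchmark scores for four mixtures in two domains.
proxy_scores = {"code": [0.10, 0.20, 0.30, 0.40], "math": [0.5, 0.4, 0.3, 0.2]}
reference_scores = {"code": [0.15, 0.25, 0.35, 0.50], "math": [0.6, 0.5, 0.2, 0.3]}
print(macro_spearman(proxy_scores, reference_scores))
```

A value near 1 means the merged proxies order the candidate mixtures the same way the reference models do, which is what matters when the proxies are used only to pick the top mixes.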

DeMix cuts search training cost roughly 6.4× at similar proxy fidelity.

Numbers: requires ≈211B total tokens vs ≈1344B for similar accuracy (≈6.4× reduction)

Practical Use: Expect large compute savings when tuning late-stage pretraining mixes; reinvest saved budget into more search trials or larger component models.

Evidence Ref: Intro; Table 2; Sec.4.1
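The reduction factor follows directly from the two token budgets quoted above; a short check, with variable names chosen here for illustration:

```python
demix_tokens_b = 211      # total training tokens for DeMix, in billions
baseline_tokens_b = 1344  # tokens for training-based proxies at similar fidelity

reduction = baseline_tokens_b / demix_tokens_b
saved_fraction = 1 - demix_tokens_b / baseline_tokens_b
print(f"{reduction:.1f}x fewer tokens, {saved_fraction:.0%} of the budget saved")
```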

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Proxy ranking consistency (macro Spearman's ρ)	0.81 (DeMix with 30B-token components)	0.53 (training-based proxies at comparable budget)	+0.28	96 reference mixtures; general+code+math macro	Table 2; Sec.4.1	Table 2
Capability Recovery Rate (merged vs reference)	0.845 (linear merging)	0.77–0.85 range for other merging methods	best in ablation	96 mixtures; mean macro	Table 4; Sec.4.3.1	Table 4

What To Try In 7 Days

Train a small set of component models (one per candidate dataset) from a shared base and merge them using linear weighted averaging to build cheap proxies.

Run the LightGBM predictor on merged-proxy evaluations to shortlist top mixture ratios, then train one final model on the selected mixture.

Test the effect of including ~50% general data in each component to preserve broad capabilities while specializing on domain data.
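The shortlist step can be sketched as below. This is only an outline under loud assumptions: `evaluate_merged_proxy` is a hypothetical stand-in for "merge the components at these ratios and run the benchmarks", and a plain least-squares fit on quadratic features replaces the LightGBM predictor so the example depends only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 3  # e.g. general, code, and math components

def evaluate_merged_proxy(ratios):
    """Hypothetical stand-in for merging at `ratios` and scoring on benchmarks."""
    target = np.array([0.5, 0.3, 0.2])  # made-up "best" mixture for the toy score
    return 1.0 - np.sum((ratios - target) ** 2)

def features(ratios):
    """Mixture ratios, their squares, and a bias column."""
    return np.hstack([ratios, ratios ** 2, np.ones((len(ratios), 1))])

# 1) Score a modest number of merged proxies (cheap: no training involved).
train_ratios = rng.dirichlet(np.ones(n_datasets), size=64)
train_scores = np.array([evaluate_merged_proxy(r) for r in train_ratios])

# 2) Fit the predictor (least squares here; the source uses LightGBM).
coef, *_ = np.linalg.lstsq(features(train_ratios), train_scores, rcond=None)

# 3) Rank a much larger candidate pool with the predictor and keep the top few.
candidates = rng.dirichlet(np.ones(n_datasets), size=10_000)
predicted = features(candidates) @ coef
shortlist = candidates[np.argsort(predicted)[::-1][:5]]
print(shortlist.round(2))
```

Sampling ratios from a Dirichlet distribution keeps every candidate on the simplex (non-negative, summing to 1). Only the single best shortlisted mixture then needs a real training run.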

Optimization Features

Token Efficiency

search without extra training token cost

Infra Optimization

fewer large-scale training runs; less GPU-hour consumption

Model Optimization

weighted model merging to synthesize proxies; linear averaging of weight deltas

Training Optimization

decouple mixture search from proxy training; train component models once and reuse

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/Lucius-lsr/DeMix

Data URLs

https://github.com/Lucius-lsr/DeMix

Risks & Boundaries

Limitations

Model merging relies on small parameter updates; it breaks down if component updates are large.

All component models must share the same architecture and base initialization.
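Both limitations can be checked before merging. The sketch below assumes checkpoints are dictionaries of NumPy arrays; the function name `check_mergeable` and the `max_relative_update` threshold are illustrative choices, not values from the source.

```python
import numpy as np

def check_mergeable(base, component, max_relative_update=0.1):
    """Return (ok, reason) for one component against the shared base.

    `max_relative_update` is an illustrative threshold, not a value from the source.
    """
    if base.keys() != component.keys():
        return False, "parameter names differ (different architecture?)"
    sq_delta, sq_base = 0.0, 0.0
    for name, theta0 in base.items():
        if theta0.shape != component[name].shape:
            return False, f"shape mismatch in {name}"
        sq_delta += float(np.sum((component[name] - theta0) ** 2))
        sq_base += float(np.sum(theta0 ** 2))
    # Size of the update relative to the base weights (ratio of L2 norms).
    relative = (sq_delta / sq_base) ** 0.5
    if relative > max_relative_update:
        return False, f"relative update {relative:.3f} exceeds {max_relative_update}"
    return True, f"relative update {relative:.3f}"

base = {"w": np.ones((4, 4))}
near = {"w": np.ones((4, 4)) * 1.01}  # small update: safe to merge under this check
far = {"w": np.ones((4, 4)) * 2.0}    # large update: flagged
print(check_mergeable(base, near))
print(check_mergeable(base, far))
```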

When Not To Use

You lack the compute to train component models at a meaningful scale.

Candidate datasets are extremely heterogeneous or require different architectures.

Failure Modes

Merged proxy misranking for rare-domain mixtures or where updates are not small.

Insufficient general-data mixing in components leads to degraded proxy fidelity.

Core Entities

Models

Qwen3-1.7B, Qwen3-0.6B, Qwen3-235B-A22B

Metrics

Spearman's ρ, Capability Recovery Rate, Benchmark score, Average rank

Datasets

DeMix Corpora, FineWeb-Edu, DCLM-baseline, DOLMA-v1.7, Nemotron-Pretrain, SmolLM-Corpus

Benchmarks

ARC-E, HellaSwag, WinoGrande, PIQA, SIQA, HumanEval, MBPP, GSM8K, MATH, OpenCompass

Context Entities

Models

Merge methods: Multi-SLERP, Breadcrumbs, DARE, DELLA, TIES

Metrics

Top-25% Spearman's ρ

Datasets

FineMath, MegaMath, OpenCoder

Benchmarks

OpenCompass evaluation suite
