Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.8
Citation Count: 0
Why It Matters For Business
DeMix lets teams find better pretraining mixtures with far less compute, lowering infrastructure cost and speeding iteration on model capability trade-offs.
Summary TLDR
DeMix trains a small set of large "component" models (one per candidate dataset) and then builds an unlimited number of cheap proxies by weighted linear merging of their weights. These merged proxies let you search over many data-mixture ratios without retraining. In experiments, DeMix matches or beats training-based proxy search with far less compute (e.g., macro Spearman ρ ≈ 0.81 using 30B-token components, versus much larger budgets for training-based proxies) and yields a final pretraining mixture that ranks best on a multi-domain benchmark suite. The authors also release DeMix Corpora (≈22T tokens) and code.
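The merging step can be illustrated with a minimal sketch. It assumes that each model is stored as a mapping from parameter name to a flat list of floats, that every component was fine-tuned from the same base, and that the candidate mixture ratios are used directly as merge weights after normalisation. The function name `merge_components` and the data layout are hypothetical and not taken from the released code.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_components(base: Weights, components: List[Weights],
                     ratios: List[float]) -> Weights:
    """Build a proxy as base + sum_i ratio_i * (component_i - base)."""
    if len(components) != len(ratios):
        raise ValueError("need exactly one ratio per component")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    ratios = [r / total for r in ratios]  # normalise so the ratios sum to 1

    merged: Weights = {}
    for name, base_values in base.items():
        out = list(base_values)  # start from the shared base weights
        for comp, r in zip(components, ratios):
            # add this component's weighted delta relative to the base
            for k, (c, b) in enumerate(zip(comp[name], base_values)):
                out[k] += r * (c - b)
        merged[name] = out
    return merged
```

Because the normalised ratios sum to 1, this is equivalent to a plain weighted average of the component weights; writing it in delta form makes the dependence on a shared base explicit.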
Problem Statement
Choosing the right mix of pretraining data is crucial but costly: training one proxy model per candidate mix is computationally expensive, and small, cheap proxies can mislead the mixture search. The paper asks: can we predict how a model trained on any continuous mixture will behave without training that model?
Main Contribution
- DeMix: a workflow that trains per-dataset component models and uses weighted model merging to synthesize unlimited training-free proxy models for data-mixture search.
- Empirical evidence that merged proxies produce higher ranking consistency with large reference models than conventional small-scale training proxies, enabling cheaper and more reliable mixture search.
- DeMix Corpora: a curated 22T-token pretraining dataset with validated mixture ratios, plus code to reproduce the pipeline.
Key Findings
- Merged proxies give strong ranking agreement with large reference models.
- DeMix cuts the training cost of mixture search by a multiple at similar proxy fidelity.
- Linear weighted merging is simple and effective.
- Including general-domain data in component training is important.
- DeMix yields a better final pretraining mixture in multi-domain tests.
Results
Proxy ranking consistency (macro Spearman's ρ)
Capability Recovery Rate (merged vs reference)
Final multi-domain average rank (lower better)
Compute budget for similar proxy quality
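As an illustration of the ranking-consistency metric, the sketch below computes Spearman's ρ between proxy scores and reference-model scores over the same set of mixtures, then averages it across benchmarks. Treating "macro" as an unweighted mean over benchmarks is an assumption here, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(vx * vy)


def macro_spearman(proxy: Dict[str, Sequence[float]],
                   reference: Dict[str, Sequence[float]]) -> float:
    """Assumed definition: unweighted mean of per-benchmark rho values."""
    rhos = [spearman_rho(proxy[name], reference[name]) for name in reference]
    return sum(rhos) / len(rhos)
```

Each dictionary maps a benchmark name to one score per candidate mixture, listed in the same mixture order for the proxy and the reference.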
Who Should Care
What To Try In 7 Days
- Train a small set of component models (one per candidate dataset) from a shared base and merge them using linear weighted averaging to build cheap proxies.
- Run the LightGBM predictor on merged-proxy evaluations to shortlist top mixture ratios, then train one final model on the selected mixture.
- Test the effect of including ~50% general data in each component to preserve broad capabilities while specializing on domain data.
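The shortlisting step above could be prototyped as follows. This is a sketch under stated assumptions, not the authors' pipeline: candidate ratios are drawn uniformly from the simplex, and `predict_score` is a hypothetical callable standing in for whatever scores a mixture (for example, a regressor fitted on merged-proxy evaluations). LightGBM itself is deliberately not used here so the example stays dependency-free.

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_mixtures(n_datasets: int, n_samples: int,
                    seed: int = 0) -> List[List[float]]:
    """Draw mixture ratios uniformly from the simplex (Dirichlet(1, ..., 1))."""
    rng = random.Random(seed)
    mixtures = []
    for _ in range(n_samples):
        # normalised exponential draws are uniformly distributed on the simplex
        draws = [rng.expovariate(1.0) for _ in range(n_datasets)]
        total = sum(draws)
        mixtures.append([d / total for d in draws])
    return mixtures


def shortlist(mixtures: Sequence[Sequence[float]],
              predict_score: Callable[[Sequence[float]], float],
              top_k: int = 5) -> List[Tuple[float, Sequence[float]]]:
    """Score every candidate and return the top_k (score, mixture) pairs."""
    scored = [(predict_score(m), m) for m in mixtures]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In practice `predict_score` would wrap the trained predictor, and the mixture with the best predicted score would be the one used for the final training run.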
Optimization Features
Token Efficiency
- search without extra training token cost
Infra Optimization
- fewer large-scale training runs; less GPU-hour consumption
Model Optimization
- weighted model merging to synthesize proxies
- linear averaging of weight deltas
Training Optimization
- decouple mixture search from proxy training
- train component models once and reuse
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Model merging relies on small parameter updates; it breaks down when component updates are large.
- All component models must share the same architecture and base initialization.
- Search is limited to the span of provided candidate datasets; you cannot discover novel data sources this way.
- An overly large proxy pool can add noise and cause overfitting.
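The first two limitations suggest a simple pre-merge diagnostic. The sketch below checks that a component has the same parameter names and shapes as the base and reports how far it has moved, as the ratio of the update norm to the base norm. The function name and data layout are hypothetical, and the source gives no threshold for what counts as "small", so any cutoff applied to this number would be an assumption to calibrate empirically.

```python
import math
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def relative_update_norm(base: Weights, component: Weights) -> float:
    """Return ||component - base|| / ||base|| over all parameters."""
    if base.keys() != component.keys():
        raise ValueError("component and base have different parameter names")
    delta_sq = 0.0
    base_sq = 0.0
    for name, base_values in base.items():
        comp_values = component[name]
        if len(comp_values) != len(base_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        for b, c in zip(base_values, comp_values):
            delta_sq += (c - b) ** 2
            base_sq += b ** 2
    if base_sq == 0:
        raise ValueError("base weights are all zero")
    return math.sqrt(delta_sq / base_sq)
```

Comparing this value across components gives a rough sense of which ones have drifted furthest from the shared base and may therefore merge less reliably.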
When Not To Use
- You lack the compute to train component models at a meaningful scale.
- Candidate datasets are extremely heterogeneous or require different architectures.
- Early-stage pretraining where the model lacks base capabilities to absorb domain signals.
Failure Modes
- Merged proxies can misrank rare-domain mixtures or mixtures whose component updates are not small.
- Insufficient general-data mixing in components leads to degraded proxy fidelity.
- Excessive proxy count can decrease final mixture quality via noise/overfitting.
Core Entities
Models
- Qwen3-1.7B
- Qwen3-0.6B
- Qwen3-235B-A22B
Metrics
- Spearman's ρ
- Capability Recovery Rate
- Benchmark score
- Average rank
Datasets
- DeMix Corpora
- FineWeb-Edu
- DCLM-baseline
- DOLMA-v1.7
- Nemotron-Pretrain
- SmolLM-Corpus
Benchmarks
- ARC-E
- HellaSwag
- WinoGrande
- PIQA
- SIQA
- HumanEval
- MBPP
- GSM8K
- MATH
- OpenCompass
Context Entities
Models
- Merge methods: Multi-SLERP, Breadcrumbs, DARE, DELLA, TIES
Metrics
- Top-25% Spearman's ρ
Datasets
- FineMath, MegaMath, OpenCoder
Benchmarks
- OpenCompass evaluation suite

