Overview
DeMix demonstrates reliable merged proxies and clear cost savings in the authors' controlled experiments; results are strongest for late-stage pretraining and when all components share a common base.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
DeMix lets teams find better pretraining mixtures with far less compute, lowering infrastructure cost and speeding iteration on model capability trade-offs.
Summary TLDR
DeMix trains a small set of large "component" models (one per candidate dataset) and then builds unlimited cheap proxies by weighted linear merging of their weights. These merged proxies let you search many data-mixture ratios without retraining. In experiments, DeMix matches or beats training-based proxy search with far less compute (e.g., macro Spearman ρ ≈ 0.81 using 30B-token components vs much larger budgets for training-based proxies) and yields a final pretraining mixture that ranks best on a multi-domain benchmark suite. The authors also release DeMix Corpora (≈22T tokens) and code.
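The weighted linear merging described above can be illustrated with a minimal sketch. It assumes each component model is available as a flat mapping from parameter name to a list of floats (the `StateDict` alias and the function name `merge_components` are hypothetical, not taken from the DeMix code), and it assumes the mixture weights are normalized to sum to 1 before averaging.

```python
from typing import Dict, List, Sequence

# Hypothetical flat layout: parameter name -> list of float values.
StateDict = Dict[str, List[float]]


def merge_components(components: Sequence[StateDict],
                     weights: Sequence[float]) -> StateDict:
    """Return the weighted linear average of the component parameters."""
    if not components:
        raise ValueError("need at least one component")
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: mixture ratios are normalized so they sum to 1.
    norm = [w / total for w in weights]

    merged: StateDict = {}
    for name, first in components[0].items():
        merged[name] = [
            sum(w * comp[name][i] for w, comp in zip(norm, components))
            for i in range(len(first))
        ]
    return merged


# Toy components standing in for per-dataset models.
general = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code = {"layer.weight": [3.0, 6.0], "layer.bias": [4.0]}
proxy = merge_components([general, code], weights=[0.25, 0.75])
```

Each new weight vector yields a new proxy without any further training, which is what makes the search over mixture ratios cheap in this setup.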
Problem Statement
Choosing the right mix of pretraining data is crucial but costly: training many proxy models (one per candidate mix) is computationally expensive, and small, cheap proxies can mislead the mixture search. The paper asks: can we predict how a model trained on any continuous mixture will behave without training that model?
Main Contribution
DeMix: a workflow that trains per-dataset component models and uses weighted model merging to synthesize unlimited training-free proxy models for data-mixture search.
Empirical evidence that merged proxies produce higher ranking consistency with large reference models than conventional small-scale training proxies, enabling cheaper and more reliable mixture search.
Key Findings
Merged proxies give strong ranking agreement with large reference models.
DeMix cuts the training cost of mixture search several-fold at similar proxy fidelity.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Proxy ranking consistency (macro Spearman's ρ) | 0.81 (DeMix with 30B-token components) | 0.53 (training-based proxies at comparable budget) | +0.28 | 96 reference mixtures; general+code+math macro | Table 2; Sec.4.1 | Table 2 |
| Capability Recovery Rate (merged vs reference) | 0.845 (linear merging) | 0.77–0.85 range for other merging methods | best in ablation | 96 mixtures; mean macro | Table 4; Sec.4.3.1 | Table 4 |
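For readers who want to reproduce the ranking-consistency metric in the table, the following is a rough sketch of Spearman's ρ with average ranks for ties. It assumes that "macro" means an unweighted mean of per-domain ρ values (for example over general, code, and math); that reading, and the function names, are assumptions rather than the paper's stated procedure.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5


def macro_spearman(proxy: Dict[str, Sequence[float]],
                   reference: Dict[str, Sequence[float]]) -> float:
    """Assumed macro average: unweighted mean of per-domain rho."""
    return sum(spearman(proxy[d], reference[d]) for d in reference) / len(reference)
```

Here each sequence would hold one score per candidate mixture, with proxy and reference scores listed in the same mixture order.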
What To Try In 7 Days
Train a small set of component models (one per candidate dataset) from a shared base and merge them using linear weighted averaging to build cheap proxies.
Run the LightGBM predictor on merged-proxy evaluations to shortlist top mixture ratios, then train one final model on the selected mixture.
Test the effect of including ~50% general data in each component to preserve broad capabilities while specializing on domain data.
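The shortlisting step above can be sketched as a search over a grid of mixture ratios. This is only an illustration: the source uses a LightGBM predictor fitted on merged-proxy evaluations, whereas the sketch below swaps in a hypothetical `toy_score` function as a stand-in, and the grid resolution and function names are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Mixture = Tuple[float, ...]


def simplex_grid(num_datasets: int, steps: int) -> List[Mixture]:
    """All mixtures whose ratios are multiples of 1/steps and sum to 1."""
    slots = steps + num_datasets - 1
    grid: List[Mixture] = []
    # Stars and bars: pick divider positions, gaps between them are counts.
    for bars in combinations(range(slots), num_datasets - 1):
        prev = -1
        counts = []
        for b in bars + (slots,):
            counts.append(b - prev - 1)
            prev = b
        grid.append(tuple(c / steps for c in counts))
    return grid


def shortlist(candidates: List[Mixture],
              score_fn: Callable[[Mixture], float],
              top_k: int) -> List[Mixture]:
    """Keep the top_k mixtures with the highest predicted score."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]


def toy_score(mixture: Mixture) -> float:
    # Hypothetical stand-in for a predictor trained on proxy evaluations:
    # it simply prefers mixtures close to an arbitrary target ratio.
    target = (0.5, 0.25, 0.25)
    return -sum((r - t) ** 2 for r, t in zip(mixture, target))


candidates = simplex_grid(num_datasets=3, steps=4)
best = shortlist(candidates, toy_score, top_k=1)
```

In a real run, `score_fn` would wrap whatever predictor is fitted on the merged-proxy benchmark results, and only the shortlisted mixtures would go on to full training.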
Risks & Boundaries
Limitations
Model merging relies on small parameter updates; it breaks down if component updates are large.
All component models must share the same architecture and base initialization.
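Both limitations can be checked before merging. The sketch below is one possible pre-merge check, assuming parameters are stored as flat name-to-list mappings (a hypothetical layout): it verifies that a component has the same parameter names and sizes as the shared base, and reports the relative size of the update. The source gives no numeric threshold for what counts as a "small" update, so none is hard-coded here.

```python
import math
from typing import Dict, List

# Hypothetical flat layout: parameter name -> list of float values.
StateDict = Dict[str, List[float]]


def check_mergeable(base: StateDict, component: StateDict) -> None:
    """Raise ValueError if the parameter names or sizes differ from the base."""
    if base.keys() != component.keys():
        raise ValueError("parameter names differ from the base model")
    for name in base:
        if len(base[name]) != len(component[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")


def relative_update_norm(base: StateDict, component: StateDict) -> float:
    """L2 norm of (component - base) divided by the L2 norm of base."""
    check_mergeable(base, component)
    delta_sq = sum((c - b) ** 2
                   for name in base
                   for b, c in zip(base[name], component[name]))
    base_sq = sum(b ** 2 for name in base for b in base[name])
    if base_sq == 0:
        raise ValueError("base model has zero norm")
    return math.sqrt(delta_sq / base_sq)


base = {"w": [3.0, 4.0]}
component = {"w": [3.0, 4.5]}
ratio = relative_update_norm(base, component)
```

Comparing this ratio across components gives a quick signal of which ones have drifted furthest from the base and may therefore merge less faithfully.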
When Not To Use
You cannot train component models at a meaningful scale (e.g., you lack the compute budget for them).
Candidate datasets are extremely heterogeneous or require different architectures.
Failure Modes
Merged proxies may misrank rare-domain mixtures or cases where component updates are not small.
Insufficient general-data mixing in components leads to degraded proxy fidelity.

