Find better pretraining data mixes cheaply by merging component models instead of training many proxies

January 31, 2026 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Links

Abstract / PDF

Why It Matters For Business

DeMix lets teams find better pretraining mixtures with far less compute, lowering infrastructure cost and speeding iteration on model capability trade-offs.

Summary TLDR

DeMix trains a small set of "component" models (one per candidate dataset) and then builds unlimited cheap proxies by weighted linear merging of their weights. These merged proxies let you search many data-mixture ratios without retraining. In experiments, merged proxies built from 30B-token components reach macro Spearman ρ ≈ 0.81 against large reference models, versus 0.53 for training-based proxies at a comparable budget, and the search yields a final pretraining mixture that ranks best on a multi-domain benchmark suite. The authors also release DeMix Corpora (≈22T tokens) and code.

Problem Statement

Choosing the right mix of pretraining data is crucial but costly: training many proxy models (one per candidate mix) is computationally expensive, and small, cheap proxies can mislead the mixture search. The paper asks: can we predict how a model trained on any continuous mixture will behave without training that model?

Main Contribution

DeMix: a workflow that trains per-dataset component models and uses weighted model merging to synthesize unlimited training-free proxy models for data-mixture search.

Empirical evidence that merged proxies produce higher ranking consistency with large reference models than conventional small-scale training proxies, enabling cheaper and more reliable mixture search.

DeMix Corpora: a curated 22T-token pretraining dataset with validated mixture ratios, plus code to reproduce the pipeline.

Key Findings

Merged proxies give strong ranking agreement with large reference models.

Numbers: macro Spearman ρ = 0.81 (DeMix with 30B-token components) vs 0.53 (training-based proxy at similar budget)

DeMix cuts search training cost by roughly 6.4× for similar proxy fidelity.

Numbers: requires ≈211B total tokens vs ≈1344B for similar accuracy (≈6.4× reduction)
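As a quick check on the reduction factor, the arithmetic from the two token budgets quoted above works out as follows (both budgets are approximate, so the ratio is too):

```latex
\[
\frac{1344\,\text{B tokens}}{211\,\text{B tokens}} \approx 6.37 \approx 6.4\times,
\qquad
1 - \frac{211}{1344} \approx 0.84
\]
```

In other words, the DeMix search uses roughly 84% fewer training tokens than the training-based search at similar fidelity.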

Linear weighted merging is simple and effective.

Numbers: linear merging gives ρ = 0.787, capability recovery = 0.845

Including general-domain data in component training is important.

Numbers: 50% general yields ρ = 0.787 vs 0% general yields ρ = 0.652

DeMix yields better final mixed pretraining data in multi-domain tests.

Numbers: best average rank = 24.00 across general, code, and math benchmarks (DeMix row)

Results

Proxy ranking consistency (macro Spearman's ρ)

Value: 0.81 (DeMix with 30B-token components)

Baseline: 0.53 (training-based proxies at comparable budget)
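To make the ranking-consistency metric concrete, here is a minimal sketch of a macro Spearman's ρ in plain Python. It assumes "macro" means an unweighted mean of per-benchmark ρ values, which this summary does not spell out; the benchmark names and scores are hypothetical.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def macro_spearman(proxy: Dict[str, List[float]],
                   reference: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-benchmark rho (assumed meaning of 'macro')."""
    rhos = [spearman(proxy[b], reference[b]) for b in reference]
    return sum(rhos) / len(rhos)


# Hypothetical scores for four candidate mixtures on two benchmarks.
proxy_scores = {"bench_a": [0.31, 0.35, 0.40, 0.38],
                "bench_b": [0.10, 0.12, 0.11, 0.15]}
reference_scores = {"bench_a": [0.52, 0.55, 0.61, 0.63],
                    "bench_b": [0.20, 0.25, 0.22, 0.30]}
print(round(macro_spearman(proxy_scores, reference_scores), 3))  # 0.9
```

Each list holds one score per candidate mixture, in the same order for proxy and reference. A ρ near 1 means the proxy orders the mixtures the same way the reference model does. The sketch does not guard against a benchmark where every mixture gets the same score.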

Capability Recovery Rate (merged vs reference)

Value: 0.845 (linear merging)

Baseline: 0.77–0.85 range for other merging methods

Final multi-domain average rank (lower better)

Value: 24.00 (DeMix, 224 merged proxies)

Baseline: 29.33–36.00 (best RegMix/CLIMB/heuristic rows shown)
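The summary does not say exactly how the average rank is computed. One common reading, sketched here as an assumption, is to rank the compared entries on each benchmark (1 = best) and then average those ranks per entry. The entry names and scores below are hypothetical and are not the paper's numbers.

```python
from typing import Dict


def average_rank(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """scores[benchmark][method] -> mean rank per method (1 = best).

    Higher score is treated as better; ties are broken by dict order,
    which is a simplification.
    """
    totals: Dict[str, float] = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for position, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0.0) + position
    return {m: t / len(scores) for m, t in totals.items()}


# Hypothetical scores for three mixtures on three benchmark groups.
scores = {
    "general": {"mix_a": 0.60, "mix_b": 0.58, "mix_c": 0.55},
    "code": {"mix_a": 0.30, "mix_b": 0.34, "mix_c": 0.28},
    "math": {"mix_a": 0.41, "mix_b": 0.39, "mix_c": 0.44},
}
ranks = average_rank(scores)
print(min(ranks, key=ranks.get))  # mix_a
```

Lower is better under this convention, which matches the "lower better" note on the metric.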

Compute budget for similar proxy quality

Value≈211–212B tokens (DeMix experiment)

Baseline≈1344B tokens (training-based achieving similar ρ)

Who Should Care

What To Try In 7 Days

Train a small set of component models (one per candidate dataset) from a shared base and merge them using linear weighted averaging to build cheap proxies.

Run the LightGBM predictor on merged-proxy evaluations to shortlist top mixture ratios, then train one final model on the selected mixture.

Test the effect of including ~50% general data in each component to preserve broad capabilities while specializing on domain data.
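The shortlist step above can be sketched as follows. This is not the paper's pipeline: a simple k-nearest-neighbor average stands in for the LightGBM predictor so the example needs only the standard library, and the function names, the toy scoring rule, and the `target` mixture are all hypothetical.

```python
import random
from typing import List, Tuple

Mixture = List[float]


def sample_mixture(n_datasets: int, rng: random.Random) -> Mixture:
    """Draw ratios uniformly from the simplex (normalized exponential draws)."""
    draws = [rng.expovariate(1.0) for _ in range(n_datasets)]
    total = sum(draws)
    return [d / total for d in draws]


def knn_predict(query: Mixture, observed: List[Tuple[Mixture, float]],
                k: int = 3) -> float:
    """Stand-in predictor: mean score of the k nearest evaluated mixtures."""
    def sq_dist(item: Tuple[Mixture, float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(query, item[0]))
    nearest = sorted(observed, key=sq_dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)


def shortlist(observed: List[Tuple[Mixture, float]], n_candidates: int = 1000,
              top_k: int = 5, seed: int = 0) -> List[Mixture]:
    """Score many unseen mixtures with the predictor and keep the best top_k."""
    rng = random.Random(seed)
    n_datasets = len(observed[0][0])
    candidates = [sample_mixture(n_datasets, rng) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=lambda m: knn_predict(m, observed),
                    reverse=True)
    return ranked[:top_k]


# Fake "merged-proxy evaluations": scores fall off with distance from a
# hypothetical ideal mixture. Real scores would come from benchmark runs.
rng = random.Random(1)
target = [0.5, 0.3, 0.2]
observed = []
for _ in range(50):
    m = sample_mixture(3, rng)
    observed.append((m, 1.0 - sum((a - b) ** 2 for a, b in zip(m, target))))

best = shortlist(observed, n_candidates=200, top_k=5)
for mixture in best:
    print([round(r, 3) for r in mixture])
```

In a real run, `observed` would hold the mixture ratios of each merged proxy paired with its measured benchmark score, and a gradient-boosted regressor such as LightGBM would replace `knn_predict`.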

Optimization Features

Token Efficiency

  • search without extra training token cost

Infra Optimization

  • fewer large-scale training runs; less GPU-hour consumption

Model Optimization

  • weighted model merging to synthesize proxies
  • linear averaging of weight deltas
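A minimal sketch of linear averaging of weight deltas, assuming each component was fine-tuned from the same base checkpoint and that the mixture ratios are normalized to sum to 1. Plain Python lists stand in for tensors, and the parameter name and values are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_linear(base: Weights, components: List[Weights],
                 ratios: List[float]) -> Weights:
    """Return base + sum_i ratios[i] * (components[i] - base), per parameter."""
    if len(components) != len(ratios):
        raise ValueError("need one ratio per component model")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    norm = [r / total for r in ratios]  # assumption: ratios normalized to 1

    merged: Weights = {}
    for name, base_vals in base.items():
        out = list(base_vals)
        for comp, w in zip(components, norm):
            comp_vals = comp[name]  # shared architecture: same keys and shapes
            for j, (c, b) in enumerate(zip(comp_vals, base_vals)):
                out[j] += w * (c - b)  # add the weighted delta from the base
        merged[name] = out
    return merged


# Hypothetical one-parameter "models": a base and two domain components.
base = {"layer.w": [1.0, 2.0]}
code_model = {"layer.w": [1.2, 2.0]}
math_model = {"layer.w": [1.0, 2.4]}

# Proxy for a 25% code / 75% math mixture; values are approximately [1.05, 2.3].
proxy = merge_linear(base, [code_model, math_model], [0.25, 0.75])
print(proxy)
```

Because merging is only arithmetic on stored weights, each new ratio vector yields a new proxy without any training, which is what makes the proxy pool effectively unlimited.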

Training Optimization

  • decouple mixture search from proxy training
  • train component models once and reuse

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model merging relies on small parameter updates and breaks down if component updates are large.
  • All component models must share the same architecture and base initialization.
  • Search is limited to the span of provided candidate datasets; you cannot discover novel data sources this way.
  • Too many proxies or overly large proxy pools can add noise and cause overfitting.

When Not To Use

  • You lack the compute to train component models at a meaningful scale.
  • Candidate datasets are extremely heterogeneous or require different architectures.
  • Early-stage pretraining where the model lacks base capabilities to absorb domain signals.

Failure Modes

  • Merged proxies can misrank rare-domain mixtures or cases where component updates are not small.
  • Insufficient general-data mixing in components leads to degraded proxy fidelity.
  • Excessive proxy count can decrease final mixture quality via noise/overfitting.

Core Entities

Models

  • Qwen3-1.7B
  • Qwen3-0.6B
  • Qwen3-235B-A22B

Metrics

  • Spearman's ρ
  • Capability Recovery Rate
  • Benchmark score
  • Average rank

Datasets

  • DeMix Corpora
  • FineWeb-Edu
  • DCLM-baseline
  • DOLMA-v1.7
  • Nemotron-Pretrain
  • SmolLM-Corpus

Benchmarks

  • ARC-E
  • HellaSwag
  • WinoGrande
  • PIQA
  • SIQA
  • HumanEval
  • MBPP
  • GSM8K
  • MATH
  • OpenCompass

Context Entities

Models

  • Merge methods: Multi-SLERP, Breadcrumbs, DARE, DELLA, TIES

Metrics

  • Top-25% Spearman's ρ

Datasets

  • FineMath, MegaMath, OpenCoder

Benchmarks

  • OpenCompass evaluation suite