Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

June 3, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

3

Authors

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

Links

Abstract / PDF

Why It Matters For Business

MoE models can cut per-token compute while keeping high capacity. Pick upcycling when you have a good dense checkpoint and a limited budget; otherwise invest in from-scratch MoE training to maximize final quality.

Summary TLDR

This report studies how to train large Mixture-of-Experts (MoE) language models in practice. Key takeaways: (1) whether to upcycle a dense checkpoint or train MoE from scratch depends on your MoE training budget relative to the dense pretraining cost (rule: if MoE budget ≥ 2× dense cost, train from scratch); (2) gating logit normalization (normalize gate logits before softmax, scale by λ) sharpens routing and lowers training loss and token drop; (3) adaptive per-layer auxiliary-loss coefficients (adjusted by the observed token drop rate) keep load balanced without over-regularizing. The authors upcycled Skywork-13B to Skywork-MoE (146B params, 16 experts, 22B activated params) and report competitive results on standard benchmarks.

Problem Statement

Training MoE models raises three practical problems: (1) should you upcycle a dense checkpoint or train MoE from scratch given limited GPU budget? (2) how to make the gating/router actually route tokens to diverse experts? (3) how to tune the auxiliary load-balancing loss per layer so it helps rather than hurts next-token training?

Main Contribution

Empirical guidance on upcycling vs training-from-scratch with simple cost rules of thumb tied to training token budgets

Gating logit normalization: normalize gate logits before softmax and scale by λ to sharpen routing and reduce token drop

Adaptive auxiliary-loss coefficients: per-layer, token-drop-driven update rule to tune load-balance regularization

Built and evaluated Skywork-MoE (146B params, 16 experts) upcycled from Skywork-13B and compared it to open models

Engineering: Expert Data Parallel (EDP) and non-uniform pipeline splits to improve MoE training efficiency

Key Findings

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: with C the dense pretraining cost, 100B tokens ≈ 2/3 C and 300B tokens ≈ 2 C; from-scratch outperforms upcycled at the 300B budget
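The budget rule can be sketched as a small helper. This is an illustrative reading of the rule of thumb above, not code from the report: the function name, the string labels, and the `threshold` parameter are assumptions, and C denotes the dense pretraining cost.

```python
def choose_moe_strategy(moe_budget: float, dense_cost: float,
                        threshold: float = 2.0) -> str:
    """Pick a training strategy from the MoE budget expressed relative to C.

    moe_budget and dense_cost must use the same unit (FLOPs, GPU-hours, ...).
    threshold=2.0 encodes the "MoE budget >= 2x dense cost" rule of thumb.
    """
    if dense_cost <= 0:
        raise ValueError("dense_cost must be positive")
    ratio = moe_budget / dense_cost
    return "from_scratch" if ratio >= threshold else "upcycle"


# The two budgets quoted above, in units of C (the dense pretraining cost).
print(choose_moe_strategy(2 / 3, 1.0))  # 100B-token budget
print(choose_moe_strategy(2.0, 1.0))    # 300B-token budget
```

Treat the threshold as a starting point: the report derives it from its own experiments, so a different model size or data mix may shift it.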

Gating logit normalization sharpens router outputs and lowers training loss and token drop rates in experiments.

Numbers: Test on 2.5B MoE (16 experts): normalization reduced training loss and token drop (Fig. 2–3); λ=1 chosen
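A minimal NumPy sketch of gating logit normalization follows. It assumes that "normalize" means standardizing each token's logits across experts (zero mean, unit standard deviation) before scaling by λ and applying softmax, and that Max1/Max2 is the ratio of the largest to the second-largest gate probability. The function names and the `eps` guard are illustrative.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the max for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def normalized_gate(logits, lam=1.0, eps=1e-6):
    """Standardize logits across experts, scale by lam, then softmax.

    Assumption: normalization is per token over the expert axis.
    """
    mu = logits.mean(axis=-1, keepdims=True)
    sigma = logits.std(axis=-1, keepdims=True)
    return softmax(lam * (logits - mu) / (sigma + eps), axis=-1)


def max1_max2_ratio(probs):
    # Ratio of the largest to the second-largest gate probability per token.
    top2 = np.sort(probs, axis=-1)[..., -2:]
    return top2[..., 1] / top2[..., 0]


rng = np.random.default_rng(0)
# Small-magnitude logits give a nearly uniform gate under plain softmax.
logits = 0.05 * rng.standard_normal((4, 16))
plain = softmax(logits)
normed = normalized_gate(logits, lam=1.0)
print("Max1/Max2 plain:     ", max1_max2_ratio(plain).mean())
print("Max1/Max2 normalized:", max1_max2_ratio(normed).mean())
```

With λ fixed, the spread of the softmax input no longer depends on the raw logit scale, which is why the normalized gate is sharper here than the plain one.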

Adaptive per-layer auxiliary coefficients follow each layer's token drop rate, keeping load balanced without over-regularizing.

Numbers: Used ξ=1/5, α_max=0.01, β=0.99 and maintained desirable token-drop and coefficient curves (Fig. 4)
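The update rule can be sketched as follows, using the quoted values ξ=1/5, α_max=0.01, β=0.99. Two points are assumptions rather than statements from this summary: that the target f(d)=ξd is capped at α_max, and that β acts as an exponential moving-average factor. The function names and the sample drop rates are hypothetical.

```python
def target_coefficient(drop_rate, xi=0.2, alpha_max=0.01):
    # f(d) = xi * d, capped at alpha_max (the cap is an assumption).
    return min(xi * drop_rate, alpha_max)


def update_coefficient(alpha, drop_rate, xi=0.2, alpha_max=0.01, beta=0.99):
    # Moving-average update towards the target (assumed role of beta).
    target = target_coefficient(drop_rate, xi, alpha_max)
    return beta * alpha + (1.0 - beta) * target


# One coefficient per MoE layer, each driven by that layer's token drop rate.
alphas = [0.0, 0.0, 0.0]
observed_drop = [0.00, 0.02, 0.30]  # hypothetical per-layer drop rates
for _ in range(500):
    alphas = [update_coefficient(a, d) for a, d in zip(alphas, observed_drop)]

for layer, (a, d) in enumerate(zip(alphas, observed_drop)):
    print(f"layer {layer}: drop={d:.2f} alpha={a:.5f}")
```

A layer that drops no tokens keeps a coefficient of zero, while a layer with heavy dropping saturates near α_max, so regularization is only applied where load is unbalanced.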

Skywork-MoE (146B, 16 experts, 22B activated params) is competitive on standard benchmarks.

Numbers: CEVAL 82.2, CMMLU 79.5, MMLU 77.4, GSM8K 76.1, MATH 31.9, HumanEval 43.9 (Table 1)

Two investigated 'fixes' gave little or mixed benefit: expert-layer learning-rate scaling and pretraining expert-specialized checkpoints.

Numbers: Expert specialization gave <0.01 loss advantage after 90B tokens; LR-scaling curves converged by 310B (C.1–C.2)

Results

CEVAL

Value: 82.2

Baseline: Qwen1.5-72B 84.1

CMMLU

Value: 79.5

Baseline: Deepseek-V2 84.0

MMLU

Value: 77.4

Baseline: Qwen1.5-72B 77.5

GSM8K

Value: 76.1

Baseline: Deepseek-V2 79.2

MATH

Value: 31.9

Baseline: Deepseek-V2 43.6

HumanEval

Value: 43.9

Baseline: DBRX-Instruct 70.1

Who Should Care

What To Try In 7 Days

If you have a dense checkpoint, run a short upcycle experiment and a matched small from-scratch run to compare validation loss under your token budget.

Add gating logit normalization in your MoE router: normalize logits, set λ=1, and track Max1/Max2 and token-drop.

Implement per-layer adaptive auxiliary α: monitor token-drop and update α with f(d)=ξd (start ξ=0.2, α_max=0.01, β≈0.99).
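Both the router check and the adaptive α need a per-layer token-drop measurement. The sketch below shows one way to compute it, assuming a fixed per-expert capacity of `capacity_factor * assignments / num_experts` (a common convention, not something this summary specifies) and defining the drop rate as the share of routed assignments that exceed capacity. The function and argument names are hypothetical.

```python
import numpy as np


def token_drop_rate(expert_ids, num_experts, capacity_factor=1.0):
    """Fraction of routed assignments that overflow expert capacity.

    expert_ids: int array of shape (tokens, k) with the experts chosen
    for each token by top-k routing.
    """
    n_assign = expert_ids.size
    # Assumed capacity rule: equal share of assignments times a slack factor.
    capacity = int(np.ceil(capacity_factor * n_assign / num_experts))
    counts = np.bincount(expert_ids.ravel(), minlength=num_experts)
    dropped = np.clip(counts - capacity, 0, None).sum()
    return dropped / n_assign


# Balanced top-2 routing over 16 experts: every expert receives 2 assignments.
balanced = np.arange(32).reshape(16, 2) % 16
# Collapsed routing: every token picks experts 0 and 1.
collapsed = np.tile(np.array([0, 1]), (16, 1))

print(token_drop_rate(balanced, num_experts=16))
print(token_drop_rate(collapsed, num_experts=16))
```

Logging this value per layer and per step gives the signal d that the adaptive coefficient rule consumes.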

Agent Features

Frameworks

  • SkyworkMegatron
  • Megatron-LM

Architectures

  • MoE
  • Transformer

Optimization Features

Token Efficiency

  • activated parameters reduce per-token compute (22B activated of 146B)
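As a quick check of the figures above, the activated share can be computed directly. Using it as a proxy for per-token compute is an approximation that ignores router and attention-over-context costs; the helper name is illustrative.

```python
def activated_fraction(active_params: float, total_params: float) -> float:
    # Share of the model's parameters touched by each token.
    return active_params / total_params


frac = activated_fraction(22e9, 146e9)
print(f"{frac:.1%} of parameters are active per token")
```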

Infra Optimization

  • 12-way pipeline, 4-way tensor-expert parallelism, 32-way data parallelism, ZeRO-1
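The parallel degrees above imply a device count if they compose multiplicatively, as is usual in Megatron-style layouts. That composition is an assumption here, since this summary does not state the GPU count, and the dictionary keys are illustrative labels.

```python
from math import prod

# Parallel degrees quoted above; assumed to multiply into the device count.
layout = {"pipeline": 12, "tensor_expert": 4, "data": 32}

world_size = prod(layout.values())
gpus_per_replica = layout["pipeline"] * layout["tensor_expert"]

print(f"GPUs per model replica: {gpus_per_replica}")
print(f"Implied total GPUs:     {world_size}")
```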

Model Optimization

  • MoE
  • gating logit normalization

System Optimization

  • communication reduction for expert parallelism
  • overlap comm with computation

Training Optimization

  • upcycling from dense checkpoints
  • adaptive auxiliary loss coefficients
  • Expert Data Parallel (EDP)
  • non-uniform pipeline parallelism
  • kernel fusion and overlapped communication

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments rely on internal datasets and a SkyPile subset; public reproducibility details are limited.
  • Some experiments are small-scale; effects may vary when scaling or on different data mixes.
  • Gating normalization and hyperparameters (λ, ξ, α_max) may need retuning per model and dataset.
  • Benchmark comparisons cover selected open models; not a full sweep against all large MoEs.

When Not To Use

  • Do not upcycle if your MoE budget is ≥ 2× the dense pretraining cost; train from scratch instead.
  • Avoid assuming λ=1 is universally optimal; tune before wide deployment.
  • Avoid investing in expert-specialized pretraining unless you can justify the extra compute with clear gains.

Failure Modes

  • Gate entropy remains high despite normalization if λ is mis-set, causing near-uniform expert averaging and poor specialization.
  • Adaptive auxiliary loss can over-regularize and hurt next-token accuracy if tied too tightly to token-drop without safeguards.
  • Large communication overheads or suboptimal parallel configs can reduce GPU utilization and increase wall time.
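The first failure mode can be watched for with a gate-entropy monitor. The sketch below reports mean entropy divided by log(number of experts), so 1.0 means a uniform gate and values near 0 mean sharp routing. The function names and the example distributions are illustrative; any alert threshold is something to calibrate on your own runs.

```python
import numpy as np


def mean_gate_entropy(probs, eps=1e-12):
    # Shannon entropy (nats) of each token's gate distribution, averaged.
    return float(-(probs * np.log(probs + eps)).sum(axis=-1).mean())


def normalized_gate_entropy(probs):
    # Scale by the maximum possible entropy, log(num_experts).
    return mean_gate_entropy(probs) / float(np.log(probs.shape[-1]))


num_experts = 16
uniform = np.full((4, num_experts), 1.0 / num_experts)

# Sharp gate: 0.97 on one expert, the rest spread over the other 15.
peaked = np.full((4, num_experts), 0.002)
peaked[:, 0] = 0.97

print("uniform gate:", normalized_gate_entropy(uniform))
print("peaked gate: ", normalized_gate_entropy(peaked))
```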

Core Entities

Models

  • Skywork-MoE
  • Skywork-13B
  • Deepseek-V2
  • Deepseek-67B
  • Qwen1.5-72B
  • Llama2-70B
  • Llama3-70B
  • Mixtral 8x7B
  • Mixtral 8x22B

Metrics

  • CEVAL score
  • CMMLU score
  • MMLU score
  • Accuracy
  • MATH score
  • HumanEval pass@k-style score
  • training loss
  • token drop rate
  • Max1/Max2 gate ratio
  • Model Floating-point Utilization (MFU)
  • throughput (tokens/GPU/s)

Datasets

  • SkyPile (subset)
  • CEVAL
  • CMMLU
  • MMLU
  • GSM8K
  • MATH
  • HumanEval

Benchmarks

  • CEVAL
  • CMMLU
  • MMLU
  • GSM8K
  • MATH
  • HumanEval

Context Entities

Models

  • Switch Transformer
  • GLaM
  • Deepseek-V1

Metrics

  • same as core

Datasets

  • Skywork-13B pretraining data (3.2T + extra 2T tokens internal)

Benchmarks

  • same as core