Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

June 3, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper gives actionable, tested training knobs (upcycle rules, gate normalization, adaptive α) and reports a working 146B MoE; experiments are internal and some ablations are small-scale, so apply but validate on your data and budget.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

Links

Abstract / PDF

Why It Matters For Business

MoE models can cut compute per-token while keeping high capacity; pick upcycling when you have a good dense checkpoint and limited budget, otherwise invest in from-scratch MoE to maximize final quality.

Who Should Care

Summary TLDR

This report studies how to train large Mixture-of-Experts (MoE) language models in practice. Key takeaways: (1) whether to upcycle a dense checkpoint or train an MoE from scratch depends on your MoE training budget relative to the dense pretraining cost (rule of thumb: if the MoE budget is ≥ 2× the dense cost, train from scratch); (2) gating logit normalization (normalize the gate logits before the softmax, then scale by λ) sharpens routing and lowers both training loss and token drop; (3) adaptive per-layer auxiliary-loss coefficients (adjusted by the observed token drop rate) keep the load balanced without over-regularizing. The authors upcycled Skywork-13B to Skywork-MoE (146B params, 16 experts, 22B activated params) and report competitive

Problem Statement

Training MoE models raises three practical problems: (1) should you upcycle a dense checkpoint or train MoE from scratch given limited GPU budget? (2) how to make the gating/router actually route tokens to diverse experts? (3) how to tune the auxiliary load-balancing loss per layer so it helps rather than hurts next-token training?

Main Contribution

Empirical guidance on upcycling vs training-from-scratch with simple cost rules of thumb tied to training token budgets

Gating logit normalization: normalize gate logits before softmax and scale by λ to sharpen routing and reduce token drop

Key Findings

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: 100B tokens ≈ 2/3 C, 300B tokens ≈ 2 C; scratch outperforms upcycled at 300B budget

Practical Use: If your MoE budget is >= 2× the dense pretraining cost, train MoE from scratch; otherwise upcycle a dense checkpoint to save cost.

Evidence Ref: Section 3.2–3.3, Fig. 1
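The budget rule above can be written as a small helper. This is a minimal sketch, not the authors' code: the function name `choose_moe_strategy` and the `threshold` parameter are hypothetical, and both budgets are assumed to be expressed in the same unit (FLOPs, GPU-hours, or token-equivalents).

```python
def choose_moe_strategy(moe_budget: float, dense_cost: float,
                        threshold: float = 2.0) -> str:
    """Return "from_scratch" or "upcycle" from the budget-ratio rule of thumb.

    moe_budget and dense_cost must share a unit. threshold=2.0 mirrors the
    "2x dense cost" rule stated in the summary; it is a heuristic, not a law.
    """
    if dense_cost <= 0:
        raise ValueError("dense_cost must be positive")
    ratio = moe_budget / dense_cost
    return "from_scratch" if ratio >= threshold else "upcycle"


# Illustration using the summary's numbers: C is taken as 150B MoE training
# tokens, so that 100B tokens is about 2/3 C and 300B tokens is about 2 C.
DENSE_COST_TOKENS_B = 150.0
large_budget_choice = choose_moe_strategy(300.0, DENSE_COST_TOKENS_B)
small_budget_choice = choose_moe_strategy(100.0, DENSE_COST_TOKENS_B)
```

Under these assumptions the 300B-token budget maps to from-scratch training and the 100B-token budget maps to upcycling. The threshold is worth re-checking on your own data, since the reported experiments are small-scale.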

Gating logit normalization sharpens router outputs and lowers training loss and token drop rates in experiments.

Numbers: Test on 2.5B MoE (16 experts): normalization reduced training loss and token drop (Fig. 2–3); λ=1 chosen

Practical Use: Add logit normalization before gate softmax and start with λ=1 to get crisper routing and lower token drop during MoE training.

Evidence Ref: Section 4.1, Fig. 2–3
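A minimal pure-Python sketch of gating logit normalization for a single token follows. It assumes the normalization is a standardization of the logits (subtract the mean, divide by the standard deviation) followed by scaling with λ; the use of the population standard deviation, the `eps` guard, and all function names are illustrative assumptions rather than details confirmed by the summary.

```python
import math
from statistics import fmean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_gate(logits, lam=1.0, eps=1e-6):
    """Standardize gate logits, scale by lam, then apply softmax.

    eps avoids division by zero when all logits are equal (assumption).
    """
    mu = fmean(logits)
    sigma = pstdev(logits)
    scaled = [lam * (z - mu) / (sigma + eps) for z in logits]
    return softmax(scaled)


def max1_max2_ratio(probs):
    """Ratio of the largest to the second-largest gate probability."""
    top = sorted(probs, reverse=True)
    return top[0] / top[1]


# Logits with a small spread give a nearly uniform gate without normalization.
raw = [0.10, 0.05, 0.00, -0.05]
ratio_plain = max1_max2_ratio(softmax(raw))
ratio_normalized = max1_max2_ratio(normalized_gate(raw, lam=1.0))
```

In this toy case the Max1/Max2 ratio is larger after normalization because the raw logits have a spread well below λ. If the raw spread already exceeds λ, the same transform flattens the distribution instead, which is one reason to tune λ rather than assume λ=1.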

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
CEVAL | 82.2 | Qwen1.5-72B 84.1 | -1.9 vs top reported | CEVAL (Chinese) | Table 1 shows Skywork-MoE 82.2 vs Qwen1.5 84.1 | Table 1
CMMLU | 79.5 | Deepseek-V2 84.0 | -4.5 vs top reported | CMMLU (Chinese multitask) | Table 1 reports Skywork-MoE 79.5, Deepseek-V2 84.0 | Table 1

What To Try In 7 Days

If you have a dense checkpoint, run a short upcycle experiment and a matched small from-scratch run to compare validation loss under your token budget.

Add gating logit normalization in your MoE router: normalize logits, set λ=1, and track Max1/Max2 and token-drop.

Implement per-layer adaptive auxiliary α: monitor token-drop and update α with f(d)=ξd (start ξ=0.2, α_max=0.01, β≈0.99).
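The adaptive α step in the last item can be sketched as follows. The summary only gives f(d)=ξd with ξ, α_max and β; this sketch assumes that α_max acts as a cap on f(d) and that β is the decay of an exponential moving average over training steps. Both readings, and all function names, are assumptions to verify against the paper.

```python
def target_alpha(drop_rate, xi=0.2, alpha_max=0.01):
    """f(d) = xi * d, capped at alpha_max (the cap is an assumption)."""
    return min(xi * drop_rate, alpha_max)


def update_alpha(alpha, drop_rate, beta=0.99, xi=0.2, alpha_max=0.01):
    """One moving-average step toward the drop-rate-dependent target."""
    return beta * alpha + (1.0 - beta) * target_alpha(drop_rate, xi, alpha_max)


def update_layer_alphas(alphas, drop_rates, **kwargs):
    """Apply the update independently to each MoE layer."""
    if len(alphas) != len(drop_rates):
        raise ValueError("need one drop rate per layer")
    return [update_alpha(a, d, **kwargs) for a, d in zip(alphas, drop_rates)]


# Hypothetical per-layer state: layer 0 drops many tokens, layer 1 drops none.
alphas = [0.0, 0.01]
drop_rates = [0.5, 0.0]
new_alphas = update_layer_alphas(alphas, drop_rates)
```

With these defaults a layer that drops many tokens sees its coefficient rise slowly toward the cap, while a well-balanced layer sees its coefficient decay, so the auxiliary loss is relaxed where it is not needed.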

Agent Features

Frameworks
SkyworkMegatron, Megatron-LM
Architectures
MoE, Transformer

Optimization Features

Token Efficiency
activated parameters reduce per-token compute (22B activated of 146B)
Infra Optimization
12-way pipeline, 4-way tensor-expert parallelism, 32-way data parallelism, ZeRO-1
Model Optimization
MoE, gating logit normalization
System Optimization
communication reduction for expert parallelism, overlapping communication with computation
Training Optimization
upcycling from dense checkpoints, adaptive auxiliary loss coefficients, Expert Data Parallel (EDP), non-uniform pipeline parallelism, kernel fusion and overlapped communication
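Two of the figures listed above lend themselves to a quick arithmetic check. The sketch below computes the share of parameters activated per token and the device count implied by the parallelism layout; the second calculation assumes the three parallel degrees multiply to give the number of devices, which is the usual convention in Megatron-style training but is not stated in this summary.

```python
TOTAL_PARAMS_B = 146      # total parameters, in billions
ACTIVATED_PARAMS_B = 22   # parameters activated per token, in billions

# Share of the model that participates in each token's forward pass.
activated_fraction = ACTIVATED_PARAMS_B / TOTAL_PARAMS_B

PIPELINE_PARALLEL = 12
TENSOR_EXPERT_PARALLEL = 4
DATA_PARALLEL = 32

# Assumption: device count is the product of the parallel degrees.
implied_devices = PIPELINE_PARALLEL * TENSOR_EXPERT_PARALLEL * DATA_PARALLEL

print(f"activated fraction: {activated_fraction:.1%}")
print(f"implied devices: {implied_devices}")
```

Roughly 15% of the parameters are active per token, and the layout implies 1536 devices under the stated assumption.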

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments rely on internal datasets and a SkyPile subset; public reproducibility details are limited.

Some experiments are small-scale; effects may vary when scaling or on different data mixes.

When Not To Use

Do not upcycle if your MoE budget is >= 2× the dense pretraining cost; train from scratch instead.

Avoid assuming λ=1 is universally optimal; tune before wide deployment.

Failure Modes

Gate entropy remains high despite normalization if λ is mis-set, causing near-uniform expert averaging and poor specialization.

Adaptive auxiliary loss can over-regularize and hurt next-token accuracy if tied too tightly to token-drop without safeguards.

Core Entities

Models

Skywork-MoE, Skywork-13B, Deepseek-V2, Deepseek-67B, Qwen1.5-72B, Llama2-70B, Llama3-70B, Mixtral 8*7B, Mixtral 8*22B

Metrics

CEVAL score, CMMLU score, MMLU score, Accuracy, MATH score, HumanEval pass@k-style score, training loss, token drop rate, Max1/Max2 gate ratio, Model FLOPs Utilization (MFU), throughput (tokens/GPU/s)

Datasets

SkyPile (subset), CEVAL, CMMLU, MMLU, GSM8K, MATH, HumanEval

Benchmarks

CEVAL, CMMLU, MMLU, GSM8K, MATH, HumanEval

Context Entities

Models

Switch Transformer, GLaM, Deepseek-V1

Metrics

same as core

Datasets

Skywork-13B pretraining data (3.2T + extra 2T tokens internal)

Benchmarks

same as core