Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

Overview

Decision Snapshot: Needs Validation

The paper gives actionable, tested training knobs (upcycle rules, gate normalization, adaptive α) and reports a working 146B MoE; experiments are internal and some ablations are small-scale, so apply but validate on your data and budget.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou

Links

Abstract / PDF

Why It Matters For Business

MoE models can cut compute per-token while keeping high capacity; pick upcycling when you have a good dense checkpoint and limited budget, otherwise invest in from-scratch MoE to maximize final quality.

Who Should Care

ML Engineer, Engineering Lead, CTO, Data Scientist

Summary TLDR

This report studies how to train large Mixture-of-Experts (MoE) language models in practice. Key takeaways: (1) whether to upcycle a dense checkpoint or train an MoE from scratch depends on your MoE training budget relative to the dense pretraining cost (rule of thumb: if the MoE budget is ≥ 2× the dense cost, train from scratch); (2) gating logit normalization (normalize gate logits before the softmax, then scale by λ) sharpens routing and lowers training loss and token drop; (3) adaptive per-layer auxiliary-loss coefficients (adjusted by the observed token drop rate) keep load balanced without over-regularizing. The authors upcycled Skywork-13B to Skywork-MoE (146B params, 16 experts, 22B activated params) and report competitive benchmark results.

Problem Statement

Training MoE models raises three practical problems: (1) should you upcycle a dense checkpoint or train MoE from scratch given limited GPU budget? (2) how to make the gating/router actually route tokens to diverse experts? (3) how to tune the auxiliary load-balancing loss per layer so it helps rather than hurts next-token training?

Main Contribution

Empirical guidance on upcycling vs training-from-scratch with simple cost rules of thumb tied to training token budgets

Gating logit normalization: normalize gate logits before softmax and scale by λ to sharpen routing and reduce token drop

Key Findings

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: 100B tokens ≈ 2/3 C, 300B tokens ≈ 2 C (C denotes the dense pretraining cost); scratch outperforms upcycled at the 300B budget

Practical Use: If your MoE budget is ≥ 2× the dense pretraining cost, train MoE from scratch; otherwise upcycle a dense checkpoint to save cost.

Evidence Ref: Section 3.2–3.3, Fig. 1
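The budget rule above can be written as a small helper. This is a minimal sketch, not the authors' code: the function name, the argument names, and the use of a single scalar cost (for example GPU-hours or FLOPs) are illustrative assumptions, and the 2× threshold is the rule of thumb quoted above rather than a guarantee.

```python
def choose_moe_strategy(moe_budget: float, dense_cost: float, threshold: float = 2.0) -> str:
    """Return "from_scratch" or "upcycle" using the 2x rule of thumb.

    moe_budget and dense_cost must use the same unit (e.g. GPU-hours or FLOPs).
    """
    if moe_budget <= 0 or dense_cost <= 0:
        raise ValueError("budgets must be positive")
    ratio = moe_budget / dense_cost
    return "from_scratch" if ratio >= threshold else "upcycle"


# Budgets expressed in units of the dense pretraining cost C.
print(choose_moe_strategy(2.0, 1.0))      # the ~2C case
print(choose_moe_strategy(2.0 / 3, 1.0))  # the ~2/3 C case
```

Budgets close to the threshold are exactly where a short matched comparison on your own data is most informative.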

Gating logit normalization sharpens router outputs and lowers training loss and token drop rates in experiments.

Numbers: Test on 2.5B MoE (16 experts): normalization reduced training loss and token drop (Fig. 2–3); λ=1 chosen

Practical Use: Add logit normalization before gate softmax and start with λ=1 to get crisper routing and lower token drop during MoE training.

Evidence Ref: Section 4.1, Fig. 2–3
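A rough sketch of gating logit normalization for a single token follows. It assumes "normalize" means standardizing the logits to zero mean and unit variance across experts before scaling by λ; the function names, the `eps` term, and the synthetic logits are illustrative, not taken from the paper's code.

```python
import math
import random


def normalized_gate(logits, lam=1.0, eps=1e-6):
    """Standardize one token's router logits, scale by lam, then softmax."""
    n = len(logits)
    mu = sum(logits) / n
    sigma = math.sqrt(sum((z - mu) ** 2 for z in logits) / n)
    scaled = [lam * (z - mu) / (sigma + eps) for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max1_over_max2(probs):
    """Ratio of the largest to the second-largest gate probability."""
    second, first = sorted(probs)[-2:]
    return first / second


random.seed(0)
raw = [0.01 * random.gauss(0.0, 1.0) for _ in range(16)]  # tiny logits, 16 experts
gates = normalized_gate(raw, lam=1.0)
print(round(sum(gates), 6), max1_over_max2(gates))
```

With very small raw logits a plain softmax is close to uniform, which is the situation normalization is meant to avoid; larger λ makes the distribution sharper still.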

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
CEVAL	82.2	Qwen1.5-72B 84.1	-1.9 vs top reported	CEVAL (Chinese)	Table 1 shows Skywork-MoE 82.2 vs Qwen1.5 84.1	Table 1
CMMLU	79.5	Deepseek-V2 84.0	-4.5 vs top reported	CMMLU (Chinese multitask)	Table 1 reports Skywork-MoE 79.5, Deepseek-V2 84.0	Table 1

What To Try In 7 Days

If you have a dense checkpoint, run a short upcycle experiment and a matched small from-scratch run to compare validation loss under your token budget.

Add gating logit normalization in your MoE router: normalize logits, set λ=1, and track Max1/Max2 and token-drop.

Implement per-layer adaptive auxiliary α: monitor token-drop and update α with f(d)=ξd (start ξ=0.2, α_max=0.01, β≈0.99).
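The adaptive α step in the list above might look like the following. This is a hedged sketch: it assumes β is an exponential-moving-average factor and that α_max caps f(d); the function names and the per-layer drop rates are hypothetical.

```python
def target_alpha(drop_rate, xi=0.2, alpha_max=0.01):
    """f(d) = xi * d, capped at alpha_max (the cap is an assumption)."""
    return min(xi * drop_rate, alpha_max)


def update_alpha(alpha, drop_rate, beta=0.99, xi=0.2, alpha_max=0.01):
    """Move alpha toward f(d) with an EMA; treating beta as the EMA factor is an assumption."""
    return beta * alpha + (1.0 - beta) * target_alpha(drop_rate, xi, alpha_max)


# One coefficient per MoE layer, updated from that layer's observed token-drop rate.
alphas = [0.001, 0.001, 0.001]
observed_drop = [0.00, 0.02, 0.30]  # hypothetical per-layer drop rates
for step in range(500):
    alphas = [update_alpha(a, d) for a, d in zip(alphas, observed_drop)]
print(alphas)
```

Layers that drop few tokens drift toward a small α, while layers with heavy drop saturate at α_max, so the balancing pressure is applied only where it is needed.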

Agent Features

Frameworks

SkyworkMegatron, Megatron-LM

Architectures

MoE, Transformer

Optimization Features

Token Efficiency

activated parameters reduce per-token compute (22B activated of 146B)
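As a quick arithmetic check on the figures above, the share of parameters touched per token can be computed directly. The variable names are illustrative, and the ratio is only a rough proxy for per-token compute, since attention cost and routing overhead are not captured by a parameter count.

```python
total_params_b = 146.0      # total parameters, in billions (from the text)
activated_params_b = 22.0   # parameters activated per token, in billions (from the text)

activated_fraction = activated_params_b / total_params_b
print(f"activated fraction: {activated_fraction:.1%}")  # prints: activated fraction: 15.1%
```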

Infra Optimization

12-way pipeline, 4-way tensor-expert parallelism, 32-way data parallelism, ZeRO-1
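The parallelism degrees above imply a device count if they compose multiplicatively, as in standard 3D parallelism. That composition is an assumption here: the text does not state the GPU count, so treat the product as an inference rather than a reported figure. Variable names are illustrative.

```python
pipeline_parallel = 12        # pipeline stages
tensor_expert_parallel = 4    # tensor-expert parallel degree
data_parallel = 32            # data-parallel replicas

# Assumed composition: one device per (pipeline, tensor-expert, data) coordinate.
world_size = pipeline_parallel * tensor_expert_parallel * data_parallel
print(world_size)  # 1536
```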

Model Optimization

MoE, gating logit normalization

System Optimization

communication reduction for expert parallelism, overlap of communication with computation

Training Optimization

upcycling from dense checkpoints, adaptive auxiliary loss coefficients, Expert Data Parallel (EDP), non-uniform pipeline parallelism, kernel fusion and overlapped communication

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Experiments rely on internal datasets and a SkyPile subset; public reproducibility details are limited.

Some experiments are small-scale; effects may vary when scaling or on different data mixes.

When Not To Use

Do not upcycle if your MoE budget is ≥ 2× the dense pretraining cost; train from scratch instead.

Avoid assuming λ=1 is universally optimal; tune before wide deployment.

Failure Modes

Gate entropy can remain high despite normalization if λ is mis-set, causing near-uniform expert averaging and poor specialization.

Adaptive auxiliary loss can over-regularize and hurt next-token accuracy if tied too tightly to token-drop without safeguards.
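One way to watch for the high-entropy failure mode is to log the entropy of each token's gate distribution. The sketch below is an illustrative diagnostic, not something described in the paper: the function names and the example distributions are assumptions.

```python
import math


def gate_entropy(probs):
    """Shannon entropy (in nats) of one token's gate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(probs):
    """Entropy divided by log(num_experts): 1.0 means uniform, 0.0 means one-hot."""
    return gate_entropy(probs) / math.log(len(probs))


uniform = [1.0 / 16] * 16       # router that has not specialized
peaked = [0.85] + [0.01] * 15   # router that strongly prefers one expert
print(normalized_entropy(uniform), normalized_entropy(peaked))
```

A normalized entropy that stays near 1.0 across training suggests the router is averaging experts rather than selecting among them, which is a cue to revisit λ.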

Core Entities

Models

Skywork-MoE, Skywork-13B, Deepseek-V2, Deepseek-67B, Qwen1.5-72B, Llama2-70B, Llama3-70B, Mixtral 8*7B, Mixtral 8*22B

Metrics

CEVAL score, CMMLU score, MMLU score, Accuracy, MATH score, HumanEval pass@k-style score, training loss, token drop rate, Max1/Max2 gate ratio, Model FLOPs Utilization (MFU), throughput (tokens/GPU/s)

Datasets

SkyPile (subset), CEVAL, CMMLU, MMLU, GSM8K, MATH, HumanEval

Benchmarks

CEVAL, CMMLU, MMLU, GSM8K, MATH, HumanEval

Context Entities

Models

Switch Transformer, GLaM, Deepseek-V1

Metrics

same as core

Datasets

Skywork-13B pretraining data (3.2T + extra 2T tokens internal)

Benchmarks

same as core
