Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
3
Why It Matters For Business
MoE models can cut compute per-token while keeping high capacity; pick upcycling when you have a good dense checkpoint and limited budget, otherwise invest in from-scratch MoE to maximize final quality.
Summary TLDR
This report studies how to train large Mixture-of-Experts (MoE) language models in practice. Key takeaways: (1) whether to upcycle a dense checkpoint or train MoE from scratch depends on your MoE training budget relative to the dense pretraining cost (rule: if MoE budget ≥ 2× dense cost, train from scratch); (2) gating logit normalization (normalize the gate logits before the softmax, then scale by λ) sharpens routing and lowers both training loss and token drop; (3) adaptive per-layer auxiliary-loss coefficients (adjusted by the observed token drop rate) keep load balanced without over-regularizing. The authors upcycled Skywork-13B to Skywork-MoE (146B params, 16 experts, 22B activated params) and report competitive results on standard benchmarks.
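The budget rule of thumb above can be written down as a tiny helper. This is only a sketch: the function name, the `has_dense_checkpoint` flag, and the choice of unit are illustrative assumptions, and the 2× threshold is the summary's heuristic rather than a hard boundary.

```python
def choose_moe_strategy(moe_budget, dense_cost, has_dense_checkpoint=True, threshold=2.0):
    """Pick 'upcycle' or 'from_scratch' using the 2x rule of thumb.

    moe_budget and dense_cost must use the same unit (for example
    training tokens or GPU-hours). All names here are hypothetical.
    """
    if dense_cost <= 0:
        raise ValueError("dense_cost must be positive")
    if not has_dense_checkpoint:
        # Nothing to upcycle from, so from-scratch is the only option.
        return "from_scratch"
    ratio = moe_budget / dense_cost
    return "from_scratch" if ratio >= threshold else "upcycle"


# Example: a budget equal to half the dense pretraining cost.
decision = choose_moe_strategy(moe_budget=0.5, dense_cost=1.0)
```

Near the threshold the rule is a starting point at best; a short matched comparison run is the safer way to decide.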
Problem Statement
Training MoE models raises three practical problems: (1) should you upcycle a dense checkpoint or train MoE from scratch given limited GPU budget? (2) how to make the gating/router actually route tokens to diverse experts? (3) how to tune the auxiliary load-balancing loss per layer so it helps rather than hurts next-token training?
Main Contribution
Empirical guidance on upcycling vs training-from-scratch with simple cost rules of thumb tied to training token budgets
Gating logit normalization: normalize gate logits before softmax and scale by λ to sharpen routing and reduce token drop
Adaptive auxiliary-loss coefficients: per-layer, token-drop-driven update rule to tune load-balance regularization
Built and evaluated Skywork-MoE (146B params, 16 experts) upcycled from Skywork-13B and compared it to open models
Engineering: Expert Data Parallel (EDP) and non-uniform pipeline splits to improve MoE training efficiency
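A minimal sketch of gating logit normalization for a single token, assuming the normalization is a standardization (subtract the mean, divide by the standard deviation across experts) followed by scaling with λ. The helper names, the `eps` guard, and the example logits are illustrative assumptions, not the authors' code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_gate_logits(logits, lam=1.0, eps=1e-6):
    """Standardize logits across experts, then scale by lam (λ)."""
    n = len(logits)
    mean = sum(logits) / n
    std = math.sqrt(sum((z - mean) ** 2 for z in logits) / n)
    return [lam * (z - mean) / (std + eps) for z in logits]


def max1_max2_ratio(probs):
    """Ratio of the largest to the second-largest gate probability."""
    first, second = sorted(probs, reverse=True)[:2]
    return first / second


# Nearly flat logits: the plain softmax is close to uniform.
logits = [0.10, 0.20, 0.15, 0.12]
plain = softmax(logits)
sharp = softmax(normalize_gate_logits(logits, lam=1.0))
```

With these inputs the normalized gate concentrates more probability on the top expert than the plain softmax does, and a larger λ sharpens the distribution further.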
Key Findings
Upcycling vs. from-scratch depends on training budget: when the MoE budget reaches about 2× the dense pretraining cost, from-scratch MoE wins; with smaller budgets, upcycling helps.
Gating logit normalization sharpens router outputs and lowers training loss and token drop rates in experiments.
Adaptive per-layer auxiliary coefficients track token drop and keep load-regularization responsive.
Skywork-MoE (146B, 16 experts, 22B activated params) is competitive on standard benchmarks.
Two investigated 'fixes' gave little or mixed benefit: expert-layer learning-rate scaling and pretraining expert-specialized checkpoints.
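Several findings above are stated in terms of token drop rate. One common way to measure it, sketched here under the assumption of top-1 routing and a fixed per-expert capacity (the summary does not specify the exact definition used), is the fraction of tokens that overflow their expert's capacity. The function name and the `capacity_factor` parameter are hypothetical.

```python
import math
from collections import Counter


def token_drop_rate(expert_ids, num_experts, capacity_factor=1.0):
    """Fraction of tokens dropped because their expert is over capacity.

    expert_ids: the expert index chosen for each token (top-1 routing
    is assumed here for simplicity).
    """
    num_tokens = len(expert_ids)
    if num_tokens == 0:
        return 0.0
    # Each expert accepts at most `capacity` tokens in this batch.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    counts = Counter(expert_ids)
    dropped = sum(max(0, counts.get(e, 0) - capacity) for e in range(num_experts))
    return dropped / num_tokens


# Three of four tokens pick expert 0; with capacity 2, one is dropped.
skewed = token_drop_rate([0, 0, 0, 1], num_experts=2)
balanced = token_drop_rate([0, 1, 0, 1], num_experts=2)
```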
Results
CEVAL
CMMLU
MMLU
GSM8K
MATH
HumanEval
Who Should Care
What To Try In 7 Days
If you have a dense checkpoint, run a short upcycle experiment and a matched small from-scratch run to compare validation loss under your token budget.
Add gating logit normalization to your MoE router: normalize the gate logits before the softmax, start with λ=1, and track the Max1/Max2 gate ratio and the token drop rate.
Implement a per-layer adaptive auxiliary-loss coefficient α: monitor the token drop rate d and update α using f(d)=ξd (start with ξ=0.2, α_max=0.01, β≈0.99).
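The adaptive coefficient step above can be sketched as follows. The summary gives only f(d)=ξd and the starting values, so two things here are assumptions: that α_max caps f(d), and that β acts as an exponential-moving-average factor blending the previous α with f(d). All function and variable names are hypothetical.

```python
def target_coefficient(drop_rate, xi=0.2, alpha_max=0.01):
    """f(d) = xi * d, capped at alpha_max (the cap is an assumption)."""
    return min(xi * drop_rate, alpha_max)


def update_alpha(alpha, drop_rate, beta=0.99, xi=0.2, alpha_max=0.01):
    """Assumed moving-average update: beta * alpha + (1 - beta) * f(d)."""
    return beta * alpha + (1.0 - beta) * target_coefficient(drop_rate, xi, alpha_max)


# One coefficient per MoE layer, each driven by that layer's drop rate.
alphas = [0.001, 0.001, 0.001]
drop_rates = [0.0, 0.02, 0.30]  # hypothetical measurements for one step
alphas = [update_alpha(a, d) for a, d in zip(alphas, drop_rates)]
```

Under this reading, a layer with no dropped tokens slowly relaxes its coefficient toward zero, while a layer with heavy dropping moves toward α_max, so regularization strength follows the observed imbalance per layer.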
Agent Features
Frameworks
- SkyworkMegatron
- Megatron-LM
Architectures
- MoE
- Transformer
Optimization Features
Token Efficiency
- activated parameters reduce per-token compute (22B activated of 146B)
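As a quick worked check of the figures above, the share of parameters active per token follows directly from the two reported counts. Variable names are illustrative, and the parameter fraction is only a rough proxy for per-token compute, since non-expert layers and routing overhead are not modeled here.

```python
total_params_b = 146      # total parameters, in billions (from the summary)
activated_params_b = 22   # parameters activated per token, in billions

# Fraction of the model that participates in each token's forward pass.
activated_fraction = activated_params_b / total_params_b
activated_percent = round(100 * activated_fraction, 1)
```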
Infra Optimization
- 12-way pipeline, 4-way tensor-expert parallelism, 32-way data parallelism, ZeRO-1
Model Optimization
- MoE
- gating logit normalization
System Optimization
- communication reduction for expert parallelism
- overlap communication with computation
Training Optimization
- upcycling from dense checkpoints
- adaptive auxiliary loss coefficients
- Expert Data Parallel (EDP)
- non-uniform pipeline parallelism
- kernel fusion and overlapped communication
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments rely on internal datasets and a SkyPile subset; public reproducibility details are limited.
- Some experiments are small-scale; effects may vary when scaling or on different data mixes.
- Gating normalization and hyperparameters (λ, ξ, α_max) may need retuning per model and dataset.
- Benchmark comparisons cover selected open models; not a full sweep against all large MoEs.
When Not To Use
- Do not upcycle if your MoE budget is >= 2× the dense pretraining cost; train from scratch instead.
- Avoid assuming λ=1 is universally optimal; tune before wide deployment.
- Avoid investing in expert-specialized pretraining unless you can justify the extra compute with clear gains.
Failure Modes
- Gate entropy remains high despite normalization if λ is mis-set, causing near-uniform expert averaging and poor specialization.
- Adaptive auxiliary loss can over-regularize and hurt next-token accuracy if tied too tightly to token-drop without safeguards.
- Large communication overheads or suboptimal parallel configs can reduce GPU utilization and increase wall time.
Core Entities
Models
- Skywork-MoE
- Skywork-13B
- Deepseek-V2
- Deepseek-67B
- Qwen1.5-72B
- Llama2-70B
- Llama3-70B
- Mixtral 8x7B
- Mixtral 8x22B
Metrics
- CEVAL score
- CMMLU score
- MMLU score
- GSM8K accuracy
- MATH score
- HumanEval pass@k-style score
- training loss
- token drop rate
- Max1/Max2 gate ratio
- Model FLOPs Utilization (MFU)
- throughput (tokens/GPU/s)
Datasets
- SkyPile (subset)
- CEVAL
- CMMLU
- MMLU
- GSM8K
- MATH
- HumanEval
Benchmarks
- CEVAL
- CMMLU
- MMLU
- GSM8K
- MATH
- HumanEval
Context Entities
Models
- Switch Transformer
- GLaM
- Deepseek-V1
Metrics
- same as core
Datasets
- Skywork-13B pretraining data (internal; 3.2T tokens plus an extra 2T tokens)
Benchmarks
- same as core

