Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
20
Why It Matters For Business
Combine instruction tuning with MoE to cut runtime compute and cost: instruction-tuned MoE models can match or beat dense baselines while using far fewer FLOPs per token, which lowers inference cost without sacrificing accuracy on many English tasks.
Summary TLDR
The paper shows that sparse Mixture-of-Experts (MoE) Transformer models perform worse than comparable dense models when fine-tuned directly on downstream tasks, but gain far more from instruction tuning. After instruction finetuning on a large, diverse instruction collection (FLAN family), MoE models (FLAN-MOE / FLAN-ST32B) match or beat dense baselines while using much less compute per token (example: FLAN-ST32B gets 65.4% on few-shot MMLU at ~32.1 GFLOPs/token, under 30% of the FLOPs of a FLAN‑PaLM62B equivalent). Key practical points: (1) add an instruction-tuning stage before task finetuning for MoE; (2) routing strategy, auxiliary losses, and some gate-freezing choices matter; (3) MoE models are still weak on multilingual tasks after English-only instruction tuning.
Problem Statement
MoE models add lots of parameters cheaply (sparse activation) but often underperform dense models after direct task finetuning. The paper asks: can instruction tuning (teaching models to follow natural-language instructions) unlock MoE models so they become both more accurate and more compute‑efficient than dense models?
Main Contribution
Empirical demonstration that instruction tuning is essential for MoE: without it, MoE often trails dense models; with it, MoE outperforms dense models on held-out few/zero-shot and finetuned tasks.
A family of instruction-tuned MoE models (FLAN-MOE / FLAN-ST variants) and controlled comparisons across sizes and routing strategies.
Design and ablation guidance: routing strategy, auxiliary losses, gate/expert freezing, and hyperparameter notes for stable instruction finetuning of MoE.
Key Findings
Instruction tuning benefits MoE models more than dense models.
FLAN-ST32B matches or beats a FLAN‑PaLM62B equivalent while using far less compute.
Instruction tuning impact scales with task diversity more than expert count.
MoE models can overfit and hurt performance if directly single-task fine-tuned.
Multilingual performance is weak after English-only instruction tuning.
Results
MMLU few-shot (FLAN-ST32B)
BBH few-shot (FLAN-ST32B)
Compute per token (FLAN-ST32B)
Instruction-tuning effect (MoE vs dense)
Instruction-tuning boost for ST32B
Who Should Care
What To Try In 7 Days
Instruction-tune your MoE checkpoint on a broad, multi-task instruction set (FLAN-like) before any task-specific finetuning.
Measure FLOPs per token and few-shot MMLU/BBH performance to compare cost vs accuracy with your dense baseline.
Run small ablations: compare expert-choice and token-choice routing, add a balance auxiliary loss, and try freezing the gate weights to stabilize finetuning.
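The cost-versus-accuracy comparison in the second step can be kept honest with a small helper like the one below. This is a minimal sketch, not tooling from the paper: `ModelResult`, `compare_to_baseline`, and the dense-baseline numbers are hypothetical and only illustrate the bookkeeping. The FLAN-ST32B values are the ones quoted in this summary.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    gflops_per_token: float  # measured forward-pass GFLOPs per token
    accuracy: float          # benchmark accuracy in percent


def compare_to_baseline(candidate: ModelResult, baseline: ModelResult) -> dict:
    """Return relative compute and the accuracy gap versus a dense baseline."""
    return {
        "flops_ratio": candidate.gflops_per_token / baseline.gflops_per_token,
        "accuracy_delta": candidate.accuracy - baseline.accuracy,
    }


# FLAN-ST32B numbers are taken from the summary above. The dense baseline
# values are HYPOTHETICAL placeholders; replace them with your own measurements.
moe = ModelResult("FLAN-ST32B", gflops_per_token=32.1, accuracy=65.4)
dense = ModelResult("my-dense-baseline", gflops_per_token=120.0, accuracy=64.0)

report = compare_to_baseline(moe, dense)
print(f"{moe.name} uses {report['flops_ratio']:.1%} of baseline FLOPs, "
      f"accuracy delta {report['accuracy_delta']:+.1f} points")
```

Reporting the ratio and the delta side by side makes it harder to claim a compute win that was actually bought with an accuracy loss.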
Optimization Features
Token Efficiency
- Lower FLOPs per token (example: 32.1 GFLOPs/token for FLAN-ST32B)
Infra Optimization
- No extra per-token inference compute despite the larger parameter count (inactive experts are skipped; all expert weights must still be stored)
Model Optimization
- MoE
- expert-choice and token-choice routing
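The difference between the two routing families is easiest to see on a toy example. The sketch below is an illustration in plain Python, not the paper's implementation; the function names and the toy router scores are assumptions. In token-choice routing each token picks its top-k experts, while in expert-choice routing each expert picks a fixed number of tokens, which balances load by construction.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_choice(logits, k):
    """Each token selects its k highest-probability experts.

    logits[t][e] is the router score of token t for expert e.
    Returns assignments[t] = list of expert indices.
    """
    probs = [softmax(row) for row in logits]
    return [
        sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:k]
        for p in probs
    ]


def expert_choice(logits, capacity):
    """Each expert selects its `capacity` highest-probability tokens.

    Returns assignments[e] = list of token indices.
    """
    probs = [softmax(row) for row in logits]
    num_experts = len(logits[0])
    return [
        sorted(range(len(probs)), key=lambda t: probs[t][e], reverse=True)[:capacity]
        for e in range(num_experts)
    ]


# Toy router scores: 4 tokens, 2 experts; every token prefers expert 0.
logits = [[2.0, 0.0], [1.5, 0.5], [1.0, 0.2], [0.9, 0.8]]
print(token_choice(logits, k=1))          # expert 0 receives all four tokens
print(expert_choice(logits, capacity=2))  # each expert receives exactly two tokens
```

With these scores, token-choice sends every token to the same expert, which is the kind of imbalance an auxiliary balance loss is meant to discourage; expert-choice avoids it but lets some tokens be picked by a less-preferred expert.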
System Optimization
- Gate freezing experiments to stabilize finetuning
- expert dropout for regularization
Training Optimization
- Instruction fine-tuning on FLAN collection
- auxiliary losses (balance-loss, router Z-loss)
- checkpoint averaging near end of training
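For readers unfamiliar with the two auxiliary losses, the sketch below uses the commonly cited formulations: the Switch Transformer load-balancing loss (number of experts times the sum over experts of dispatch fraction times mean router probability) and the ST-MoE router z-loss (mean squared log-sum-exp of the router logits). Whether the paper uses exactly these forms and which coefficients it applies are assumptions here; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def balance_loss(logits):
    """Load-balancing loss: N * sum_i f_i * P_i (top-1 dispatch assumed)."""
    num_tokens, num_experts = len(logits), len(logits[0])
    probs = [softmax(row) for row in logits]
    # f[i]: fraction of tokens whose top-1 expert is i.
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / num_tokens
    # P[i]: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


def router_z_loss(logits):
    """Router z-loss: mean over tokens of (log-sum-exp of logits) squared."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(logits)


balanced = [[3.0, 0.0], [0.0, 3.0]]   # tokens split evenly across experts
collapsed = [[3.0, 0.0], [3.0, 0.0]]  # every token goes to expert 0
print(balance_loss(balanced), balance_loss(collapsed))
print(router_z_loss(balanced))
```

The balance loss grows when tokens collapse onto one expert, and the z-loss penalizes large router logits; in practice both are added to the task loss with small weights.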
Inference Optimization
- Sparse activation: activate K=1 or K=2 experts per token to cut FLOPs
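The following toy forward pass illustrates why activating only K experts cuts compute: expert work scales with K, not with the total number of experts. It is a sketch under stated assumptions, not the paper's layer: experts are scalar functions, and gate weights are renormalized with a softmax over the selected experts only, which is one common choice among several.

```python
import math


def sparse_moe_forward(token, router_logits, experts, k):
    """Run only the k highest-scoring experts for one token.

    `experts` is a list of callables. The output is the gate-weighted sum of
    the selected experts' outputs, with gates renormalized over the top k
    (an assumed design choice for this sketch).
    """
    top = sorted(range(len(experts)), key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    weights = [math.exp(router_logits[e] - m) for e in top]
    norm = sum(weights)
    output = sum(w / norm * experts[e](token) for w, e in zip(weights, top))
    return output, top


calls = []  # records which experts actually executed


def make_expert(scale):
    """Build a toy expert that multiplies its input by `scale`."""
    def expert(x):
        calls.append(scale)
        return scale * x
    return expert


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
output, active = sparse_moe_forward(1.0, [0.0, 5.0, 1.0, 5.0], experts, k=2)
print(active, output, len(calls))
```

Only two of the four experts run for this token; the other two contribute parameters to the model but no FLOPs to this forward pass.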
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- English-focused instruction finetuning caused poor multilingual performance (e.g., 15.5% MGSM, 25.1% TyDiQA).
- MoE underperforms dense models when directly single-task finetuned without instruction tuning.
- Performance gains saturate with more experts; benefits depend more on instruction-task diversity than sheer expert count.
When Not To Use
- You need strong out-of-the-box multilingual performance without multilingual instruction data.
- You have only small, single-task labeled data and cannot run broad instruction tuning first.
- You cannot run the instruction-tuning stage or manage MoE routing/auxiliary-loss complexity.
Failure Modes
- Overfitting during single-task finetuning leading to worse-than-dense results.
- Expert collapse or poor expert usage without proper auxiliary loss or instruction diversity.
- Narrow English specialization after English-only instruction tuning, hurting multilingual tasks.
Core Entities
Models
- FLAN-MOE
- FLAN-ST32B
- FLAN-PaLM62B
- FLAN-PaLM540B
- FLAN-T5
- T5
- PaLM
- Switch Transformer
- GShard
- FLAN-EC
- FLAN-GS
- FLAN-ST
Metrics
- Accuracy
- MMLU-Direct
- BBH-Direct
- Reasoning-CoT
- QA-Direct
- FLOPs per token
- normalized average (macro over 4 normalized scores)
Datasets
- FLAN collective (Muffin, T0-SF, NIV2, CoT)
- GLaM pretraining
Benchmarks
- MMLU
- BBH
- GSM8K
- SVAMP
- ASDiv
- StrategyQA
- UnifiedQA (elementary science)
- BoolQ
- ARC-easy
- ARC-challenge
- TyDiQA
- MGSM
Context Entities
Models
- PaLM62B
- PaLM540B
- ST-MOE baseline models (from prior work)
Datasets
- BIG‑Bench subset (BBH)
- MMLU validation set

