Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

May 24, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper backs claims with extensive benchmark comparisons and ablations across model sizes and routing choices, showing consistent FLOPs-to-accuracy tradeoffs; results are strong for English tasks but show clear multilingual and finetuning caveats.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

Why It Matters For Business

Combining instruction tuning with MoE cuts runtime compute and cost: MoE models can match or beat dense baselines while using far fewer FLOPs per token, which reduces inference cost without sacrificing accuracy on many English tasks.

Who Should Care

Summary TLDR

The paper shows that sparse Mixture-of-Experts (MoE) Transformer models perform worse than comparable dense models when fine-tuned directly on individual tasks, but gain far more from instruction tuning. After instruction finetuning on a large, diverse instruction collection (the FLAN family), MoE models (FLAN-MOE / FLAN-ST32B) match or beat dense baselines while using far less compute per token. For example, FLAN-ST32B reaches 65.4% few-shot MMLU at ~32.1 GFLOPs/token, under 30% of the FLOPs of a FLAN‑PaLM62B equivalent. Key practical points: (1) add an instruction-tuning stage before task finetuning for MoE; (2) routing strategy, auxiliary losses, and some gate-freezing choices matter; (3) MoE models are still weak in some areas, notably multilingual benchmarks.

Problem Statement

MoE models add lots of parameters cheaply (sparse activation) but often underperform dense models after direct task finetuning. The paper asks: can instruction tuning (teaching models to follow natural-language instructions) unlock MoE models so they become both more accurate and more compute‑efficient than dense models?
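The "parameters are cheap" point can be made concrete with a small counting sketch. It assumes a plain two-matrix feed-forward expert (no gated activation, no biases) and a linear router; the function name and the layer sizes in the usage note are hypothetical and are not taken from the paper.

```python
def moe_ffn_param_counts(d_model: int, d_ff: int, num_experts: int, k: int) -> dict:
    """Count stored vs. per-token active weights for one sparse MoE FFN layer.

    Assumption: each expert is an up-projection (d_model x d_ff) plus a
    down-projection (d_ff x d_model), and the router is one d_model x num_experts
    matrix. Biases are ignored.
    """
    if not 1 <= k <= num_experts:
        raise ValueError("k must be between 1 and num_experts")
    per_expert = 2 * d_model * d_ff          # up + down projection weights
    router = d_model * num_experts           # router weights, always used
    return {
        "total": num_experts * per_expert + router,       # weights stored
        "active_per_token": k * per_expert + router,      # weights a token touches
    }
```

With the toy sizes `moe_ffn_param_counts(4, 16, 8, 2)`, the layer stores 1056 weights but each token uses only 288 of them. Raising `num_experts` grows the first number while the second changes only through the small router term.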

Main Contribution

Empirical demonstration that instruction tuning is essential for MoE: without it, MoE often trails dense models; with it, MoE outperforms dense models on held-out few/zero-shot and finetuned tasks.

A family of instruction-tuned MoE models (FLAN-MOE / FLAN-ST variants) and controlled comparisons across sizes and routing strategies.

Key Findings

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

Practical Use: If you train MoE, run a broad instruction-tuning stage first; expect multi-point gains on zero/few-shot benchmarks.

Evidence Ref: Section 3.2; Figure 2

FLAN-ST32B matches or beats a FLAN‑PaLM62B equivalent while using far less compute.

Numbers: FLAN‑ST32B: 65.4% few-shot MMLU; 32.1 GFLOPs/token (<30% FLOPs of FLAN‑PaLM62B)

Practical Use: You can reach PaLM‑class accuracy with a MoE model that costs ~1/3 the per-token compute, which is useful for cost-sensitive inference.

Evidence Ref: Section 3.3; Table 1
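The compute claim can be sanity-checked with a few lines of arithmetic. This summary gives only the MoE figure (32.1 GFLOPs/token) and the ratio bound (under 30%); the dense baseline's per-token FLOPs are not stated here, so the sketch derives only the lower bound those two numbers imply. Function names are illustrative.

```python
def implied_dense_gflops(moe_gflops_per_token: float, ratio_bound: float) -> float:
    """Smallest dense-baseline GFLOPs/token consistent with 'MoE uses < ratio_bound of it'."""
    if moe_gflops_per_token <= 0:
        raise ValueError("moe_gflops_per_token must be positive")
    if not 0.0 < ratio_bound <= 1.0:
        raise ValueError("ratio_bound must be in (0, 1]")
    return moe_gflops_per_token / ratio_bound


def compute_ratio(moe_gflops_per_token: float, dense_gflops_per_token: float) -> float:
    """Fraction of the dense model's per-token compute that the MoE model uses."""
    if dense_gflops_per_token <= 0:
        raise ValueError("dense_gflops_per_token must be positive")
    return moe_gflops_per_token / dense_gflops_per_token


if __name__ == "__main__":
    bound = implied_dense_gflops(32.1, 0.30)
    print(f"Dense baseline would need more than {bound:.1f} GFLOPs/token")
```

Under these inputs the implied dense baseline is above roughly 107 GFLOPs/token. To compare against your own dense model, pass its measured per-token FLOPs to `compute_ratio`.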

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
MMLU few-shot (FLAN-ST32B) | 65.4% | FLAN-PaLM62B (comparable task suite) | - | MMLU few-shot | Table 1; Section 3.3 | Table 1
BBH few-shot (FLAN-ST32B) | 54.4% | FLAN-PaLM62B | - | BBH few-shot | Table 1; Section 3.3 | Table 1

What To Try In 7 Days

Instruction-tune your MoE checkpoint on a broad, multi-task instruction set (FLAN-like) before any task-specific finetuning.

Measure FLOPs per token and few-shot MMLU/BBH performance to compare cost vs accuracy with your dense baseline.

Run small ablations: compare expert-choice vs token-choice routing, add a load-balancing auxiliary loss, and try freezing gate weights to stabilize finetuning.
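For the routing ablation, it helps to see how the two strategies differ on the same router scores. The following is a minimal pure-Python sketch, not the paper's implementation: in token-choice routing each token keeps its top-k experts, while in expert-choice routing each expert keeps a fixed number of tokens. The `capacity` parameter and the toy logits are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_choice(logits, k):
    """For each token, return the indices of its k highest-probability experts."""
    assignments = []
    for row in logits:
        probs = softmax(row)
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        assignments.append(sorted(top))
    return assignments


def expert_choice(logits, capacity):
    """For each expert, return the indices of its `capacity` highest-scoring tokens."""
    probs = [softmax(row) for row in logits]
    num_experts = len(logits[0])
    picks = []
    for e in range(num_experts):
        top = sorted(range(len(probs)), key=lambda t: probs[t][e], reverse=True)[:capacity]
        picks.append(sorted(top))
    return picks


# Toy router logits: 4 tokens x 2 experts, all tokens mildly prefer expert 0.
toy_logits = [[2.0, 0.0], [1.5, 0.2], [1.0, 0.5], [0.1, 0.0]]
```

On `toy_logits`, `token_choice(toy_logits, 1)` sends every token to expert 0, whereas `expert_choice(toy_logits, 2)` gives tokens 0 and 1 to expert 0 and tokens 2 and 3 to expert 1. Expert-choice is balanced by construction; token-choice generally relies on an auxiliary loss to stay balanced.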

Optimization Features

Token Efficiency
Lower FLOPs per token (example: 32.1 GFLOPs/token for FLAN-ST32B)
Infra Optimization
No extra per-token inference compute despite the larger parameter count (inactive experts are skipped)
Model Optimization
MoE; expert-choice and token-choice routing
System Optimization
Gate freezing experiments to stabilize finetuning; expert dropout for regularization
Training Optimization
Instruction fine-tuning on FLAN collection; auxiliary losses (balance-loss, router Z-loss); checkpoint averaging near end of training
Inference Optimization
Sparse activation: activate K=1 or K=2 experts per token to cut FLOPs
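The two auxiliary losses named above can be sketched in a few lines. This follows the commonly used formulations from the Switch Transformer (load-balance loss) and ST-MoE (router z-loss) papers; whether this paper uses exactly these forms and which coefficients it applies is not stated in this summary, so treat the code as an assumption-laden illustration.

```python
import math


def _softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def load_balance_loss(logits):
    """Switch-style balance loss: num_experts * sum_i f_i * P_i.

    f_i is the fraction of tokens whose top-1 expert is i, and P_i is the mean
    router probability for expert i. A perfectly balanced router scores 1.0.
    """
    n_tokens, n_experts = len(logits), len(logits[0])
    probs = [_softmax(row) for row in logits]
    counts = [0] * n_experts
    for p in probs:
        counts[max(range(n_experts), key=lambda e: p[e])] += 1
    f = [c / n_tokens for c in counts]
    mean_p = [sum(p[e] for p in probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, mean_p))


def router_z_loss(logits):
    """ST-MoE-style z-loss: mean over tokens of (logsumexp of router logits) squared."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(logits)
```

A router that spreads tokens evenly, such as `[[1.0, 0.0], [0.0, 1.0]]`, gets a balance loss of 1.0, while one that sends everything to a single expert, such as `[[3.0, 0.0], [3.0, 0.0]]`, scores higher. The z-loss penalizes large logit magnitudes regardless of balance. In training, both terms would be scaled by small coefficients and added to the task loss.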

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

English-focused instruction finetuning caused poor multilingual performance (e.g., 15.5% MGSM, 25.1% TyDiQA).

MoE underperforms dense models when directly single-task finetuned without instruction tuning.

When Not To Use

You need strong out-of-the-box multilingual performance without multilingual instruction data.

You have only small, single-task labeled data and cannot run broad instruction tuning first.

Failure Modes

Overfitting during single-task finetuning leading to worse-than-dense results.

Expert collapse or poor expert usage without proper auxiliary loss or instruction diversity.
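One simple way to watch for expert collapse is to track how evenly tokens are spread across experts. The sketch below computes a normalized entropy of expert usage from a list of top-1 assignments; this diagnostic and its function name are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def expert_usage_entropy(assignments, num_experts):
    """Normalized entropy of expert usage, in [0, 1].

    `assignments` holds one expert index per routed token. A value near 1.0
    means tokens are spread evenly; a value near 0.0 means most tokens hit
    one expert, which is a sign of collapse.
    """
    if num_experts < 2:
        raise ValueError("num_experts must be at least 2")
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    total = len(assignments)
    entropy = 0.0
    for expert in range(num_experts):
        p = counts.get(expert, 0) / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(num_experts)
```

Logging this value per MoE layer during finetuning gives an early warning: a steady drop toward 0 suggests the router is concentrating on a few experts and that the auxiliary loss weight or the data mix may need attention.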

Core Entities

Models

FLAN-MOE, FLAN-ST32B, FLAN-PaLM62B, FLAN-PaLM540B, FLAN-T5, T5, PaLM, Switch Transformer, GShard, FLAN-EC, FLAN-GS, FLAN-ST

Metrics

Accuracy, MMLU-Direct, BBH-Direct, Reasoning-CoT, QA-Direct, FLOPs per token, normalized average (macro over 4 normalized scores)

Datasets

FLAN collective (Muffin, T0-SF, NIV2, CoT), GLaM pretraining

Benchmarks

MMLU, BBH, GSM8K, SVAMP, ASDiv, StrategyQA, UnifiedQA (elementary science), BoolQ, ARC-easy, ARC-challenge, TyDiQA, MGSM

Context Entities

Models

PaLM62B, PaLM540B, ST-MOE baseline models (from prior work)

Datasets

BIG‑Bench subset (BBH), MMLU validation set