Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

May 24, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

20

Authors

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

Links

Abstract / PDF

Why It Matters For Business

Combine instruction tuning with MoE to cut runtime compute and cost: instruction-tuned MoE models can match or beat dense baselines while using far fewer FLOPs per token, which lowers inference cost without sacrificing accuracy on many English tasks.

Summary TLDR

The paper shows that sparse Mixture-of-Experts (MoE) Transformer models perform worse than comparable dense models when fine-tuned directly on individual tasks, but gain far more from instruction tuning. After instruction finetuning on a large, diverse instruction collection (the FLAN family), MoE models (FLAN-MOE / FLAN-ST32B) match or beat dense baselines while using much less compute per token (example: FLAN-ST32B reaches 65.4% few-shot MMLU at ~32.1 GFLOPs/token, under 30% of the FLOPs of a FLAN‑PaLM62B equivalent). Key practical points: (1) add an instruction-tuning stage before task finetuning for MoE; (2) routing strategy, auxiliary losses, and some gate-freezing choices matter; (3) MoE models are still weak on multilingual tasks after English-only instruction tuning.

Problem Statement

MoE models add many parameters at little extra per-token compute (sparse activation) but often underperform dense models after direct task finetuning. The paper asks: can instruction tuning (teaching models to follow natural-language instructions) unlock MoE models so they become both more accurate and more compute‑efficient than dense models?

Main Contribution

Empirical demonstration that instruction tuning is essential for MoE: without it, MoE often trails dense models; with it, MoE outperforms dense models on held-out few-shot and zero-shot tasks as well as finetuned tasks.

A family of instruction-tuned MoE models (FLAN-MOE / FLAN-ST variants) and controlled comparisons across sizes and routing strategies.

Design and ablation guidance: routing strategy, auxiliary losses, gate/expert freezing, and hyperparameter notes for stable instruction finetuning of MoE.

Key Findings

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

FLAN-ST32B matches or beats a FLAN‑PaLM62B equivalent while using far less compute.

Numbers: FLAN‑ST32B reaches 65.4% few-shot MMLU at 32.1 GFLOPs/token (<30% of the FLOPs of FLAN‑PaLM62B)

Instruction tuning impact scales with task diversity more than expert count.

Numbers: Performance scales better with the number of instruction tasks than with the number of experts (Figure 1, Section 4.1)

MoE models can overfit and hurt performance if directly single-task fine-tuned.

Numbers: Single-task finetuned MoE sometimes underperforms dense T5 of comparable compute (Figures 6 and 1)

Multilingual performance is weak after English-only instruction tuning.

Numbers: FLAN‑ST32B scores 15.5% on MGSM and 25.1% on TyDiQA

Results

MMLU few-shot (FLAN-ST32B)

Value: 65.4%

Baseline: FLAN-PaLM62B (comparable task suite)

BBH few-shot (FLAN-ST32B)

Value: 54.4%

Baseline: FLAN-PaLM62B

Compute per token (FLAN-ST32B)

Value: 32.1 GFLOPs/token

Baseline: FLAN-PaLM62B (over 100 GFLOPs/token)

Instruction-tuning effect (MoE vs dense)

Value: 7.1% absolute improvement

Baseline: dense models at similar FLOPs

Instruction-tuning boost for ST32B

Value: 45.2% relative boost reported

Baseline: pre-instruction-tuned or dense counterpart
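
The absolute and relative figures in these results measure different things, and the "under 30% of the FLOPs" claim implies a lower bound on the dense baseline's cost. A minimal sketch of that arithmetic follows. The function names and the 50.0 / 57.1 accuracy pair are illustrative placeholders, not numbers from the paper; only 32.1 GFLOPs/token and the 30% bound come from the summary.

```python
def absolute_gain(new_score, baseline_score):
    """Improvement in percentage points."""
    return new_score - baseline_score


def relative_gain(new_score, baseline_score):
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (new_score - baseline_score) / baseline_score


def implied_min_baseline(model_gflops, max_fraction):
    """If the model uses less than `max_fraction` of the baseline's FLOPs,
    the baseline must cost more than this many GFLOPs/token."""
    return model_gflops / max_fraction


# Hypothetical accuracy pair, chosen only to illustrate a 7.1-point gap.
print(round(absolute_gain(57.1, 50.0), 1))
print(round(relative_gain(57.1, 50.0), 1))

# From the summary: 32.1 GFLOPs/token at under 30% of the dense baseline.
print(round(implied_min_baseline(32.1, 0.30), 1))
```

With these placeholder scores, a 7.1-point absolute gain corresponds to a 14.2% relative gain, and the implied dense baseline costs more than roughly 107 GFLOPs/token, which is consistent with the "over 100" figure listed above.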

Who Should Care

What To Try In 7 Days

Instruction-tune your MoE checkpoint on a broad, multi-task instruction set (FLAN-like) before any task-specific finetuning.

Measure FLOPs per token and few-shot MMLU/BBH performance to compare cost vs accuracy with your dense baseline.

Run small ablations: compare expert-choice and token-choice routing, add a load-balancing auxiliary loss, and try freezing the gate (router) weights to stabilize finetuning.
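
To make the routing ablation concrete, here is a minimal pure-Python sketch of the two routing styles and a load-balancing loss. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the balance loss follows the commonly used Switch-Transformer-style form (number of experts times the sum over experts of assignment fraction times mean router probability), and real systems operate on batched tensors with capacity limits.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_choice(logits, k):
    """Token-choice routing: every token keeps its k highest-probability experts."""
    probs = [softmax(row) for row in logits]
    mask = [[0] * len(row) for row in probs]
    for t, row in enumerate(probs):
        for e in sorted(range(len(row)), key=lambda i: -row[i])[:k]:
            mask[t][e] = 1
    return probs, mask


def expert_choice(logits, capacity):
    """Expert-choice routing: every expert keeps its `capacity` highest-scoring tokens."""
    probs = [softmax(row) for row in logits]
    num_tokens, num_experts = len(probs), len(probs[0])
    mask = [[0] * num_experts for _ in range(num_tokens)]
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: -probs[t][e])
        for t in ranked[:capacity]:
            mask[t][e] = 1
    return probs, mask


def balance_loss(probs, mask):
    """Assumed form: E * sum_e (fraction of assignments to e) * (mean router prob of e)."""
    num_tokens, num_experts = len(probs), len(probs[0])
    total_assignments = sum(sum(row) for row in mask)
    loss = 0.0
    for e in range(num_experts):
        frac = sum(mask[t][e] for t in range(num_tokens)) / total_assignments
        mean_prob = sum(probs[t][e] for t in range(num_tokens)) / num_tokens
        loss += frac * mean_prob
    return num_experts * loss


# Toy router logits: 8 tokens, 4 experts.
balanced = [[5.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed = [[20.0, 0.0, 0.0, 0.0] for _ in range(8)]

print(balance_loss(*token_choice(balanced, 1)))   # evenly spread tokens
print(balance_loss(*token_choice(collapsed, 1)))  # every token picks expert 0
```

In this toy setup the loss is lowest when tokens spread evenly across experts and grows toward the number of experts as routing collapses onto one expert. Expert-choice routing fixes each expert's load by construction, which is one reason it is worth comparing against token-choice in an ablation.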

Optimization Features

Token Efficiency

  • Lower FLOPs per token (example: 32.1 GFLOPs/token for FLAN-ST32B)

Infra Optimization

  • No extra per-token inference compute despite the larger parameter count (inactive experts are skipped); all expert weights must still be stored

Model Optimization

  • MoE
  • expert-choice and token-choice routing

System Optimization

  • Gate freezing experiments to stabilize finetuning
  • expert dropout for regularization
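
Gate freezing can be expressed as an update step that skips the router parameters. The sketch below assumes parameters live in a dict keyed by name and that router weights share a `router` name prefix; those conventions, the function name `sgd_step`, and the plain SGD rule are assumptions for illustration rather than the paper's training setup.

```python
def sgd_step(params, grads, lr, frozen_prefixes=("router",)):
    """Return updated params; entries whose name starts with a frozen
    prefix are copied unchanged, all others take one SGD step."""
    updated = {}
    for name, values in params.items():
        if name.startswith(tuple(frozen_prefixes)):
            updated[name] = list(values)  # gate keeps its pretrained value
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Hypothetical parameter names and values.
params = {"router.w": [0.5, -0.5], "expert0.w": [1.0, 2.0]}
grads = {"router.w": [1.0, 1.0], "expert0.w": [1.0, 1.0]}

new_params = sgd_step(params, grads, lr=0.1)
print(new_params["router.w"])   # unchanged
print(new_params["expert0.w"])  # moved against the gradient
```

Passing an empty tuple for `frozen_prefixes` gives the unfrozen baseline, so the same function covers both arms of a freezing ablation.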

Training Optimization

  • Instruction fine-tuning on FLAN collection
  • auxiliary losses (balance-loss, router Z-loss)
  • checkpoint averaging near end of training
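
Two of the items above are small enough to sketch directly. The router z-loss below uses the form commonly associated with ST-MoE (the mean over tokens of the squared log-sum-exp of the router logits), and checkpoint averaging is shown as an element-wise mean over parameter dicts. Function names and the dict-of-lists checkpoint layout are assumptions for illustration, not the paper's code.

```python
import math


def router_z_loss(logits):
    """Mean over tokens of (log-sum-exp of that token's router logits) squared.
    Penalizes large logits, which helps keep the router numerically stable."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)


def average_checkpoints(checkpoints):
    """Element-wise mean of checkpoints stored as {name: [floats]} dicts."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


small = [[0.0, 0.0], [0.1, -0.1]]
large = [[10.0, 0.0], [0.0, 12.0]]
print(router_z_loss(small) < router_z_loss(large))

# Hypothetical late-training checkpoints.
ckpts = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
print(average_checkpoints(ckpts))
```

In practice the z-loss is added to the task loss with a small coefficient alongside the balance loss; the coefficient is a hyperparameter this sketch leaves out.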

Inference Optimization

  • Sparse activation: activate K=1 or K=2 experts per token to cut FLOPs
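
A rough way to see why sparse activation cuts FLOPs: in an MoE feed-forward layer, per-token compute grows with the number of active experts K, while parameter count grows with the total number of experts. The sketch below assumes a standard two-matrix feed-forward expert, counts one multiply-add as 2 FLOPs, and ignores attention and biases; the function names and layer sizes are hypothetical and not taken from the paper.

```python
def moe_ffn_flops_per_token(d_model, d_ff, num_experts, k):
    """Forward FLOPs per token for one MoE feed-forward layer (assumed layout)."""
    router = 2 * d_model * num_experts   # one matmul producing expert logits
    per_expert = 2 * (2 * d_model * d_ff)  # up-projection plus down-projection
    return router + k * per_expert


def moe_ffn_params(d_model, d_ff, num_experts):
    """Weights in the same layer: all experts plus the router."""
    return num_experts * (2 * d_model * d_ff) + d_model * num_experts


# Hypothetical layer sizes.
d_model, d_ff = 1024, 4096
for num_experts in (8, 64):
    flops = moe_ffn_flops_per_token(d_model, d_ff, num_experts, k=2)
    params = moe_ffn_params(d_model, d_ff, num_experts)
    print(num_experts, flops, params)
```

Going from 8 to 64 experts multiplies the layer's parameters by about eight, while per-token FLOPs change only by the small router term.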

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • English-focused instruction finetuning caused poor multilingual performance (e.g., 15.5% MGSM, 25.1% TyDiQA).
  • MoE underperforms dense models when directly single-task finetuned without instruction tuning.
  • Performance gains saturate with more experts; benefits depend more on instruction-task diversity than sheer expert count.

When Not To Use

  • You need strong out-of-the-box multilingual performance without multilingual instruction data.
  • You have only small, single-task labeled data and cannot run broad instruction tuning first.
  • You cannot run the instruction-tuning stage or manage MoE routing/auxiliary-loss complexity.

Failure Modes

  • Overfitting during single-task finetuning leading to worse-than-dense results.
  • Expert collapse or poor expert usage without proper auxiliary loss or instruction diversity.
  • Narrow English specialization after English-only instruction tuning, hurting multilingual tasks.

Core Entities

Models

  • FLAN-MOE
  • FLAN-ST32B
  • FLAN-PaLM62B
  • FLAN-PaLM540B
  • FLAN-T5
  • T5
  • PaLM
  • Switch Transformer
  • GShard
  • FLAN-EC
  • FLAN-GS
  • FLAN-ST

Metrics

  • Accuracy
  • MMLU-Direct
  • BBH-Direct
  • Reasoning-CoT
  • QA-Direct
  • FLOPs per token
  • normalized average (macro over 4 normalized scores)

Datasets

  • FLAN collective (Muffin, T0-SF, NIV2, CoT)
  • GLaM pretraining

Benchmarks

  • MMLU
  • BBH
  • GSM8K
  • SVAMP
  • ASDiv
  • StrategyQA
  • UnifiedQA (elementary science)
  • BoolQ
  • ARC-easy
  • ARC-challenge
  • TyDiQA
  • MGSM

Context Entities

Models

  • PaLM62B
  • PaLM540B
  • ST-MOE baseline models (from prior work)

Datasets

  • BIG‑Bench subset (BBH)
  • MMLU validation set