Fine-grained Mixture-of-Experts (G=8) cuts training steps and improves accuracy at 56B scale

June 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jakub Krajewski, Marcin Chochowski, Daniel Korzekwa

Why It Matters For Business

Fine-grained MoE can match or beat MoE variants that spend more compute per token, while needing fewer training steps. At roughly 50B-parameter scale, that means a higher-quality model for the same training budget, or cheaper inference at the same quality.

Summary TLDR

This paper studies fine-grained MoE — splitting FFN layers into many smaller experts and routing tokens to multiple experts — and shows it can improve convergence and downstream accuracy up to 56B total parameters. In matched-FLOPs comparisons (G=8 vs G=1), fine-grained MoE often lowers validation loss and raises average benchmark accuracy. Gains grow with longer pretraining; careful router design (softmax after Top-k) and standard load-balancing are critical. The paper provides recipes, ablations, and practical warnings about hardware and router training.

Problem Statement

Standard MoE uses few large experts. Recent fine-grained MoE (many small experts) may improve training efficiency and quality, but its scaling behavior and practical training choices are not well evaluated at large scales. This work measures convergence, downstream accuracy, and training design up to 56B parameters to give actionable guidance.

Main Contribution

Controlled empirical comparison of standard vs fine-grained MoE up to 56B total (17B active) parameters.

Practical training recipes and ablations: router ordering, load balancing, expert capacity, and continued pretraining.

Quantified effect of granularity (G=8) on validation loss, downstream benchmarks, and training step savings across token budgets.

Key Findings

Fine-grained MoE (G=8) lowers validation loss and raises average benchmark scores versus standard MoE at large scale.

Numbers: 56B: average accuracy 1xG1=57.3 -> 1xG8=59.0; validation loss 1.811 -> 1.779

Fine-grained MoE reduces training steps needed to reach baseline loss, with savings growing for longer training.

Numbers: 11B models: 1xG8 saves 27.9% of steps at 50B tokens; 33.6% at 100B
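
The step-savings figure above can be read as "how much earlier does the fine-grained run reach the baseline's final loss". A minimal sketch of that arithmetic, assuming loss is logged at increasing step counts; the `step_savings` helper and the toy curve are hypothetical illustrations, not values or code from the paper:

```python
def step_savings(baseline_final_loss, candidate_curve, baseline_steps):
    """Fraction of baseline steps saved by the candidate run.

    candidate_curve: list of (step, validation_loss), sorted by step.
    Returns None if the candidate never reaches the baseline's final loss.
    """
    for step, loss in candidate_curve:
        if loss <= baseline_final_loss:
            return 1.0 - step / baseline_steps
    return None


# Toy numbers for illustration only (not taken from the paper).
toy_curve = [(1000, 2.60), (2000, 2.40), (3000, 2.30), (4000, 2.25), (5000, 2.21)]
savings = step_savings(2.30, toy_curve, baseline_steps=5000)
print(savings)
```

In practice the crossing point would usually be interpolated between logged steps; this sketch simply takes the first logged step at or below the target.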

At matched total parameters, 1xFLOPs-G8 can match 2xFLOPs-G1 performance at roughly half the non-router FLOPs per token.

Numbers: 11B, 100B tokens: 1xG8 ~ 2xG1 in final loss; 56B: 1xG8 matches 2xG1

Router learning lags early in training: routers initially concentrate on the top-1 expert and only later utilize extra experts.

Numbers: router logits show strong top-1 mass early and spread over time (Fig. 4)

Applying softmax after Top-k selection improved validation loss for fine-grained models.

Numbers: 1xG8 validation loss: softmax-before=2.219 -> softmax-after=2.183; 2xG8: 2.194 -> 2.166
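
The two router orderings differ only in where the normalisation happens. A minimal plain-Python sketch for a single token, assuming a hypothetical `route` helper (not the paper's implementation). Whether the softmax-before variant renormalises its kept weights is not stated in this summary, so the sketch leaves them unnormalised:

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, k, softmax_after_topk):
    """Return {expert_index: gate_weight} for one token."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if softmax_after_topk:
        # Normalise over the k selected logits only: weights sum to 1.
        weights = softmax([logits[i] for i in top])
    else:
        # Normalise over all experts, then keep the k largest: weights sum to < 1.
        probs = softmax(logits)
        weights = [probs[i] for i in top]
    return dict(zip(top, weights))


logits = [2.0, 1.0, 0.5, -1.0]  # toy router logits for 4 experts
before = route(logits, k=2, softmax_after_topk=False)
after = route(logits, k=2, softmax_after_topk=True)
print(before, after)
```

Both orderings pick the same experts; only the gate weights applied to the expert outputs change.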

Results

Accuracy

Value: 1xG1=48.4% -> 1xG8=50.6%

Baseline: 1xFLOPs-G1

11B validation loss

Value: 1xG1=2.233 -> 1xG8=2.183

Baseline: 1xFLOPs-G1

Training step savings (reach baseline loss)

Value: 1xG8 saved 27.9% of steps

Baseline: 1xFLOPs-G1 at 50B tokens

Accuracy (56B)

Value: 1xG1=57.3% -> 1xG8=59.0% -> 2xG1=58.8% -> 2xG8=60.5%

Baseline: 1xFLOPs-G1

56B validation loss

Value: 1xG1=1.811 -> 1xG8=1.779 -> 2xG8=1.757

Baseline: 1xFLOPs-G1

Router ordering effect (validation loss)

Value: 1xG8: softmax-before=2.219 -> softmax-after=2.183

Baseline: softmax before Top-k

Who Should Care

What To Try In 7 Days

Train a matched-FLOPs prototype with granularity G=8 on your dataset to compare validation loss and sample efficiency.

Switch router ordering to softmax-after-Top-k for k>1 and measure validation loss change.

Instrument router logits and expert load; verify load balancing and monitor early-stage top-1 concentration.
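
For the instrumentation step, one possible starting point is to log two numbers per router: how much of the Top-k gate weight sits on the top-1 expert, and how unevenly tokens are spread over experts. The sketch below assumes router logits are available as a list of per-token lists; `router_stats` and its "top-1 mass" definition (top-1 share of the Top-k probability mass) are illustrative choices, not the paper's exact metric:

```python
import math
from collections import Counter


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def router_stats(batch_logits, k):
    """Summarise top-1 concentration and expert load for one batch."""
    num_experts = len(batch_logits[0])
    load = Counter()
    top1_mass = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:k]
        load.update(top)
        # Share of the selected experts' probability held by the top-1 expert.
        top1_mass += probs[top[0]] / sum(probs[i] for i in top)
    mean_load = k * len(batch_logits) / num_experts
    return {
        "top1_mass": top1_mass / len(batch_logits),
        "load": [load[e] for e in range(num_experts)],
        "max_over_mean_load": max(load.values()) / mean_load,
    }


# Toy logits: 3 tokens, 4 experts (illustration only).
toy_batch = [[3.0, 0.0, 0.0, 0.0], [3.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 1.0]]
stats = router_stats(toy_batch, k=2)
print(stats)
```

A `top1_mass` close to 1 early in training would be consistent with the top-1 concentration described in the findings; a `max_over_mean_load` well above 1 points to imbalance.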

Optimization Features

Token Efficiency

  • Reported training-step savings of roughly 33–39% for fine-grained variants over longer training horizons

Model Optimization

  • Fine-grained MoE: using many smaller experts preserves non-router FLOPs while enlarging the expert pool
  • Match total params and non-router FLOPs when comparing configurations
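
The matching rule above can be checked with simple parameter arithmetic. The sketch below assumes the common definition of granularity G: each expert's hidden size is divided by G while the number of experts and Top-k are multiplied by G. It also assumes a two-matrix (ungated) FFN expert. `MoEConfig`, `fine_grain`, and the toy dimensions are hypothetical, not the paper's configurations:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    d_model: int
    d_ff: int          # hidden size of ONE expert
    num_experts: int
    top_k: int

    def total_expert_params(self):
        # Assumes two weight matrices per expert (up and down projection).
        return self.num_experts * 2 * self.d_model * self.d_ff

    def active_expert_params(self):
        return self.top_k * 2 * self.d_model * self.d_ff

    def router_params(self):
        return self.d_model * self.num_experts


def fine_grain(cfg, g):
    """Split every expert into g smaller ones and route to g times as many."""
    if cfg.d_ff % g:
        raise ValueError("d_ff must be divisible by the granularity")
    return MoEConfig(cfg.d_model, cfg.d_ff // g, cfg.num_experts * g, cfg.top_k * g)


base = MoEConfig(d_model=1024, d_ff=4096, num_experts=8, top_k=2)  # toy Top-2, G=1
fine = fine_grain(base, 8)                                          # Top-16, G=8
print(fine)
print(base.active_expert_params(), fine.active_expert_params())
print(base.router_params(), fine.router_params())
```

Under these assumptions total and active expert parameters are unchanged, while the router grows by a factor of G, which is why the comparison is phrased in terms of non-router FLOPs.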

System Optimization

  • Ensure expert parallel mapping yields balanced token load across devices to avoid stragglers
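
A quick way to sanity-check a mapping is to aggregate per-expert token counts by device and compare the busiest device with the average. The sketch assumes experts are placed on devices in contiguous, equally sized blocks; both helpers and the token counts are hypothetical:

```python
def device_loads(expert_load, num_devices):
    """Sum per-expert token counts per device (contiguous expert blocks)."""
    if len(expert_load) % num_devices:
        raise ValueError("experts must divide evenly across devices")
    per_device = len(expert_load) // num_devices
    return [
        sum(expert_load[d * per_device:(d + 1) * per_device])
        for d in range(num_devices)
    ]


def straggler_ratio(loads):
    """Busiest device relative to the mean; 1.0 means perfectly even."""
    return max(loads) / (sum(loads) / len(loads))


# Toy token counts for 8 experts on 4 devices (illustration only).
even = device_loads([120, 80, 100, 100, 60, 140, 90, 110], num_devices=4)
skewed = device_loads([300, 100, 50, 50, 50, 50, 100, 100], num_devices=4)
print(even, straggler_ratio(even))
print(skewed, straggler_ratio(skewed))
```

Note that the first example is uneven per expert but even per device, so the device-level view is the one that matters for stragglers.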

Training Optimization

  • Softmax after Top-k improves fine-grained training (when k>1)
  • Use load-balancing auxiliary loss and capacity factor to avoid expert collapse
  • Continue pretraining on a filtered high-quality dataset to boost benchmarks
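
The summary does not say which load-balancing loss the paper uses. As one common choice, the sketch below follows the Switch Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability), together with the usual capacity formula. Function names and toy inputs are hypothetical:

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(batch_logits, k):
    """Switch-style auxiliary loss; equals 1.0 when routing is perfectly even."""
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    frac = [0.0] * n_experts       # fraction of token-to-expert assignments
    mean_prob = [0.0] * n_experts  # mean router probability per expert
    for logits in batch_logits:
        probs = softmax(logits)
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:k]
        for e in top:
            frac[e] += 1.0 / (n_tokens * k)
        for e in range(n_experts):
            mean_prob[e] += probs[e] / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


def expert_capacity(n_tokens, n_experts, k, capacity_factor):
    """Maximum tokens an expert accepts per batch before overflow handling."""
    return math.ceil(capacity_factor * n_tokens * k / n_experts)


# Toy batches: each token prefers a different expert vs. all prefer expert 0.
balanced = [[2.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
collapsed = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]
balanced_loss = load_balancing_loss(balanced, k=1)
collapsed_loss = load_balancing_loss(collapsed, k=1)
capacity = expert_capacity(n_tokens=1024, n_experts=8, k=2, capacity_factor=1.25)
print(balanced_loss, collapsed_loss, capacity)
```

The loss is scaled by a small coefficient and added to the language-modelling loss; the coefficient is a tuning choice not specified here.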

Inference Optimization

  • Higher granularity (G=8) can match higher-activation variants while activating fewer experts, reduci

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Hardware and MFU differences can change practical efficiency; paper assumes uniform MFU across variants.
  • Experiments focus on pretraining only; finetuning and deployment trade-offs are untested.
  • Router learns slowly early, so gains depend on long enough token budgets and router training choices.
  • Implementation complexity and efficient expert sharding for very high granularity require extra engineering.

When Not To Use

  • When you have a very short pretraining budget (small token horizon), as fine-grained gains are smaller early.
  • If your hardware or sharding strategy cannot balance expert load across devices.
  • In k=1 (Top-1) setups: softmax over a single selected logit is always 1, so softmax-after-Top-k passes no gradient to the router.

Failure Modes

  • Router concentrates on top-1 expert early, negating extra Top-k activations until later training.
  • Expert load imbalance causing stragglers if load-balancing or capacity is not tuned.
  • Implementation overhead and poor MFU can erase theoretical training-step savings.

Core Entities

Models

  • Fine-grained MoE (G=8)
  • Switch-like MoE (Top-1, G=1)
  • Mixtral-like MoE (Top-2, G=1)
  • Mixtral-like fine-grained (Top-16, G=8)

Metrics

  • Validation loss
  • Accuracy
  • Training step savings (%)

Datasets

  • Large diverse multilingual corpus (text+code) up to 300B tokens
  • High-quality filtered alignment-style QA (continued pretraining)

Benchmarks

  • ARC-Challenge
  • ARC-Easy
  • CommonsenseQA
  • HellaSwag
  • MMLU
  • OpenBookQA
  • PIQA
  • RACE
  • SocialIQA
  • TruthfulQA
  • WinoGrande