Most experts in an MoE LLM never fire on MMLU; gating is near-uniform and experts vary in accuracy

Overview

Decision Snapshot: Needs Validation

This single-model, single-benchmark analysis offers useful practical pointers (pruning and gating fixes) but has limited generality; validate on your own MoE checkpoints and downstream tasks.

Citations: 0

Evidence Strength: 0.40

Confidence: 0.60

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 40%

Authors

Andrei Chernov

Why It Matters For Business

You can likely shrink or speed MoE models on quiz-style tasks by removing inactive experts and by tuning routing to favor high-performing experts, cutting compute and fine-tuning cost without retraining from scratch.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

The paper inspects the OLMoE mixture-of-experts LLM on the quiz-style MMLU benchmark (14,042 questions). Key findings: over 60% of the 64 experts per layer were never activated; the gating outputs across the top-8 experts are close to uniform (entropy near the top-8 maximum); and some experts are much more accurate than others (example: one expert hits ~80% vs another ~34%). The author suggests pruning inactive experts and reweighting gating toward high-performing experts as practical steps.

Problem Statement

MoE layers are widely used but little work analyzes which experts actually contribute at inference time. For quiz-style tasks, we need to know how many experts are used, whether gating is truly sparse, and if experts differ in accuracy. Answers can guide pruning, routing fixes, and robustness checks.

Main Contribution

Per-expert activation analysis of OLMoE on the MMLU quiz benchmark (14,042 questions).

Measurement of gating output entropy, showing that top-8 gating is closer to uniform than sparse across layers.

Key Findings

Most experts never activate on MMLU.

Numbers: >60% of 64 experts never activated

Practical Use: On quiz-style data you can likely prune many experts to shrink the model and cut fine-tuning cost without losing evaluated accuracy.

Evidence Ref: Tables 1-2, Conclusion
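The inactive-expert figure above can be reproduced in spirit with a simple count over routing decisions. The following is a minimal sketch, not the paper's code: it assumes you have already collected, for one layer, the list of expert ids selected for each token (the names `routed`, `inactive_experts`, and `inactive_fraction` are hypothetical, and how activations are collected depends on your framework).

```python
from collections import Counter


def inactive_experts(routed, num_experts):
    """Return the sorted ids of experts that never appear in `routed`.

    `routed` is an iterable of per-token lists of selected expert ids
    for ONE layer (assumed input; collection is framework-specific).
    """
    counts = Counter(e for token_experts in routed for e in token_experts)
    return [e for e in range(num_experts) if counts[e] == 0]


def inactive_fraction(routed, num_experts):
    """Share of experts in the layer that were never selected."""
    return len(inactive_experts(routed, num_experts)) / num_experts


# Toy example: 8 experts, top-2 routing, 4 tokens.
toy_routed = [[0, 1], [1, 2], [0, 2], [1, 5]]
print(inactive_experts(toy_routed, 8))   # [3, 4, 6, 7]
print(inactive_fraction(toy_routed, 8))  # 0.5
```

For a model like the one analyzed here, `num_experts` would be 64 and the count would be repeated per layer.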

Gating outputs are near-uniform rather than very sparse.

Numbers: Top-8 entropy per layer ≈ 1.85–2.05 (max 2.0794)

Practical Use: The intended sparse routing is weakened; consider changing gating losses or routing rules to restore sparsity or purposefully rebias routing.

Evidence Ref: Tables 1-3
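The quoted maximum of 2.0794 is the natural-log entropy of a uniform distribution over 8 experts, ln 8. A minimal sketch of the measurement follows, under the assumption that the top-8 gate probabilities are renormalized to sum to 1 before the entropy is taken (this summary does not say whether the paper does so); the function names are illustrative only.

```python
import math


def entropy(probs):
    """Natural-log Shannon entropy; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_k_entropy(gate_probs, k=8):
    """Entropy of the k largest gate probabilities, renormalized to sum to 1."""
    top = sorted(gate_probs, reverse=True)[:k]
    total = sum(top)
    return entropy([p / total for p in top])


print(round(math.log(8), 4))  # 2.0794, the maximum for 8 experts

uniform = [1 / 8] * 8          # every selected expert weighted equally
peaked = [0.93] + [0.01] * 7   # one expert dominates (sparse-like gating)
print(top_k_entropy(uniform))  # close to the ln 8 maximum
print(top_k_entropy(peaked))   # much lower
```

Per-layer means of 1.85 to 2.05 therefore sit close to the uniform case rather than the peaked one.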

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Percent of experts never activated	>60% of 64 experts	—	—	MMLU (all 57 subjects, 14,042 questions)	Tables 1-2 report activated expert counts; conclusion states >60% inactive	Tables 1-2
Gating entropy (top 8)	mean ≈ 1.85–2.05 (max possible 2.0794)	sparse expected (low entropy)	—	MMLU	Tables 1-2 show mean and std per layer; Table 3 reports top-8 probabilities	Tables 1-3

What To Try In 7 Days

Run per-expert activation counts on your MoE model using a representative dataset to find inactive experts.

Simulate pruning inactive experts and measure inference and fine-tuning time plus accuracy on a held-out quiz set.

Inspect per-expert accuracy; experiment with simple gating reweighting to upweight top experts and measure net accuracy change.
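The last step above can be prototyped offline before touching the model. This is a rough sketch under stated assumptions: per-expert accuracy is taken as the share of correctly answered questions among those where the expert was active (the paper's exact definition is not given in this summary), and the reweighting rule (scale each gate weight by accuracy to the power `alpha`, then renormalize) is an illustrative choice, not the author's method. All names are hypothetical.

```python
from collections import defaultdict


def per_expert_accuracy(records):
    """records: iterable of (active_expert_ids, is_correct), one per question."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for experts, is_correct in records:
        for e in set(experts):  # count each expert once per question
            total[e] += 1
            hits[e] += int(is_correct)
    return {e: hits[e] / total[e] for e in total}


def reweight_gates(gates, accuracy, alpha=1.0, default=0.5):
    """Scale each selected expert's gate weight by accuracy**alpha.

    gates: {expert_id: gate_weight} for one token's selected experts.
    Weights are renormalized so their total mass is unchanged.
    Experts with no measured accuracy fall back to `default` (assumption).
    """
    scaled = {e: w * accuracy.get(e, default) ** alpha for e, w in gates.items()}
    mass, scaled_mass = sum(gates.values()), sum(scaled.values())
    if scaled_mass == 0:
        return dict(gates)  # nothing to favor; keep the original weights
    return {e: w * mass / scaled_mass for e, w in scaled.items()}


# Toy data: 3 experts, 4 questions.
records = [([0, 1], True), ([0, 2], True), ([1, 2], False), ([0, 1], True)]
acc = per_expert_accuracy(records)
new_gates = reweight_gates({0: 0.5, 2: 0.5}, acc)
print(acc[0], acc[2])  # 1.0 0.5
print(new_gates)       # expert 0 now carries more weight than expert 2
```

Setting `alpha=0` recovers the original gate weights, which gives a convenient baseline when measuring the net accuracy change.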

Optimization Features

Model Optimization

expert pruning (remove inactive experts)

model routing analysis

Training Optimization

adjust auxiliary gating loss to rebalance activations

Accuracy

Inference Optimization

reduce compute by pruning inactive experts

monitor Top-K sensitivity to small input changes
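Before physically removing expert weights, pruning can be simulated by excluding the candidate experts from routing and checking whether selections change. The sketch below is hypothetical: `route_top_k` is not an API of any library, and taking a softmax over only the selected logits is an illustrative normalization that may differ from how a given MoE model normalizes its gate weights.

```python
import math


def route_top_k(logits, k, pruned=frozenset()):
    """Select the top-k experts by router logit, skipping pruned experts.

    Returns {expert_id: weight}, with a softmax over the selected logits.
    Assumes at least one expert remains after pruning.
    """
    candidates = [(logit, e) for e, logit in enumerate(logits) if e not in pruned]
    top = sorted(candidates, reverse=True)[:k]
    m = max(logit for logit, _ in top)  # subtract the max for numerical stability
    exps = [(e, math.exp(logit - m)) for logit, e in top]
    z = sum(w for _, w in exps)
    return {e: w / z for e, w in exps}


logits = [2.0, 1.0, 0.5, -1.0]  # toy router logits for 4 experts
full = route_top_k(logits, k=2)
safe = route_top_k(logits, k=2, pruned={3})    # expert 3 was never selected
unsafe = route_top_k(logits, k=2, pruned={1})  # expert 1 was in use
print(sorted(full), sorted(safe), sorted(unsafe))  # [0, 1] [0, 1] [0, 2]
```

Pruning an expert that was never selected leaves routing untouched on this input, while pruning an active one forces a substitute expert, which is the case that needs an accuracy check on held-out data.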

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://arxiv.org/abs/2009.03300 (MMLU)

Risks & Boundaries

Limitations

Experiment uses only one MoE model (OLMoE) and one benchmark (MMLU); results may not generalize.

Focused on quiz-style single-token multiple-choice tasks; behavior may differ on generative or open-ended tasks.

When Not To Use

Don't assume pruning inactive experts is safe for non-quiz or generative tasks without validation.

Avoid applying these gating reweighting suggestions to models with different MoE sizes or different training regimes without testing.

Failure Modes

Pruning experts that are rarely activated on MMLU could remove rare but critical capabilities needed elsewhere.

Reweighting gating toward high-accuracy experts may reduce diversity and hurt out-of-distribution or multi-domain performance.

Core Entities

Models

OLMoE-1B-7B-0125-Instruct

Metrics

natural entropy (gating)

Accuracy

activated experts count

Datasets

MMLU

Benchmarks

MMLU
