Most experts in an MoE LLM never fire on MMLU; gating is near-uniform and experts vary in accuracy

February 24, 2025 · 6 min

Overview

Decision Snapshot: Needs Validation

Single-model, single-benchmark analysis gives useful pointers for practice (pruning and gating fixes) but limited generality; validate on your MoE checkpoints and downstream tasks.

Citations: 0

Evidence Strength: 0.40

Confidence: 0.60

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 50%

Novelty: 40%

Authors

Andrei Chernov

Links

Abstract / PDF / Data

Why It Matters For Business

You can likely shrink or speed up MoE models on quiz-style tasks by removing inactive experts and by tuning routing to favor high-performing experts, cutting compute and fine-tuning cost without retraining from scratch.

Who Should Care

Summary TLDR

The paper inspects the OLMoE mixture-of-experts LLM on the quiz-style MMLU benchmark (14,042 questions). Key findings: over 60% of the 64 experts per layer were never activated; the gating outputs across the top-8 experts are close to uniform (entropy near the top-8 maximum); and some experts are much more accurate than others (example: one expert hits ~80% vs another ~34%). The author suggests pruning inactive experts and reweighting gating toward high-performing experts as practical steps.

Problem Statement

MoE layers are widely used, but little work analyzes which experts actually contribute at inference time. For quiz-style tasks, we need to know how many experts are used, whether gating is truly sparse, and whether experts differ in accuracy. The answers can guide pruning, routing fixes, and robustness checks.

Main Contribution

Per-expert activation analysis of OLMoE on the MMLU quiz benchmark (14,042 questions).

Measurement of gating output entropy, showing that top-8 gating is closer to uniform than to sparse across layers.

Key Findings

Most experts never activate on MMLU.

Numbers: >60% of 64 experts never activated

Practical Use: On quiz-style data you can likely prune many experts to shrink the model and cut fine-tuning cost without losing evaluated accuracy.

Evidence Ref: Tables 1-2, Conclusion
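A minimal sketch of how such an activation count could be reproduced for one layer, assuming you have already logged which expert indices the router selected for each token. The function name `never_activated`, the input layout, and the toy routing log are illustrative assumptions, not the paper's code.

```python
from collections import Counter


def never_activated(routed, num_experts=64):
    """Count expert activations for one MoE layer.

    routed: iterable of per-token lists of selected expert indices
            (hypothetical logging format, assumed for this sketch).
    Returns the activation counts and the experts that never fired.
    """
    counts = Counter(e for token_experts in routed for e in token_experts)
    inactive = [e for e in range(num_experts) if counts[e] == 0]
    return counts, inactive


# Toy routing log (made-up values): 3 tokens, top-2 routing, 8 experts.
toy_routed = [[0, 3], [3, 5], [0, 5]]
counts, inactive = never_activated(toy_routed, num_experts=8)
inactive_share = len(inactive) / 8
print(inactive, inactive_share)
```

On a real checkpoint the same count would be run per layer over the whole evaluation set, and the share of inactive experts compared against the >60% figure reported here.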

Gating outputs are near-uniform rather than very sparse.

Numbers: Top-8 entropy per layer ≈ 1.85 to 2.05 (max 2.0794)

Practical Use: The intended sparse routing is weakened; consider changing gating losses or routing rules to restore sparsity or purposefully rebias routing.

Evidence Ref: Tables 1-3
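The maximum of 2.0794 is the natural-log entropy of a uniform distribution over 8 experts, ln 8. The sketch below shows one way to compute a comparable number, assuming the entropy is taken over the top-8 gate probabilities after renormalizing them to sum to 1; that renormalization step, the function name `topk_entropy`, and the toy inputs are assumptions for illustration.

```python
import math


def topk_entropy(gate_probs, k=8):
    """Natural-log entropy of the k largest gate probabilities.

    The top-k values are renormalized to sum to 1 before the entropy
    is computed (an assumption of this sketch).
    """
    top = sorted(gate_probs, reverse=True)[:k]
    total = sum(top)
    renorm = [p / total for p in top]
    return -sum(p * math.log(p) for p in renorm if p > 0)


max_entropy = math.log(8)  # upper bound for 8 experts, about 2.0794

# Toy gate outputs over 64 experts (made-up values).
uniform = topk_entropy([1 / 64] * 64)              # perfectly flat router
peaked = topk_entropy([0.9] + [0.1 / 63] * 63)     # one dominant expert
print(max_entropy, uniform, peaked)
```

A flat router reaches the bound, while a router dominated by one expert falls well below it, which is why per-layer values of 1.85 to 2.05 read as near-uniform.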

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Percent of experts never activated | >60% of 64 experts | - | - | MMLU (all 57 subjects, 14,042 questions) | Tables 1-2 report activated expert counts; conclusion states >60% inactive | Tables 1-2
Gating entropy (top 8) | mean ≈ 1.85 to 2.05 (max possible 2.0794) | sparse expected (low entropy) | - | MMLU | Tables 1-2 show mean and std per layer; Table 3 reports top-8 probabilities | Tables 1-3

What To Try In 7 Days

Run per-expert activation counts on your MoE model using a representative dataset to find inactive experts.

Simulate pruning inactive experts and measure inference and fine-tuning time plus accuracy on a held-out quiz set.

Inspect per-expert accuracy; experiment with simple gating reweighting to upweight top experts and measure net accuracy change.
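For the per-expert accuracy step, one simple definition is the share of correctly answered questions among those on which a given expert was activated. The sketch below uses that definition as an assumption; the paper may measure it differently, and the `records` format and toy data are hypothetical.

```python
from collections import defaultdict


def per_expert_accuracy(records):
    """Accuracy of the questions on which each expert was activated.

    records: iterable of (activated_experts, is_correct) pairs, one per
             question (hypothetical logging format for this sketch).
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for experts, is_correct in records:
        for e in set(experts):  # count each expert once per question
            seen[e] += 1
            hits[e] += int(is_correct)
    return {e: hits[e] / seen[e] for e in seen}


# Toy evaluation log (made-up values): 4 questions, 3 experts.
toy_records = [([0, 1], True), ([1, 2], False), ([0, 2], True), ([1], True)]
accuracy = per_expert_accuracy(toy_records)
ranked = sorted(accuracy, key=accuracy.get, reverse=True)
print(accuracy, ranked)
```

The resulting ranking gives a starting point for a gating-reweighting experiment; any reweighting should still be checked against a held-out set, since an expert's score here also reflects which questions it happened to be routed.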

Optimization Features

Model Optimization: expert pruning (remove inactive experts); model routing analysis
Training Optimization: adjust auxiliary gating loss to rebalance activations; Accuracy
Inference Optimization: reduce compute by pruning inactive experts; monitor Top-K sensitivity to small input changes

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiment uses only one MoE model (OLMoE) and one benchmark (MMLU); results may not generalize.

Focused on quiz-style single-token multiple-choice tasks; behavior may differ on generative or open-ended tasks.

When Not To Use

Don't assume pruning inactive experts is safe for non-quiz or generative tasks without validation.

Avoid applying these gating reweighting suggestions to models with different MoE sizes or different training regimes without testing.

Failure Modes

Pruning experts that are rarely activated on MMLU could remove rare but critical capabilities needed elsewhere.

Reweighting gating toward high-accuracy experts may reduce diversity and hurt out-of-distribution or multi-domain performance.

Core Entities

Models

OLMoE-1B-7B-0125-Instruct

Metrics

natural entropy (gating), accuracy, activated experts count

Datasets

MMLU

Benchmarks

MMLU