Mixture-of-Experts Serving Papers — Parsed & Scored for Practitioners

MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% training savings

0.70

0.80

97

DeepSeek-V2 shows you can run a very large-capacity model while activating only ~21B params per token, cutting training GPU-hours and inference memory. That lowers operational cost and lets you serve longer contexts or larger batches on the same hardware.

Key finding

DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.

Numbers: 236B total / 21B activated params
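
As a quick check of the ratio behind these numbers, the sketch below computes the share of parameters active per token. Only the two parameter counts come from the entry above; the variable names are illustrative.

```python
# Parameter counts quoted in the entry above, in billions.
total_params_b = 236.0
active_params_b = 21.0

# Fraction of the model's parameters used for each token.
active_fraction = active_params_b / total_params_b

print(f"active per token: {active_fraction:.1%}")
```

That works out to roughly one parameter in eleven per token.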

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70

0.50

0.80

20

Combine instruction tuning with MoE to cut runtime compute and costs: MoE models can match or beat dense baselines while using far fewer per-token FLOPs, which reduces inference cost without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

BiMediX — a bilingual English/Arabic medical Mixture-of-Experts LLM plus a 1.3M bilingual medical instruction set

0.30

0.60

0.80

4

BiMediX shows you can deliver bilingual medical accuracy with much lower serving cost: similar or better accuracy than large 70B models while running 8x faster, making research deployments and low-latency prototypes cheaper.

Key finding

BiMediX beats Med42 and Meditron on English medical benchmarks.

Numbers: avg +2.5% vs Med42; +4.1% vs Meditron (English benchmarks)

Drop MoE layers or blocks + quantize experts to cut memory and run time with small accuracy loss

0.70

0.40

0.80

3

Coarse structural compression plus 4-bit quantization can cut inference cost and memory enough to run large MoE models on cheaper GPUs while losing only a small fraction of task accuracy.

Key finding

Block Drop + 4-bit quantization produces major runtime and memory reductions while keeping most accuracy.

Numbers: 6.05× speedup; memory 20.0GB; >92% performance (Mixtral-8×7B)

Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

0.60

0.70

3

MoE models can cut per-token compute while keeping high capacity. Pick upcycling when you have a good dense checkpoint and a limited budget; otherwise invest in from-scratch MoE training to maximize final quality.

Key finding

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: 100B tokens ≈ 2/3 C, 300B tokens ≈ 2 C; scratch outperforms upcycled at 300B budget
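
The two budget relations above can be cross-checked with a few lines of arithmetic. This sketch assumes C denotes the dense model's training budget in tokens, as the key finding suggests; the variable names are illustrative.

```python
# Token budgets quoted above, in billions of tokens.
budget_small_b = 100.0   # stated as roughly 2/3 C
budget_large_b = 300.0   # stated as roughly 2 C

# Solve for C from the first relation (C = 100B * 3/2),
# then express the larger budget as a multiple of C.
dense_budget_c_b = budget_small_b * 3 / 2
large_multiple = budget_large_b / dense_budget_c_b

print(f"C is about {dense_budget_c_b:.0f}B tokens")
print(f"300B tokens is about {large_multiple:.1f} C")
```

Both relations agree on C of about 150B tokens, so the crossover where from-scratch training overtakes upcycling lies somewhere between 2/3 C and 2 C.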

One-shot clustering of MLP subunits that preserves NTK to speed up fine-tuning of dense and MoE models

0.70

0.60

0.70

2

MLP Fusion reduces GPU memory and fine-tuning time while preserving training dynamics and near-original accuracy, making low-cost SFT and smaller deployed models feasible for companies running many custom fine-tunes or deploying large MoE models.

Key finding

MLP Fusion yields the lowest NTK approximation error among tested one-shot methods.

Numbers: NTK error on SST2 (RoBERTa first layer): 2826.6 ±155.1 vs SVD 4423.4

Cut expert count in SMoE models up to 75% using gradient-free pruning plus weight merging

0.70

0.60

0.80

2

EEP lowers GPU memory and inference cost for SMoE LLMs and can improve accuracy on specific downstream tasks, making large MoE models more affordable to deploy.

Key finding

EEP can cut the total experts from 8 to 2 (72% parameter drop) while keeping comparable task performance.

Numbers: 72% parameter reduction (8→2)
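
Removing 6 of 8 experts is a 75% cut in expert count but only a 72% cut in parameters, because non-expert weights (attention, embeddings) remain. The sketch below backs out the expert share implied by those two figures, assuming all experts are the same size and only expert weights are removed. That assumption is not stated in the entry.

```python
experts_before = 8
experts_after = 2
reported_param_reduction = 0.72  # from the entry above

# Share of experts removed.
expert_reduction = (experts_before - experts_after) / experts_before

# Assumption: only expert weights are removed, so
#   param_reduction = expert_share * expert_reduction
# and the expert share of total parameters follows from the two numbers.
implied_expert_share = reported_param_reduction / expert_reduction

print(f"experts removed: {expert_reduction:.0%}")
print(f"implied expert share of parameters: {implied_expert_share:.0%}")
```

Under these assumptions, experts account for about 96% of the model's parameters, which is why expert pruning translates almost one-to-one into memory savings.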

Post-training expert pruning and per-token expert skipping cut MoE memory and speed up inference with small accuracy tradeoffs

0.70

0.60

0.80

1

Post-training expert pruning and online skipping lower GPU needs and speed up MoE models with small, controllable accuracy loss, letting teams deploy expensive MoE LLMs on fewer GPUs and reduce inference cost.

Key finding

Pruning 2 of 8 experts (r=6) reduces Mixtral 8x7B memory enough to deploy on a single 80 GB GPU.

Numbers: Memory r=6 = 68,383 MB (76% of original 89,926 MB) — Table 9
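
A minimal sketch of the fit check implied above. It assumes the "MB" figures are MiB and that an 80 GB card exposes 80 × 1024 MiB; neither is stated in the entry, and real deployments also need headroom beyond the reported footprint.

```python
original_mb = 89_926          # unpruned footprint from the entry above
pruned_mb = 68_383            # footprint at r=6 from the entry above
gpu_capacity_mb = 80 * 1024   # assumption: 80 GiB usable, MB treated as MiB

ratio = pruned_mb / original_mb
saved_mb = original_mb - pruned_mb
fits_before = original_mb <= gpu_capacity_mb
fits_after = pruned_mb <= gpu_capacity_mb

print(f"pruned / original: {ratio:.0%}, saved {saved_mb} MB")
print(f"fits on one GPU before: {fits_before}, after: {fits_after}")
```

The conclusion is the same if the capacity is taken as a decimal 80,000 MB: the unpruned model exceeds one card and the r=6 model does not.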

Split FFNs into sparse experts + a teacher-guided router to cut FLOPs and adapt LLMs with tiny data

0.50

0.60

1

FactorLLM can cut FFN compute and lower inference costs significantly while enabling fast domain adaptation with tiny datasets, enabling cheaper, faster deployment for task-specific LLMs.

Key finding

FFN FLOPs can be cut sharply by activating fewer experts.

Numbers: FFN GFLOPs reduced ~75% for 1R4E1K

UNCURL: cluster-and-merge pruning for Mixture-of-Experts that cuts experts at inference while keeping task accuracy

0.60

0.70

1

If you plan to deploy SMoE models on multi-GPU setups, pretraining with many experts can improve accuracy but raises inference latency and memory costs. UNCURL gives a practical offline path to reduce the expert count for a given task while often keeping accuracy.

Key finding

Naïve one-shot pruning by expert activation frequency hurts performance across tasks.

Numbers: Pruned 354M+(32e→8e) scores lower than 354M+32e on many tasks (Table 1/2)

Combine Mixture-of-Experts with LoRA and simple QA pairs to update LLMs without heavy data engineering

0.60

1

MoRAL lets teams update model knowledge cheaply and robustly using plain QA pairs and a small set of adapter parameters, reducing retraining cost and helping models stay current without wholesale re-training.

Key finding

Open-book recall accuracy improves substantially after providing context and/or MoRAL fine-tuning.

Numbers: Phi-2: open-book RA 0.82 vs closed-book RA 0.63 (MoRAL fine-tuned) → +30.15% relative (Table 1).
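
The relative gain follows from the two recall-accuracy scores. This sketch recomputes it from the rounded values quoted above; the variable names are illustrative.

```python
open_book_ra = 0.82    # from the entry above
closed_book_ra = 0.63  # from the entry above

# Relative improvement of open-book over closed-book recall accuracy.
relative_gain = (open_book_ra - closed_book_ra) / closed_book_ra

print(f"relative gain: {relative_gain:.2%}")
```

The rounded inputs give a figure within a few hundredths of a percentage point of the 30.15% quoted above; the small gap presumably comes from rounding of the underlying scores.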

Decouple concepts from language: an MoE design that keeps strong multilingual accuracy and cuts token costs

0.60

0.70

1

SUTRA reduces non‑English inference cost while improving accuracy in many widely spoken languages, letting companies deploy one efficient model globally instead of many costly language-specific models.

Key finding

SUTRA shows large non-English gains on MMLU over GPT-3.5.

Numbers: Hindi: SUTRA 68 vs GPT-3.5 39 (+29 pts)

Reduce KV memory and bandwidth for MoE inference by sharding, routing, compression, and adaptive scheduling

0.60

0.70

0

If you serve MoE models or long-context LLMs, KV cache memory and network traffic are major cost drivers. PiKV helps cut per-GPU memory and inter-GPU bandwidth by sharding, selective access, and compression—reducing infrastructure needs or enabling longer contexts on the same cluster.

Key finding

Expert-sharded storage changes per-device memory scaling from proportional to E·L to proportional to L/G + L/E (analytic).

Numbers: per-device memory: O(E·L) -> O(L/G + L/E) (Section 3.1 analytic)
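
To get a feel for the two scaling laws, the sketch below evaluates them in relative units. It assumes E is the number of experts, L the sequence length, and G the number of GPUs (the entry does not define the symbols), and it ignores the constant factors that big-O notation hides, so only the trend is meaningful.

```python
def kv_units_unsharded(num_experts: int, seq_len: int) -> float:
    """Relative per-device KV footprint without sharding: grows as E * L."""
    return float(num_experts * seq_len)

def kv_units_sharded(num_experts: int, seq_len: int, num_gpus: int) -> float:
    """Relative per-device KV footprint with expert sharding: L/G + L/E."""
    return seq_len / num_gpus + seq_len / num_experts

# Illustrative values, not taken from the paper.
E, L, G = 8, 32_768, 4
before = kv_units_unsharded(E, L)
after = kv_units_sharded(E, L, G)

print(f"unsharded: {before:.0f} units, sharded: {after:.0f} units")
```

The point of the analytic result is that adding experts or GPUs shrinks the sharded term, whereas the unsharded term grows with the expert count.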

GRIFFIN: training-free sequence-level neuron selection that cuts FF work by 50% and speeds up generation

0.70

0.60

0.70

0

GRIFFIN cuts active FF work by half with no retraining, offering real latency and memory wins for deploy-time generation while preserving most task quality on evaluated models and datasets.

Key finding

GRIFFIN keeps performance near the full model at 50% FF sparsity on classification tasks.

Numbers: HellaSwag Llama 2 7B: 57.16 -> 57.11 accuracy (full -> GRIFFIN)
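
One plausible reading of "sequence-level neuron selection" is sketched below: score each feed-forward neuron by its aggregate activation over the prompt, then keep the top half for generation. The scoring statistic (per-token normalization followed by a sum of squares), the function name, and the toy data are assumptions for illustration, not details taken from the entry.

```python
import math

def select_ff_neurons(prompt_activations, keep_fraction=0.5):
    """Pick FF neurons to keep for generation from prompt-time activations.

    prompt_activations: list of per-token lists of FF hidden activations.
    Returns sorted indices of the neurons with the largest aggregate score.
    """
    num_neurons = len(prompt_activations[0])
    scores = [0.0] * num_neurons
    for token_acts in prompt_activations:
        # Normalize each token's vector so no single token dominates the score.
        norm = math.sqrt(sum(a * a for a in token_acts)) or 1.0
        for j, a in enumerate(token_acts):
            scores[j] += (a / norm) ** 2
    keep = max(1, int(num_neurons * keep_fraction))
    top = sorted(range(num_neurons), key=lambda j: scores[j], reverse=True)[:keep]
    return sorted(top)

# Toy prompt: 3 tokens, 4 FF neurons.
acts = [
    [0.9, 0.0, 0.1, 0.4],
    [0.8, 0.1, 0.0, 0.5],
    [1.0, 0.0, 0.2, 0.3],
]
kept = select_ff_neurons(acts, keep_fraction=0.5)
print(kept)
```

Because the selection happens once per sequence, the pruned FF block can be reused for every generated token without retraining or per-token routing.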

Radial Networks: token-level routing that skips whole layers to cut compute and latency

0.60

0.70

0

Routing tokens around low-impact layers can cut average compute and latency per request, reducing serving costs and allowing larger models for the same budget.

Key finding

Per-layer residual contributions shrink as model size grows.

Numbers: OPT-125M median residual ratio ≈ 20%; OPT-66B ≈ 5.9%

Prune MoEs first at the expert level, then at fine granularity inside experts: fast, single-GPU, and better than either pruning method alone

0.70

0.80

0

STUN cuts serving memory and GPUs for large MoE models while keeping generation quality, enabling cheaper deployment of very large sparse models without costly retraining.

Key finding

STUN retains GSM8K generation accuracy at 40% sparsity on Snowflake Arctic with almost no loss.

Numbers: Arctic GSM8K: unpruned 70.74 → STUN (40%) 70.28 (Table 2)

APEX: fast, extensible simulator that finds cost- and energy-efficient parallel plans for LLM serving

0.70

0.60

0.80

0

APEX lets ops teams find faster or cheaper LLM serving configurations without burning expensive GPU hours, enabling targeted trade-offs between latency, throughput, and energy while meeting SLOs.

Key finding

APEX prediction fidelity is high.

Numbers: average relative error = 10.7%

CMoE: turn dense FFNs into MoE in minutes to get ~1.4–1.6× end-to-end speedups

0.70

0.60

0.70

0

CMoE lets teams cut LLM inference cost quickly by turning FFNs sparse without long retraining, enabling faster deployment and cheaper serving while allowing small targeted fine-tunes to regain quality.

Key finding

Training-free CMoE runs immediately at 25% activation, though perplexity stays well above dense until a small fine-tune narrows the gap.

Numbers: WikiText-2 PPL: Dense 5.27, CMoE 25% TF 62.30; CMoE 25% FT 12.73 (Table 1)

Use a few verified examples plus public LoRA models and instructions to cheaply build task experts via a diversity-aware mixture-of-experts

0.70

0.60

0.70

0

You can build task-specialist LLMs cheaply by reusing public LoRA adapters and a handful of verified examples, cutting data collection and compute vs full finetuning while gaining measurable accuracy improvements.

Key finding

The proposed pipeline yields higher average accuracy than strong MoE baselines on the tested tasks.

Numbers: LLaMA2-7B avg 52.50% vs Arrow 50.68% (+1.82); Mistral-7B avg 72.77% vs Arrow 71.53% (+1.24)

Make large Mixture-of-Experts models run faster on edge GPUs by prefetching experts using adjacent-layer gate inputs

0.60

0.70

0

Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers—enabling richer, privacy-friendly on-device LLM features with small accuracy trade-offs.

Key finding

Cross-layer prefetch achieves very high prediction accuracy without retraining.

Numbers: 97.15% prefetch accuracy (by transferring experts above 75th-confidence percentile)

RLFA: use sports-style free agency to replace underperforming agents in multi-agent MoE systems

0.40

0.60

0.50

0

RLFA reduces downtime from outdated models by automatically replacing weak agents and limits risk from new models via probation, improving resilience in changing or adversarial domains.

Key finding

Replacing a degraded fraud agent restored detection performance.

Numbers: incumbent accuracy fell 95%→75%; shadow agent 88%→>90%

Freeze pretrained MoE experts, aggregate only shared layers, and graft one personalized expert per client for efficient federated LLM tuning

0.70

0.65

0.70

0

FLEx lowers federated communication and avoids corrupting pretrained knowledge, enabling client-specific LLM behavior with smaller bandwidth and safer global models.

Key finding

FLEx improves average instruction-following quality over federated baselines.

Numbers: ROUGE-L avg 43.13 (FLEx) vs 42.37 (best federated baseline on Table 1)

Add frequency-aware experts to a Mixture-of-Experts Transformer and pretrain to cut forecasting error on public and commercial series

0.70

0.65

0.60

0

MoFE-Time improves short- to long-horizon forecasting where periodic patterns exist. The model lowers error vs a leading MoE baseline and keeps inference costs similar, so operational forecasting (store traffic, sales, energy) can be more accurate without extra latency.

Key finding

MoFE-Time improves average forecast error on six public benchmarks compared to Time‑MoE.

Numbers: Average MSE 0.2755, MAE 0.3226; MSE↓ 6.95%, MAE↓ 6.02% vs Time‑MoE
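
The baseline errors are not quoted, but they can be backed out from the reported averages and percentage reductions. This sketch assumes the reductions are relative to Time-MoE's averages on the same six benchmarks; the resulting baseline values are derived here, not reported figures.

```python
mofe_mse, mofe_mae = 0.2755, 0.3226   # reported averages from the entry above
mse_drop, mae_drop = 0.0695, 0.0602   # reported relative reductions

# If new = baseline * (1 - drop), then baseline = new / (1 - drop).
implied_baseline_mse = mofe_mse / (1 - mse_drop)
implied_baseline_mae = mofe_mae / (1 - mae_drop)

print(f"implied Time-MoE MSE: {implied_baseline_mse:.4f}")
print(f"implied Time-MoE MAE: {implied_baseline_mae:.4f}")
```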

Most experts in an MoE LLM never fire on MMLU; gating is near-uniform and experts vary in accuracy

0.50

0.40

0.50

0

You can likely shrink or speed up MoE models on quiz-style tasks by removing inactive experts and by tuning routing to favor high-performing experts, cutting compute and fine-tuning cost without retraining from scratch.

Key finding

Most experts never activate on MMLU.

Numbers: >60% of 64 experts never activated
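
A finding like this comes from counting which experts appear in a routing trace. The sketch below shows that bookkeeping on a toy trace, then works out how many experts "more than 60% of 64" implies. The function name, trace format, and toy data are hypothetical.

```python
import math
from collections import Counter

def never_activated(selected_experts, num_experts):
    """Return ids of experts that never appear in the routing trace."""
    counts = Counter(e for token_choice in selected_experts for e in token_choice)
    return [e for e in range(num_experts) if counts[e] == 0]

# Toy trace: each tuple holds the top-2 expert ids chosen for one token.
trace = [(0, 3), (3, 5), (0, 5), (1, 3)]
idle = never_activated(trace, num_experts=8)
idle_share = len(idle) / 8
print(idle, f"{idle_share:.0%}")

# "More than 60% of 64 experts" means at least this many idle experts.
min_idle_experts = math.floor(0.60 * 64) + 1
print(min_idle_experts)
```

Experts that stay idle on one benchmark may still fire on other data, so a trace from the target workload is the safer basis for any pruning decision.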