Sparse Transformers Papers — Parsed & Scored for Practitioners

MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% training savings

0.70

0.80

97

DeepSeek-V2 shows that a very high-capacity model can activate only ~21B params per token, cutting training GPU-hours and inference memory. That lowers operational cost and lets you serve longer contexts or larger batches on the same hardware.

Key finding

DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.

Numbers: 236B total / 21B activated params
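
As a quick sanity check on those figures, the share of parameters used per token can be computed directly. This is a minimal sketch: the constant and function names are illustrative, and only the 236B and 21B values come from the entry above.

```python
# Figures quoted in the entry above, in billions of parameters.
TOTAL_PARAMS_B = 236
ACTIVE_PARAMS_B = 21


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that a single token activates."""
    return active_b / total_b


frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"active share per token: {frac:.1%}")
```

Roughly 9% of the weights take part in any one token's forward pass, which is where the per-token compute saving comes from. All 236B parameters still have to be stored somewhere.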

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70

0.50

0.80

20

Combine instruction tuning with MoE to cut runtime compute and cost. MoE models can match or beat dense baselines while using far fewer per-token FLOPs, which reduces inference cost without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

Practical training recipes for MoE LLMs: when to upcycle, and how to diversify experts

0.60

0.70

3

MoE models can cut per-token compute while keeping high capacity. Pick upcycling when you have a good dense checkpoint and a limited budget; otherwise invest in from-scratch MoE to maximize final quality.

Key finding

Upcycling vs scratch depends on training budget: with ~2× dense training budget, from-scratch MoE wins; with smaller budgets, upcycling helps.

Numbers: 100B tokens ≈ 2/3 C, 300B tokens ≈ 2 C; scratch outperforms upcycled at 300B budget
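
The two budget points above imply the same value of C, which can be checked with a few lines of arithmetic. The sketch below assumes C is the dense checkpoint's training budget measured in tokens. The helper names and the hard 2× cutoff are illustrative simplifications of the finding, not a rule stated by the paper.

```python
from fractions import Fraction


def implied_dense_budget(tokens_b: int, multiple_of_c: Fraction) -> Fraction:
    """Solve tokens = multiple * C for C (in billions of tokens)."""
    return Fraction(tokens_b) / multiple_of_c


def suggested_recipe(moe_budget_b, dense_budget_b, threshold=Fraction(2)) -> str:
    """Rule of thumb: from-scratch once the MoE budget reaches ~2x the dense budget."""
    ratio = Fraction(moe_budget_b) / Fraction(dense_budget_b)
    return "from-scratch" if ratio >= threshold else "upcycle"


c_small = implied_dense_budget(100, Fraction(2, 3))  # from "100B tokens ≈ 2/3 C"
c_large = implied_dense_budget(300, Fraction(2))     # from "300B tokens ≈ 2 C"
print(c_small, c_large)

print(suggested_recipe(100, c_small))
print(suggested_recipe(300, c_small))
```

Both points give C = 150B tokens. The source reports only these two budgets, so where the crossover sits between them is not pinned down.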

Survey of techniques, hardware, and trade-offs to run LLMs directly on phones and edge devices

0.70

0.55

0.80

2

On-device LLMs cut latency, protect user data, and lower cloud bills—key benefits for mobile apps, privacy-focused services, and offline products.

Key finding

Edge AI market projected to grow nearly tenfold to $143.6B by 2032.

Numbers: 2022 $15.2B → 2032 $143.6B; CAGR 25.9%

Attention is provably n^C-sparse: use α·n^C top entries for stable sparse attention

0.60

0.70

1

Using a context-aware window k = α·n^C can cut attention compute while keeping accuracy; dynamic windows can be tuned to a given compute budget and often beat tiny fixed windows on long inputs.

Key finding

Keeping Ω(n^C) top entries per query suffices for vanishing approximation error.

Numbers: Error = O(1/n^C) when k = Ω(n^C) (Theorem 6.1)
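
The window rule k = α·n^C can be sketched as top-k attention in NumPy. This is an illustration only, under a few assumptions: the function names, the default α and C, and the absence of a causal mask are all choices made here. The code still computes every score before selecting, so it shows the selection rule rather than a speedup.

```python
import math
import numpy as np


def window_size(n: int, alpha: float, c: float) -> int:
    """k = ceil(alpha * n**c), clamped to the valid range [1, n]."""
    return max(1, min(n, math.ceil(alpha * n ** c)))


def topk_sparse_attention(q, keys, values, alpha=1.0, c=0.5):
    """Softmax attention restricted to the k highest-scoring keys per query.

    q: (m, d) queries, keys: (n, d), values: (n, dv). No causal mask is applied.
    Returns the (m, dv) output and the window size k that was used.
    """
    n, d = keys.shape
    k = window_size(n, alpha, c)
    scores = q @ keys.T / math.sqrt(d)                           # (m, n)
    top_idx = np.argpartition(scores, n - k, axis=1)[:, n - k:]  # (m, k)
    top = np.take_along_axis(scores, top_idx, axis=1)
    weights = np.exp(top - top.max(axis=1, keepdims=True))       # stable softmax
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.einsum("mk,mkd->md", weights, values[top_idx])
    return out, k


rng = np.random.default_rng(0)
n, d = 64, 16
q = rng.standard_normal((4, d))
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
out, k = topk_sparse_attention(q, keys, values, alpha=1.0, c=0.5)
print(k, out.shape)
```

With n = 64, α = 1 and C = 0.5, each query attends to 8 keys. Setting C = 1 recovers dense attention, which gives a convenient correctness check.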

Cut KV-cache accesses and speed up LLM decoding 2×–16× with post-training 'Double Sparsity'

0.70

0.60

0.80

1

If you serve large LLMs on GPUs, Double Sparsity cuts KV-cache bandwidth and GPU memory use, delivering multi-fold attention speedups and up to ~2× end-to-end throughput without retraining.

Key finding

Double Sparsity keeps accuracy nearly unchanged at a combined token+channel sparsity of 1/16.

Numbers: Llama-2-7B perplexity 5.47 → 5.76 at 1/16
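
One way to picture combining token and channel sparsity in a single decode step is sketched below. It is not the paper's implementation. The function name, the channel-selection heuristic (largest mean absolute key value), and setting both ratios to 1/16 are assumptions made for illustration.

```python
import numpy as np


def double_sparse_attention(q, keys, values, channel_idx, token_keep):
    """One decode step: rank tokens with a few channels, then attend to the top ones.

    q: (d,), keys: (n, d), values: (n, dv).
    channel_idx: channels used for the cheap score estimate (channel sparsity).
    token_keep: number of tokens that receive exact attention (token sparsity).
    """
    d = keys.shape[1]
    # Cheap estimate: read only the selected channels of every cached key.
    approx = keys[:, channel_idx] @ q[channel_idx]
    keep = np.argpartition(approx, -token_keep)[-token_keep:]
    # Exact attention, but only over the kept tokens.
    scores = keys[keep] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[keep], np.sort(keep)


rng = np.random.default_rng(0)
n, d = 256, 64
q = rng.standard_normal(d)
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))

# Hypothetical channel choice: the d/16 channels with the largest mean |key| value.
channel_idx = np.argsort(-np.abs(keys).mean(axis=0))[: d // 16]
out, kept = double_sparse_attention(q, keys, values, channel_idx, token_keep=n // 16)
print(out.shape, kept.size)
```

In this sketch the estimate touches 4 of 64 channels and exact attention touches 16 of 256 tokens. That is the kind of reduction in KV-cache reads the entry describes.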

Split FFNs into sparse experts + a teacher-guided router to cut FLOPs and adapt LLMs with tiny data

0.50

0.60

1

FactorLLM can significantly cut FFN compute and inference cost while supporting fast domain adaptation from tiny datasets, which makes task-specific LLMs cheaper and faster to deploy.

Key finding

Large FFN FLOPs can be cut heavily by activating fewer experts.

Numbers: FFN GFLOPs reduced ~75% for 1R4E1K

GTSP: prune tokens, heads, layers, and weights to cut Graph Transformer compute with little or no accuracy loss

0.60

0.50

0.70

0

GTSP can reduce Graph Transformer compute and memory by tens of percent while keeping or improving accuracy on evaluated benchmarks, enabling cheaper training and deployment on constrained hardware.

Key finding

Weight pruning (50% sparsity) can increase AUC on OGBG-HIV while cutting compute.

Numbers: ROC-AUC 0.7633 → 0.7773 (+0.014); FLOPs −30.2%

Make sparse attention pick just enough tokens at runtime with top-p pruning to speed up long-context LLMs

0.70

0.60

0.70

0

Twilight reduces memory reads and latency for long-context serving, cutting compute cost and enabling larger context use-cases without retraining models.

Key finding

Twilight can prune most redundant tokens after a conservative selection step.

Numbers: Pruned up to 98% of over-selected tokens
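
The top-p rule itself is simple to state: keep the smallest set of candidate tokens whose attention weights sum to at least p. The sketch below applies that rule to weights that are already computed. How Twilight estimates those weights cheaply is not shown here, and the function name and default p are assumptions.

```python
import numpy as np


def top_p_keep(weights, p=0.9):
    """Indices of the smallest token set whose normalized weights sum to >= p."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    order = np.argsort(-w)                 # heaviest tokens first
    cum = np.cumsum(w[order])
    # First position where the running mass reaches p, converted to a count.
    count = min(int(np.searchsorted(cum, p, side="left")) + 1, len(w))
    return np.sort(order[:count])


peaked = [0.97, 0.01, 0.01, 0.01]   # one token dominates
flat = [0.25, 0.25, 0.25, 0.25]     # mass spread evenly
mixed = [0.5, 0.3, 0.1, 0.05, 0.05]

print(top_p_keep(peaked, p=0.9))
print(top_p_keep(flat, p=0.9))
print(top_p_keep(mixed, p=0.85))
```

Unlike a fixed top-k budget, the kept set shrinks to a single token when attention is peaked and grows to all four when it is flat. This is the "just enough tokens" behavior the title refers to.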

Learnable channel permutations that reduce accuracy loss from N:M structured pruning on Transformers

0.70

0.60

0.65

0

If you deploy large Transformers under N:M structured sparsity for faster inference, learnable permutations can reduce accuracy loss with a small extra tuning cost and integrate into existing pruning pipelines.

Key finding

Learned permutations improve ViT-Base top-1 under 2:4 sparsity.

Numbers: Top-1 67.9% vs RIA 66.6% (delta +1.3)
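
A toy example shows why channel order matters under N:M pruning. The sketch uses plain magnitude pruning and a hand-picked permutation, not the learned permutations from the paper, and the helper names are illustrative.

```python
import numpy as np


def nm_mask(weights, n, m):
    """Keep the n largest-magnitude entries in each consecutive group of m columns."""
    w = np.asarray(weights)
    if w.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = np.abs(w).reshape(-1, m)
    drop = np.argsort(groups, axis=1)[:, : m - n]   # smallest m-n per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return mask.reshape(w.shape)


def retained_magnitude(weights, n, m, perm=None):
    """Total |weight| that survives N:M pruning, optionally after permuting columns."""
    w = np.asarray(weights, dtype=float)
    if perm is not None:
        w = w[:, perm]
    return float(np.abs(w * nm_mask(w, n, m)).sum())


# All large weights sit in the first group of four, so 2:4 must drop two of them.
w = np.array([[4.0, 3.0, 2.0, 1.0, 0.1, 0.1, 0.1, 0.1]])
perm = [0, 2, 4, 5, 1, 3, 6, 7]   # hand-picked: spread large weights across groups

before = retained_magnitude(w, 2, 4)
after = retained_magnitude(w, 2, 4, perm)
print(before, after)
```

Here the permutation lets all four large weights survive 2:4 pruning instead of two. In a real layer the same permutation would also have to be applied to the incoming activations so that the layer's output is unchanged.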

Fast Multipole Attention: a physics-inspired multilevel attention that cuts attention cost to O(n log n) or O(n)

0.70

0.80

0

FMA lowers GPU memory and inference latency for long text and high-resolution images, letting teams train bigger models or use longer contexts without buying more hardware.

Key finding

FMA changes attention complexity from quadratic to log-linear or linear.

Numbers: Complexity reduced from O(n^2) to O(n log n); O(n) with query downsampling

Delta Attention: fix sparse-prefill distribution drift and regain most full-attention accuracy with tiny overhead

0.70

0.50

0.80

0

Delta Attention makes long-context inference much cheaper and faster while recovering most of full-attention accuracy, lowering cloud/GPU costs and real-time latency for document- or history-heavy applications.

Key finding

Adding ∆ to sparse prefill methods gives large accuracy gains on long-context retrieval.

Numbers: average accuracy gain of +36 percentage points (paper average)

Cut KV-cache costs by predicting important tokens from RoPE frequency chunks

0.80

0.60

0.80

0

FASA cuts GPU memory needs and memory bandwidth during long-input inference with almost no accuracy loss, lowering hosting costs and enabling long-context features on smaller hardware.

Key finding

Dominant frequency chunks (FCs) are extremely sparse: a tiny fraction of FCs explains contextual attention.

Numbers: Dominant FCs ≤ 0.8% vs non-dominant ≈ 89–95% (Table 9)

Run accuracy-preserving 6:8 sparse LLMs on current GPUs and get ~1.33× inference speed with no model changes

0.75

0.55

0.65

0

SlideSparse lets teams deploy accuracy-preserving sparsity patterns (e.g., 6:8) and gain real GPU acceleration on existing hardware, reducing latency and compute cost without retraining.

Key finding

Milder structured sparsity (6:8) preserves reasoning accuracy while 2:4 destroys it.

Numbers: Qwen3 reasoning: dense 54.0% → 6:8 51.6% vs 2:4 15.3%
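
The trade-off in those numbers can be made explicit with a little arithmetic. The sketch below assumes, as an idealization, that compute scales with the fraction of weights kept, so an N:M pattern has a speedup ceiling of M/N. The names are illustrative, and the accuracy figures are the ones quoted above.

```python
DENSE_ACC = 54.0                           # Qwen3 reasoning accuracy, dense (%)
SPARSE_ACC = {"6:8": 51.6, "2:4": 15.3}    # accuracy under each N:M pattern (%)


def density(n: int, m: int) -> float:
    """Fraction of weights kept by an N:M pattern."""
    return n / m


def ideal_speedup(n: int, m: int) -> float:
    """Speedup ceiling if compute scaled exactly with the weights kept."""
    return m / n


summary = {}
for pattern, acc in SPARSE_ACC.items():
    n, m = (int(x) for x in pattern.split(":"))
    summary[pattern] = {
        "density": density(n, m),
        "ideal_speedup": ideal_speedup(n, m),
        "accuracy_retained": acc / DENSE_ACC,
    }
    print(pattern, summary[pattern])
```

Under that idealization 6:8 has a ceiling of 8/6 ≈ 1.33×, which matches the speedup in the title, and it keeps about 96% of the dense accuracy. 2:4 has a 2× ceiling but keeps only about 28%.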

Fine-grained Mixture-of-Experts (G=8) cuts training steps and improves accuracy at 56B scale

0.60

0.70

0

Fine-grained MoE can match or beat more expensive MoE variants while lowering training steps and inference-active compute, so you can get higher-quality models with less compute or cheaper inference at the 50B scale.

Key finding

Fine-grained MoE (G=8) lowers validation loss and raises average benchmark scores versus standard MoE at large scale.

Numbers: 56B: Avg accuracy 1xG1=57.3 → 1xG8=59.0; Valid loss 1.811 → 1.779