MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% training savings

May 7, 2024 · 10 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.8

Citation Count

97

Authors

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Liu, Xin Xie, Xingkai Yu, Xinnan Song, Xinyi Zhou, Xinyu Yang, Xuan Lu, Xuecheng Su, Y. Wu, Y. K. Li, Y. X. Wei, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Zheng, Yichao Zhang, Yiliang Xiong, Yilong Zhao, Ying He, Ying Tang, Yishi Piao, Yixin Dong, Yixuan Tan, Yiyuan Liu, Yongji Wang, Yongqiang Guo, Yuchen Zhu, Yuduan Wang, Yuheng Zou, Yukun Zha, Yunxian Ma, Yuting Yan, Yuxiang You, Yuxuan Liu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhewen Hao, Zhihong Shao, Zhiniu Wen, Zhipeng Xu, Zhongyu Zhang, Zhuoshu Li, Zihan Wang, Zihui Gu, Zilin Li, Ziwei Xie

Links

Abstract / PDF

Why It Matters For Business

DeepSeek-V2 shows that a model with very large total capacity (236B parameters) can activate only ~21B parameters per token, cutting both training GPU-hours and inference memory. That lowers operating cost and lets you serve longer contexts or larger batches on the same hardware.

Summary TLDR

DeepSeek-V2 is a 236B-parameter Mixture-of-Experts (MoE) language model that activates 21B parameters per token and supports a 128K-token context. Two architectural changes, Multi-head Latent Attention (MLA) and DeepSeekMoE, shrink the inference KV cache, lower training cost, and raise deployed throughput. Compared with their prior 67B dense model, the authors report 42.5% fewer GPU-hours per trillion training tokens, a 93.3% smaller KV cache, and up to 5.76× higher maximum generation throughput. Evaluation shows top-tier open-source performance across many English and Chinese benchmarks. Checkpoints are published by the authors.

Problem Statement

Large dense LLMs get better with scale but are costly to train and slow to serve because of heavy per-token key/value (KV) caches and dense computation. The paper aims to keep model quality while cutting training cost and inference memory/latency.

Main Contribution

Multi-head Latent Attention (MLA): compresses keys and values into a small latent vector to cut KV cache size dramatically while maintaining or improving accuracy.

DeepSeekMoE: a fine-grained MoE design with shared experts, device-limited routing, auxiliary load-balance losses, and token-dropping to train large models economically.

Engineering stack for training and deployment: FP8 parameters and KV-cache quantization (~6 bits per element on average) for serving, optimized FlashAttention-2 kernels, and the HAI-LLM training framework.

Long-context support to 128K tokens using YaRN adaptation, plus SFT and RL (GRPO) alignment pipeline with released checkpoints.
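The MLA idea listed above (cache one small latent per token, rebuild keys and values from it) can be illustrated with a minimal NumPy sketch. The sizes and the weight names `W_dkv`, `W_uk`, and `W_uv` are assumptions chosen for illustration, not the DeepSeek-V2 configuration, and the sketch leaves out positional-encoding handling and the attention computation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not the DeepSeek-V2 configuration).
d_model, n_heads, head_dim, d_latent = 64, 4, 16, 8

# Hypothetical weights: W_dkv compresses, W_uk / W_uv expand.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.standard_normal((d_latent, n_heads * head_dim)) / np.sqrt(d_latent)
W_uv = rng.standard_normal((d_latent, n_heads * head_dim)) / np.sqrt(d_latent)


def compress(h):
    """Project hidden states (seq, d_model) to the latent that is cached."""
    return h @ W_dkv


def expand(c_kv):
    """Rebuild per-head keys and values from the cached latent."""
    seq = c_kv.shape[0]
    k = (c_kv @ W_uk).reshape(seq, n_heads, head_dim)
    v = (c_kv @ W_uv).reshape(seq, n_heads, head_dim)
    return k, v


h = rng.standard_normal((10, d_model))  # hidden states for 10 tokens
c_kv = compress(h)                      # the only tensor the cache stores
k, v = expand(c_kv)

# Elements cached per token and per layer in each scheme.
mha_cache_per_token = 2 * n_heads * head_dim  # full keys plus full values
mla_cache_per_token = d_latent                # shared latent only

print(c_kv.shape, k.shape, v.shape)
print(mha_cache_per_token, mla_cache_per_token)
```

In this toy setting the cache holds 8 numbers per token instead of 128; the real saving depends on the latent width a given model uses.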

Key Findings

DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.

Numbers: 236B total / 21B activated params

Training cost per trillion tokens dropped substantially versus the previous dense model.

Numbers: 172.8K GPU·hrs vs 300.6K GPU·hrs (−42.5%) per 1T tokens
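The 42.5% figure follows directly from the two GPU-hour numbers. A short check, using only the values quoted above (the helper name `relative_saving` is illustrative):

```python
def relative_saving(new_cost, baseline_cost):
    """Fraction of the baseline cost that is saved by the new system."""
    return 1 - new_cost / baseline_cost


v2_gpu_hours = 172.8e3     # DeepSeek-V2, GPU-hours per 1T tokens
dense_gpu_hours = 300.6e3  # DeepSeek 67B, GPU-hours per 1T tokens

saving = relative_saving(v2_gpu_hours, dense_gpu_hours)
print(f"{saving:.1%}")  # 42.5%
```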

Inference KV cache size and served throughput improved dramatically through MLA and quantization.

Numbers: KV cache reduced by 93.3%; throughput >50K tokens/s; 5.76× max throughput

Open-source benchmark performance is top-tier despite only 21B active params.

Numbers: MMLU ≈78.5% (Table 2); AlpacaEval 2.0 length-controlled win rate 38.9%; MT-Bench 8.97

MLA achieves much smaller KV cache than standard MHA while matching or exceeding accuracy.

Numbers: MLA KV cache ≈15.6K–34.6K elements per token vs MHA ≈110–860K elements in ablations (small/large MoE), i.e. roughly 14% and 4% of the MHA cache
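Per-token cache sizes like these come from simple counting: standard MHA stores a key and a value for every head in every layer, while MLA stores one latent vector (plus a small positional key) per layer. The sketch below shows the arithmetic. The layer counts and dimensions are assumptions recalled from the paper's model descriptions, not values stated in this summary, so treat the printed counts as a sanity check rather than a reproduction of the ablation table.

```python
def mha_kv_elements(n_layers, n_heads, head_dim):
    """Cached elements per token for standard multi-head attention."""
    return 2 * n_heads * head_dim * n_layers


def mla_kv_elements(n_layers, latent_dim, rope_dim):
    """Cached elements per token for MLA: one latent plus a positional key."""
    return (latent_dim + rope_dim) * n_layers


# Assumed dimensions (not given in this summary).
small = mla_kv_elements(n_layers=27, latent_dim=512, rope_dim=64)
large = mla_kv_elements(n_layers=60, latent_dim=512, rope_dim=64)

# A hypothetical MHA layout with 128 heads of width 128, for comparison.
mha_large = mha_kv_elements(n_layers=60, n_heads=128, head_dim=128)

print(small, large, mha_large)  # 15552 34560 1966080
```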

Results

Total parameters
Value: 236B
Baseline (DeepSeek 67B): 67B

Activated parameters per token
Value: 21B
Baseline (DeepSeek 67B): 67B

Training GPU-hours per 1T tokens
Value: 172.8K GPU·hrs
Baseline (DeepSeek 67B): 300.6K GPU·hrs

KV cache reduction (authors' claim)
Value: 93.3% smaller KV cache vs DeepSeek 67B
Baseline: DeepSeek 67B

Deployed max generation throughput
Value: >50K tokens/s on 8×H800 GPUs
Baseline: DeepSeek 67B max throughput

MMLU (5-shot)
Value: 78.5% (base)
Baseline (DeepSeek 67B): 71.3%

AlpacaEval 2.0 length-controlled win rate (chat, RL)
Value: 38.9%
Baseline (DeepSeek 67B Chat): 16.6%

MT-Bench overall score (chat, RL)
Value: 8.97
Baseline (DeepSeek 67B Chat): 8.35

Pile-test (BPB)
Value: 0.606
Baseline (DeepSeek 67B): 0.642

Who Should Care

What To Try In 7 Days

Prototype MLA in your inference stack to measure KV-cache reduction and memory savings.

Test FP8 + KV-cache quantization on a small model to validate throughput gains and numeric stability.

If you train MoE layers, pilot device-limited routing, expert-balance losses, and token-dropping on a small MoE to measure communication overheads.

Agent Features

Memory

  • Supports long context (128K tokens)
  • KV cache reduced by MLA; uses KV quantization

Frameworks

  • HAI-LLM
  • vLLM (inference backend)

Architectures

  • Transformer
  • MoE
  • Multi-head Latent Attention (MLA)

Optimization Features

Token Efficiency

  • Activated params per token: 21B vs 236B total
  • KV cache per token reduced to a small fraction of MHA

Infra Optimization

  • H800 GPU cluster (8× per node) with NVLink/NVSwitch and InfiniBand
  • Custom communication kernels and routing algorithms

Model Optimization

  • MLA: low-rank joint compression of keys/values to shrink KV cache
  • DeepSeekMoE: fine-grained experts + shared-expert isolation
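To make "fine-grained experts + shared-expert isolation" concrete, here is a minimal single-token sketch: a few shared experts always run, and a gate picks the top-k routed experts by softmax score. The sizes, the one-matrix "experts", and names such as `centroids` and `moe_layer` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_shared, n_routed, top_k = 16, 2, 8, 3  # illustrative sizes (assumed)


def make_expert():
    """A toy expert: one weight matrix followed by tanh."""
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda x: np.tanh(x @ W)


shared = [make_expert() for _ in range(n_shared)]  # always active
routed = [make_expert() for _ in range(n_routed)]  # chosen per token
centroids = rng.standard_normal((n_routed, d))     # one gate vector per routed expert


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def moe_layer(u):
    """Residual + all shared experts + gated top-k routed experts."""
    scores = softmax(centroids @ u)
    chosen = np.argsort(scores)[-top_k:]  # indices of the k highest scores
    out = u.copy()
    for expert in shared:
        out = out + expert(u)
    for i in chosen:
        out = out + scores[i] * routed[i](u)
    return out, chosen


u = rng.standard_normal(d)
out, chosen = moe_layer(u)
print(out.shape, sorted(chosen.tolist()))
```

Only `n_shared + top_k` of the ten toy experts run for this token, which is the mechanism behind the gap between total and activated parameters.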

System Optimization

  • Overlap shared-expert computation with all-to-all communication
  • Hybrid engine using different parallel strategies for train and inference

Training Optimization

  • Expert parallelism with device-limited routing (each token's experts span at most M = 3 devices)
  • Auxiliary balance losses for expert/device/communication
  • Token-dropping during training to respect device budgets
  • ZeRO-1, pipeline parallelism, custom CUDA kernels
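Device-limited routing can be sketched as a two-stage selection: first keep the devices whose best expert scores highest, then take the top-k experts among those devices only. The function below is a minimal sketch under assumed conventions (experts laid out contiguously per device; the names `device_limited_topk`, `experts_per_device`, and `max_devices` are hypothetical), not the authors' kernel.

```python
import numpy as np


def device_limited_topk(scores, experts_per_device, max_devices, top_k):
    """Pick top_k experts for one token, touching at most max_devices devices."""
    n_experts = scores.shape[0]
    if top_k > max_devices * experts_per_device:
        raise ValueError("top_k exceeds the experts available on allowed devices")
    per_device = scores.reshape(n_experts // experts_per_device, experts_per_device)

    # Stage 1: rank devices by the best affinity score among their experts.
    best_per_device = per_device.max(axis=1)
    allowed = np.argsort(best_per_device)[-max_devices:]

    # Stage 2: mask out experts on other devices, then take the usual top-k.
    masked = np.full(n_experts, -np.inf)
    for dev in allowed:
        lo = dev * experts_per_device
        masked[lo:lo + experts_per_device] = scores[lo:lo + experts_per_device]
    chosen = np.argsort(masked)[-top_k:]
    return np.sort(chosen), np.sort(allowed)


# 8 experts on 4 devices (2 per device); route to at most 2 devices.
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.5, 0.4])
chosen, devices = device_limited_topk(scores, experts_per_device=2,
                                      max_devices=2, top_k=3)
print(chosen.tolist(), devices.tolist())  # [0, 2, 3] [0, 1]
```

Unrestricted top-3 would pick experts 0, 2, and 4, which live on three devices; the device limit swaps expert 4 for expert 3 so that the token only needs to be sent to two devices.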

Inference Optimization

  • FP8 model parameters
  • KV-cache quantization to ~6 bits average
  • MLA folds the key/value up-projections into the query and output projections, so full keys and values need not be recomputed during generation
  • Optimized FlashAttention-2 kernels
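As a rough way to see what about 6 bits per cached element costs in accuracy, the sketch below applies plain uniform quantization per row and measures the round-trip error. This is a generic scheme assumed for illustration; the summary does not describe the authors' actual quantizer, and the codes here sit in `uint8` containers rather than being bit-packed.

```python
import numpy as np


def quantize(x, bits=6):
    """Uniform per-row quantization to 2**bits levels (stored in uint8)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # avoid divide-by-zero
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo


def dequantize(q, scale, lo):
    """Map integer codes back to approximate float values."""
    return q.astype(np.float32) * scale + lo


rng = np.random.default_rng(0)
cache = rng.standard_normal((4, 64)).astype(np.float32)  # toy KV-cache slice

q, scale, lo = quantize(cache, bits=6)
restored = dequantize(q, scale, lo)
max_err = float(np.abs(restored - cache).max())

print(int(q.max()), max_err)
```

The worst-case error of this scheme is half a quantization step per row, so checking `max_err` against `scale.max() / 2` is a quick stability test before trying it on a real model.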

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model knowledge is static after pretraining; no continuous updates.
  • May produce hallucinations or non-factual information like other LLMs.
  • Primarily trained on Chinese and English; other languages may be weaker.
  • MoE adds routing and communication complexity; benefits depend on infra and implementation.

When Not To Use

  • When you need a very small single-device dense model for ultra-low-latency single-GPU inference.
  • When you're targeting languages beyond Chinese and English without additional data.
  • If you cannot support MoE routing and all-to-all communication in your infra.

Failure Modes

  • Routing collapse or unbalanced experts causing wasted compute and degraded accuracy.
  • Alignment tax: RL alignment may reduce performance on some standard benchmarks.
  • Quantization or aggressive compression could harm numeric stability in edge cases.
  • Hallucination and factual errors remain possible in open-ended generation.

Core Entities

Models

  • DeepSeek-V2
  • DeepSeek-V2 Chat (SFT)
  • DeepSeek-V2 Chat (RL)
  • DeepSeek-V2-Lite
  • DeepSeek 67B

Metrics

  • Accuracy
  • Exact Match (EM)
  • Pass@1
  • Bits-Per-Byte (BPB)
  • Generation throughput (tokens/sec)
  • GPU-hours per trillion tokens

Datasets

  • internal pretraining corpus (8.1T tokens)
  • SFT dataset

Benchmarks

  • MMLU
  • C-Eval
  • CMMLU
  • BBH
  • GSM8K
  • MATH
  • HumanEval
  • MBPP
  • CRUXEval
  • TriviaQA
  • NaturalQuestions
  • AlignBench
  • MT-Bench
  • AlpacaEval 2.0
  • Needle In A Haystack (NIAH)

Context Entities

Models

  • Qwen1.5 72B
  • LLaMA3 70B
  • Mixtral 8x22B
  • GPT-4 (for human/AI comparisons)
  • ERNIEBot-4.0

Metrics

  • AlpacaEval length-controlled win rate
  • MT-Bench overall score
  • AlignBench overall score

Datasets

  • Pile-test
  • SC-Math6
  • LiveCodeBench

Benchmarks

  • AGIEval
  • HellaSwag
  • PIQA
  • RACE
  • DROP
  • CLUEWSC
  • CHID
  • CCPM