MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% training savings
DeepSeek-V2 shows that a model with very large total capacity can activate only about 21B parameters per token. Sparse activation through DeepSeekMoE cuts training GPU-hours, and MLA (Multi-head Latent Attention) shrinks the inference-time KV cache. Together these lower operational cost and let you serve longer contexts or larger batches on the same hardware.
Key finding
DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.
Numbers: 236B total / 21B activated params
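To put the two numbers in proportion, here is a minimal Python sketch of the arithmetic. It uses only the 236B and 21B figures quoted above; the function and variable names are illustrative, not taken from the DeepSeek-V2 code.

```python
# Parameter counts in billions, as quoted in the summary above.
TOTAL_PARAMS_B = 236.0
ACTIVE_PARAMS_B = 21.0


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token (hypothetical helper)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
ratio = TOTAL_PARAMS_B / ACTIVE_PARAMS_B

print(f"Active share per token: {fraction:.1%}")   # about 8.9%
print(f"Total-to-active ratio:  {ratio:.1f}x")     # about 11.2x
```

In other words, each token touches roughly one eleventh of the model's weights. This is only a ratio of parameter counts; it says nothing by itself about wall-clock speed, which also depends on routing overhead and memory traffic.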

