Monarch Mixer: replace attention and MLPs with sub-quadratic GEMM-friendly layers to speed long-context models
If you run models with long contexts or want lower parameter cost, M2 can cut compute or model size and improve throughput on many GPUs while keeping accuracy; expect implementation and kernel work before production parity on all hardware.
Key finding
M2-BERT matches BERT-base GLUE while cutting parameters.
Numbers: GLUE 79.9 vs 79.6; −27% params (M2 80M vs BERT 110M)

