MiniMax Sparse Attention cuts 1M-token attention compute by 28.4x
Researchers from MiniMax introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism that reduces per-token attention compute by 28.4x at 1M context while matching standard GQA quality on a 109B-parameter multimodal model.
MSA demonstrates that a 109B-parameter model can process 1M-token contexts with 28.4x less attention compute and 14.2x faster prefill, making million-token agentic and code-reasoning workloads substantially more feasible at deployment scale.
Xunhao Lai, Weiqi Xu, and Yufeng Yang present MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism that targets the quadratic compute cost of standard softmax attention. That cost makes dense attention impractical at the context lengths of hundreds of thousands to millions of tokens required by agentic workflows, repository-scale code reasoning, and persistent memory applications. MSA is built on Grouped Query Attention (GQA) and introduces two components: a lightweight Index Branch that scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval; and a Main Branch that performs exact block-sparse attention over only those selected blocks. The design is deliberately kept simple and scalable, with the goal of efficient deployment across a broad range of GPUs.
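The two-branch structure described above can be illustrated with a small NumPy sketch of a single decoding step. This is not the authors' implementation. The block scoring rule (mean-pooled index keys dotted with an index query), the function name `block_sparse_gqa_decode`, and the parameters `idx_q`, `idx_k`, `block_size`, and `top_k` are all assumptions made for illustration; the article does not specify how the Index Branch computes its scores.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_gqa_decode(q, k, v, idx_q, idx_k, block_size, top_k):
    """One decoding step of a hypothetical blockwise sparse GQA layer.

    q:     (G, H, D)   G GQA groups, H query heads per group
    k, v:  (G, N, D)   one shared KV head per group, N cached tokens
    idx_q: (G, Di)     assumed index-branch query, one per group
    idx_k: (G, N, Di)  assumed index-branch keys
    N is assumed to be a multiple of block_size.
    """
    G, H, D = q.shape
    N = k.shape[1]
    n_blocks = N // block_size
    out = np.empty((G, H, D))
    selected = []
    for g in range(G):
        # Index branch (assumed scoring rule): pool index keys per block,
        # then score each block against this group's index query.
        pooled = idx_k[g].reshape(n_blocks, block_size, -1).mean(axis=1)
        scores = pooled @ idx_q[g]                       # (n_blocks,)
        # Each group picks its own Top-k blocks independently.
        top = np.sort(np.argpartition(scores, -top_k)[-top_k:])
        selected.append(top)
        # Main branch: exact softmax attention over the selected blocks only.
        tok = (top[:, None] * block_size + np.arange(block_size)).ravel()
        logits = q[g] @ k[g, tok].T / np.sqrt(D)         # (H, top_k * block_size)
        out[g] = softmax(logits) @ v[g, tok]
    return out, selected
```

In this sketch each query head attends to `top_k * block_size` keys instead of all `N`, which is where the compute saving comes from. Setting `top_k` to the total number of blocks reproduces dense GQA attention exactly, which is a convenient sanity check.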
To convert sparsity into real-world throughput gains, the authors co-design MSA with a custom GPU execution path featuring exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access patterns. Evaluated on a 109B-parameter model with native multimodal training, MSA matches GQA quality while reducing per-token attention compute by 28.4x at 1M context. The co-designed kernel achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800 GPUs. The inference kernel is open-sourced at the MiniMax-AI GitHub repository, and a production-grade natively multimodal model powered by MSA has been released on Hugging Face as MiniMax-M3.
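The article does not spell out what "exp-free" Top-k selection means. One plausible reading, offered here only as an assumption, is that block selection can skip the softmax exponential: softmax is strictly increasing in its inputs, so ranking raw scores picks the same blocks as ranking normalized probabilities. The sketch below checks this on made-up scores; the helper `topk_indices` and the sizes used are hypothetical.

```python
import numpy as np

def topk_indices(x, k):
    # Indices of the k largest entries, returned in ascending index order.
    return np.sort(np.argpartition(x, -k)[-k:])

rng = np.random.default_rng(0)
scores = rng.normal(size=128)            # made-up raw block scores

# Softmax over the same scores (this is the step selection can skip).
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Selecting on raw scores and on softmax probabilities gives the same set.
same = np.array_equal(topk_indices(scores, 16), topk_indices(probs, 16))
```

Under this reading, the exponential is needed only in the Main Branch, where exact attention weights are computed over the blocks that were kept.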
Key facts
- MSA is a blockwise sparse attention method built on Grouped Query Attention (GQA), introduced by Xunhao Lai, Weiqi Xu, and Yufeng Yang.
- An Index Branch scores KV blocks and selects a Top-k subset per GQA group; the Main Branch performs exact attention only over selected blocks.
- On a 109B-parameter natively multimodal model, MSA matches GQA quality while reducing per-token attention compute by 28.4x at 1M context.
- A co-designed GPU kernel using exp-free Top-k selection and KV-outer sparse attention achieves 14.2x prefill speedup on H800.
- Decoding wall-clock speedup on H800 is 7.6x.
- The inference kernel is open-sourced at https://github.com/MiniMax-AI/MSA.
- A production-grade multimodal model powered by MSA, MiniMax-M3, has been released on Hugging Face.
Summary and scoring are generated automatically from the original article. Last processed Jun 12, 2026, 10:05 UTC.