vLLM v0.23.0 ships DeepSeek-V4 hardening and Model Runner V2 expansion
vLLM v0.23.0 lands 408 commits from 200 contributors, bringing major DeepSeek-V4 optimizations, Model Runner V2 defaults for Llama and Mistral, a maturing Rust frontend, Transformers v5 compatibility, and multi-tier KV cache offloading.
The release makes Model Runner V2 the default for two of the most widely deployed model families (Llama and Mistral), bringing its performance improvements — including pipeline-parallel bubble elimination and breakable CUDA graphs — to a much broader set of deployments.
vLLM v0.23.0 is a broad release built from 408 commits by 200 contributors (63 of them new). The headline engine change is the continued maturation of DeepSeek-V4: its sparse MLA metadata is now decoupled from DeepSeek-V3.2, it gained a TRTLLM-gen attention kernel, EPLB support for the Mega-MoE, selective prefix-cache retention for sliding-window KV cache, and an index-share feature for DSA MTP. The model was also detached from `torch.compile`, its attention and RoPE paths were refactored, and an XPU attention decode path was added.
Model Runner V2 (MRv2) is now selected by default for Llama and Mistral dense models, joining Qwen3. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, kernel block-size support for hybrid models, and Gemma 4 MTP. The experimental Rust frontend grew substantially, adding a streaming `generate` endpoint, dynamic LoRA endpoints, `/version` and `/server_info` endpoints, a server-router extension hook, request-ID headers, and new tool parsers for InternLM2, hy_v3, Phi-4-mini, and Gemma4.
Additional highlights include encoder-free Gemma 4 Unified support, Transformers v5 compatibility with vendored MiniCPM-V/O processors, and a multi-tier KV cache offloading framework that gained an object-store secondary tier and now enables HMA by default for capable connectors. New models added in this release include Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code. Reasoning and tool-call parsing are now unified behind a single `Parser.parse()` interface. Note that Minimax M3 is not yet supported in this version.
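To make the idea of a single parsing entry point concrete, here is a minimal sketch of what a unified `parse()` call could look like. It is not vLLM's implementation: the `ParsedOutput` type, the `<think>` and `<tool_call>` tag formats, and the method signature are all assumptions chosen for illustration, since the summary names only `Parser.parse()` itself.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class ParsedOutput:
    """Hypothetical result type; not vLLM's actual return value."""
    reasoning: str = ""
    content: str = ""
    tool_calls: list = field(default_factory=list)


class Parser:
    """Toy unified parser. The tag formats below are assumptions."""

    _THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    _TOOL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def parse(self, text: str) -> ParsedOutput:
        # Collect reasoning spans, then strip them from the text.
        reasoning = "\n".join(m.strip() for m in self._THINK.findall(text))
        remainder = self._THINK.sub("", text)

        # Collect tool calls as JSON objects, skipping malformed payloads.
        tool_calls = []
        for payload in self._TOOL.findall(remainder):
            try:
                tool_calls.append(json.loads(payload))
            except json.JSONDecodeError:
                continue

        # Whatever is left after removing both kinds of span is user-facing content.
        content = self._TOOL.sub("", remainder).strip()
        return ParsedOutput(reasoning, content, tool_calls)


raw = (
    "<think>User wants weather.</think>"
    "Checking now."
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
result = Parser().parse(raw)
print(result.reasoning)
print(result.content)
print(result.tool_calls)
```

The point of the sketch is the shape of the interface: one call returns reasoning, visible content, and tool calls together, so callers do not have to chain a reasoning parser and a tool-call parser in a particular order.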
Key facts
- 408 commits from 200 contributors (63 new) are included in v0.23.0.
- DeepSeek-V4's sparse MLA metadata is decoupled from DeepSeek-V3.2 and the model is detached from `torch.compile`.
- Model Runner V2 is now the default for Llama and Mistral dense models, in addition to Qwen3.
- The Rust frontend added a streaming `generate` endpoint, dynamic LoRA endpoints, and new tool parsers.
- Multi-tier KV cache offloading gained an object-store secondary tier and HMA enabled by default for capable connectors.
- New models include Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code.
- Minimax M3 is explicitly not supported in this version.
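The multi-tier offloading idea mentioned above can be illustrated with a toy two-tier store. This is a conceptual sketch only: the class name `TieredKVCache`, its methods, and the LRU eviction policy are assumptions, and the summary does not describe how vLLM's framework or its connectors are actually structured.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy two-tier store: a small LRU primary tier that spills to a secondary tier."""

    def __init__(self, primary_capacity: int):
        self.primary_capacity = primary_capacity
        self.primary = OrderedDict()  # stands in for a fast tier
        self.secondary = {}           # stands in for a slower tier, e.g. an object store

    def put(self, block_id, block):
        # Drop any stale copy held in the slower tier.
        self.secondary.pop(block_id, None)
        self.primary[block_id] = block
        self.primary.move_to_end(block_id)
        while len(self.primary) > self.primary_capacity:
            # Offload the least recently used block instead of discarding it.
            old_id, old_block = self.primary.popitem(last=False)
            self.secondary[old_id] = old_block

    def get(self, block_id):
        if block_id in self.primary:
            self.primary.move_to_end(block_id)
            return self.primary[block_id]
        if block_id in self.secondary:
            # Promote on a secondary hit so the block is fast to reach next time.
            block = self.secondary.pop(block_id)
            self.put(block_id, block)
            return block
        return None


cache = TieredKVCache(primary_capacity=2)
for i in range(3):
    cache.put(f"block-{i}", f"kv-{i}")

print(list(cache.primary))    # blocks still in the fast tier
print(list(cache.secondary))  # blocks offloaded to the slow tier
print(cache.get("block-0"))   # served from the slow tier, then promoted
```

A block evicted from the fast tier remains retrievable from the slower one, which is the basic trade a secondary tier offers: more reusable cache at the cost of a slower hit.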
Summary and scoring are generated automatically from the original article. Last processed Jun 14, 2026 · 09:08 UTC.