★ Rank 03 today · NEW · Jun 12, 2026 · 1 min read · New Models & Releases

vLLM v0.23.0 ships DeepSeek-V4 hardening and Model Runner V2 expansion

vLLM v0.23.0 lands 408 commits from 200 contributors, bringing major DeepSeek-V4 optimizations, Model Runner V2 defaults for Llama and Mistral, a maturing Rust frontend, Transformers v5 compatibility, and multi-tier KV cache offloading.

GitHub: vllm-project/vllm·khluu

Composite score: 7.2 out of 10 (rank 03)

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The release makes Model Runner V2 the default for Llama and Mistral dense models, two of the most widely deployed model families. That brings its performance improvements, including pipeline-parallel bubble elimination and breakable CUDA graphs, to a much broader set of deployments.

Summary: our read of the original

vLLM v0.23.0 is a broad release built from 408 commits by 200 contributors (63 of them new). The headline engine change is the continued maturation of DeepSeek-V4: its sparse MLA metadata is now decoupled from DeepSeek-V3.2, and it gained a TRTLLM-gen attention kernel, EPLB support for the Mega-MoE, selective prefix-cache retention for the sliding-window KV cache, and an index-share feature for DSA MTP. The model was also detached from `torch.compile`, its attention and RoPE paths were refactored, and an XPU attention decode path was added.

Model Runner V2 (MRv2) is now selected by default for Llama and Mistral dense models, joining Qwen3. It gained a FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, kernel block-size support for hybrid models, and Gemma 4 MTP. The experimental Rust frontend grew substantially, adding a streaming `generate` endpoint, dynamic LoRA endpoints, `/version` and `/server_info` endpoints, a server-router extension hook, request-ID headers, and new tool parsers for InternLM2, hy_v3, Phi-4-mini, and Gemma4.
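For orientation, the two informational endpoints named above could be probed with a few lines of standard-library Python. This is only a minimal sketch. The base URL and port, the helper names `build_request` and `fetch_json`, and the assumption that both routes answer plain GET requests with a JSON body are illustrative guesses, not details confirmed by the release notes.

```python
import json
import urllib.request

# Assumption: the server listens locally on port 8000; adjust as needed.
BASE_URL = "http://localhost:8000"


def build_request(path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for an informational endpoint such as /version."""
    # Normalise slashes so "version" and "/version" produce the same URL.
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """Send the request and decode the body as JSON (assumed response format)."""
    request = build_request(path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running server, `fetch_json("/version")` and `fetch_json("/server_info")` would return whatever each route reports. The shape of those responses is not described in the summary, so the sketch makes no assumptions about their fields.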

Additional highlights include encoder-free Gemma 4 Unified support, Transformers v5 compatibility with vendored MiniCPM-V/O processors, and a multi-tier KV cache offloading framework that gained an object-store secondary tier and now enables HMA by default for capable connectors. New models added in this release include Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code. Reasoning and tool-call parsing are now unified behind a single `Parser.parse()` interface. Note that Minimax M3 is not yet supported in this version.
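To make the idea of a secondary offloading tier concrete, here is a toy two-tier cache in Python. It is a conceptual sketch only, not vLLM's implementation. The class name `TwoTierCache`, the LRU eviction policy, and the plain dict standing in for an object store are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy two-tier cache: a small LRU primary tier backed by a secondary tier.

    The secondary tier is a plain dict standing in for an object store
    (hypothetical; a real store would involve network I/O and serialisation).
    """

    def __init__(self, primary_capacity: int):
        if primary_capacity < 1:
            raise ValueError("primary_capacity must be at least 1")
        self.primary_capacity = primary_capacity
        self.primary = OrderedDict()  # fast tier, least recently used first
        self.secondary = {}           # slow tier

    def _evict(self):
        # Offload the least recently used blocks instead of dropping them.
        while len(self.primary) > self.primary_capacity:
            block_id, data = self.primary.popitem(last=False)
            self.secondary[block_id] = data

    def put(self, block_id, data):
        self.secondary.pop(block_id, None)  # discard any stale offloaded copy
        self.primary[block_id] = data
        self.primary.move_to_end(block_id)  # mark as most recently used
        self._evict()

    def get(self, block_id):
        if block_id in self.primary:
            self.primary.move_to_end(block_id)
            return self.primary[block_id]
        if block_id in self.secondary:
            # Promote the block back into the fast tier on a secondary hit.
            data = self.secondary.pop(block_id)
            self.primary[block_id] = data
            self._evict()
            return data
        return None  # miss in both tiers
```

The design point the sketch illustrates is that eviction from the fast tier becomes a move rather than a loss, so a later lookup pays a slower fetch instead of a full recomputation.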

Key facts

01 408 commits from 200 contributors (63 new) are included in v0.23.0.
02 DeepSeek-V4's sparse MLA metadata is decoupled from DeepSeek-V3.2 and the model is detached from `torch.compile`.
03 Model Runner V2 is now the default for Llama and Mistral dense models, in addition to Qwen3.
04 The Rust frontend added a streaming `generate` endpoint, dynamic LoRA endpoints, and new tool parsers.
05 Multi-tier KV cache offloading gained an object-store secondary tier and HMA enabled by default for capable connectors.
06 New models include Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code.
07 Minimax M3 is explicitly not supported in this version.

Topics

#model-release #infrastructure #open-source #llm-serving #optimization

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 14, 2026 · 09:08 UTC.
