MiniPIC brings position-independent KV cache reuse to vLLM in under 100 lines
MiniPIC is a minimal Position-Independent Caching design for vLLM that reuses KV cache entries for recurring structured inputs regardless of prefix position, implemented in fewer than 100 lines of core-engine changes.
MiniPIC removes the requirement for identical prefixes to reuse KV cache entries, enabling efficient caching of recurring structured inputs in retrieval-augmented and agentic workloads without the large server-side code changes or host-to-device transfer overhead of prior PIC approaches.
Standard prefix caching in inference engines like vLLM requires requests to share identical prefixes before KV entries can be reused — a constraint that breaks down for retrieval-augmented and agentic workloads where the same documents or code files appear at varying positions across requests. Existing Position-Independent Caching (PIC) approaches either demand substantial server-side code changes or store KV state outside the server, introducing host-to-device transfer overhead. MiniPIC addresses both problems with a deliberately minimal design: it stores unrotated K vectors in the KV cache and applies RoPE to K tiles inside attention using per-request logical positions, decoupling positional encoding from cache storage.
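The decoupling described above rests on a property of RoPE: the query-key dot product depends only on the relative distance between the two positions. The following is a minimal sketch of that idea in plain Python, not MiniPIC's actual attention backend. The helper names (`rope`, `attention_scores`), the toy vectors, and the offsets are all illustrative assumptions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Standard RoPE: rotate each pair (vec[i], vec[i+1]) by pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical cache contents: UNROTATED keys for one reusable span
# (for example, a retrieved document). No position is baked in.
cached_span_k = [[0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]]

def attention_scores(q, q_pos, span_k, span_start):
    """Rotate cached keys at attention time using this request's logical positions."""
    q_rot = rope(q, q_pos)
    return [dot(q_rot, rope(k, span_start + j)) for j, k in enumerate(span_k)]

q = [0.2, 0.8, -0.5, 1.0]
# The same cached span sits at offset 0 in request A and offset 37 in request B.
scores_a = attention_scores(q, q_pos=5, span_k=cached_span_k, span_start=0)
scores_b = attention_scores(q, q_pos=42, span_k=cached_span_k, span_start=37)

# Relative distances match (5 and 4 in both requests), so the scores match.
print(all(abs(a - b) < 1e-9 for a, b in zip(scores_a, scores_b)))
```

Because the stored keys carry no rotation, the same cache entry can serve both requests. A cache that stored keys already rotated for offset 0 would give wrong scores when the span is reused at offset 37.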
On top of this positional-encoding-free KV cache, MiniPIC exposes three user-facing, token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep). These modify hashing behavior and the effective block-level causal attention structure, and together they are sufficient to implement multiple PIC methods (Block-Attention, EPIC, and Prompt Cache) within the same running vLLM instance. The entire core-engine modification amounts to fewer than 100 lines of code plus a custom attention backend, and the design integrates natively with KV cache CPU offload implementations.
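To make the hashing side concrete, here is one way padding and a span separator could change block hashing. The summary does not specify MiniPIC's exact semantics, so this is a sketch under stated assumptions: the token ids `PAD` and `SSEP`, the block size, and the rule "a block that starts with SSEP resets the hash chain" are all hypothetical. PDep and the attention-mask changes are not modeled.

```python
import hashlib

BLOCK = 4            # tokens per KV block (illustrative; real engines use larger blocks)
PAD, SSEP = -1, -2   # hypothetical reserved token ids

def pad_to_block(tokens):
    """Block-aligned padding: extend a span so it ends on a block boundary."""
    return tokens + [PAD] * (-len(tokens) % BLOCK)

def build_prompt(spans):
    """Start every span with SSEP and pad it, so spans begin at block boundaries."""
    out = []
    for span in spans:
        out += pad_to_block([SSEP] + span)
    return out

def block_hashes(tokens, reset_on_ssep=True):
    """Chained block hashes, as in prefix caching. With reset_on_ssep (assumed
    behavior), a block starting with SSEP forgets everything before it."""
    hashes, parent = [], b""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = tokens[i:i + BLOCK]
        if reset_on_ssep and block[0] == SSEP:
            parent = b""
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes

doc = [11, 12, 13, 14, 15]                       # shared document span (2 blocks)
req_a = build_prompt([[1, 2, 3], doc])           # doc follows a short span
req_b = build_prompt([[7, 8, 9, 10, 21, 22], doc])  # doc follows a longer span

# With the separator rule, the doc's blocks hash identically in both requests.
print(block_hashes(req_a)[-2:] == block_hashes(req_b)[-2:])
# With plain prefix chaining, they differ because the preceding tokens differ.
print(block_hashes(req_a, False)[-2:] == block_hashes(req_b, False)[-2:])
```

Under these assumptions the document's blocks become cache hits wherever the document appears, which is the behavior that identical-prefix caching cannot provide.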
Evaluated on 2WikiMultihopQA with interleaved scheduling, MiniPIC improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
Key facts
- 01 MiniPIC implements Position-Independent Caching in vLLM with fewer than 100 lines of core-engine changes plus a custom attention backend.
- 02 It stores unrotated K vectors in the KV cache and applies RoPE to K tiles inside attention using per-request logical positions.
- 03 Three user-facing token-level primitives are exposed: block-aligned padding, span separator (SSep), and prompt depend (PDep).
- 04 A single MiniPIC instance can realize multiple PIC methods: Block-Attention, EPIC, and Prompt Cache.
- 05 On 2WikiMultihopQA with interleaved scheduling, prefill throughput improves by 49% over baseline vLLM.
- 06 Cached-span time-to-first-token is reduced by up to two orders of magnitude.
- 07 Worst-case overhead is 5.7%, and linear prefill scaling of uncached spans is preserved.
Summary and scoring are generated automatically from the original article. Last processed Jun 13, 2026 · 08:58 UTC.