Every processed story in reverse chronological order, newest coverage first. Filter by tag, source, or score to drill in.
The post consolidates a set of paper-backed, tiered mitigations that, if implemented in runtimes like `llama.cpp` or `vLLM`, could close the gap between DiffusionGemma's naive inference quality and that of autoregressive models like Qwen without waiting for official tooling support.
MiniPIC removes the requirement for identical prefixes to reuse KV cache entries, enabling efficient caching of recurring structured inputs in retrieval-augmented and agentic workloads without the large server-side code changes or host-to-device transfer overhead of prior PIC approaches.
The work removes the rollout stage as the key bottleneck in RL training pipelines by showing that a pre-RL MTP training recipe with TV loss and rejection sampling sustains high acceptance rates throughout RL without costly online updates, delivering up to 1.8x end-to-end acceleration.
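For context on why a TV loss relates to acceptance rates: in standard speculative sampling with rejection, the expected probability that a drafted token is accepted equals one minus the total variation distance between the target and draft distributions. The sketch below illustrates that general identity on a toy three-token vocabulary. It is not the paper's training recipe; the function names and the example distributions are hypothetical and chosen only for illustration.

```python
import random

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def expected_acceptance(p, q):
    """Expected acceptance probability of one drafted token: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """One step of standard speculative sampling (illustrative, not the paper's code).

    p: target-model distribution, q: draft (e.g. MTP head) distribution.
    Returns (token, accepted).
    """
    # Draft a token from q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    x = rng.choices(range(len(p)), weights=residual)[0]
    return x, False

# Hypothetical toy distributions over a three-token vocabulary.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

rng = random.Random(0)
trials = 20000
accepted = sum(speculative_step(p, q, rng)[1] for _ in range(trials))
empirical_rate = accepted / trials

print(total_variation(p, q), expected_acceptance(p, q), empirical_rate)
```

Under this identity, a draft head whose distribution stays close to the target in TV distance keeps its acceptance rate high, which is consistent with the summary's claim that the recipe avoids online updates during RL.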