Jun 10, 2026 · 1 min read · Research Papers

Bebop's TV loss pushes MTP acceptance to 95%, delivering 1.8x RL speedup

Researchers introduce Bebop, a framework that uses a novel end-to-end TV loss and probabilistic rejection sampling to fix degrading Multi-Token Prediction acceptance rates during RL training, achieving up to 1.8x end-to-end acceleration on Qwen3 models.

ArXiv · Yucheng Li, Huiqiang Jiang, Yang Xu

Composite score: 6.4 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The work targets the rollout stage, the key bottleneck in RL training pipelines. It shows that a pre-RL MTP training recipe with TV loss and rejection sampling sustains high acceptance rates throughout RL without costly online updates, delivering up to 1.8x end-to-end acceleration.

Summary: our read of the original

Yucheng Li, Huiqiang Jiang, and Yang Xu introduce Bebop, a framework addressing a critical bottleneck in reinforcement learning (RL) training for large language models: the rollout stage. While Multi-Token Prediction (MTP) via speculative decoding is a natural candidate for accelerating rollouts, prior work has observed that MTP acceptance rates degrade significantly during RL training. Bebop provides a systematic diagnosis and practical recipes to fix this problem at scale.

The paper's first key finding is that MTP acceptance rates are fundamentally bounded by model entropy fluctuations: as entropy rises in the RL stage, the acceptance rate falls in a clear negative linear relationship. Building on this, the authors show that probabilistic rejection sampling largely alleviates the entropy-induced disturbance compared to greedy draft sampling. They further identify that conventional MTP training objectives (cross-entropy or KL divergence) are suboptimal in this setting, and propose a novel end-to-end TV loss that directly optimizes the multi-step rejection sampling acceptance rate. This improves acceptance rates by approximately 10%, reaching up to 95% acceptance and up to 25% extra inference throughput across mathematical reasoning, code generation, and agentic tasks.
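A small numeric sketch may help show why a TV-style objective lines up with rejection sampling. In standard speculative rejection sampling, a draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), so the overall acceptance probability is the sum of min(p, q), which equals 1 minus the total variation distance between p and q. The code below is not the paper's implementation: the three-token distributions are made up, and `accept_greedy_match` is only one assumed reading of "greedy draft sampling" (the draft proposes its argmax, and the proposal survives only if the target's sampled token matches).

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def accept_rejection_sampling(p: Sequence[float], q: Sequence[float]) -> float:
    """Acceptance probability when x ~ q is kept with prob min(1, p(x)/q(x)).

    Summing q(x) * min(1, p(x)/q(x)) over x gives sum_x min(p(x), q(x)).
    """
    return sum(min(a, b) for a, b in zip(p, q))


def accept_greedy_match(p: Sequence[float], q: Sequence[float]) -> float:
    """ASSUMED greedy scheme: draft proposes argmax q; accepted only if a
    token sampled from the target p is that same token."""
    draft_token = max(range(len(q)), key=lambda i: q[i])
    return p[draft_token]


# Hypothetical target (p) and draft (q) distributions over a 3-token vocabulary.
low_p, low_q = [0.90, 0.05, 0.05], [0.85, 0.10, 0.05]
high_p, high_q = [0.40, 0.35, 0.25], [0.45, 0.30, 0.25]

for name, p, q in [("low entropy", low_p, low_q), ("high entropy", high_p, high_q)]:
    print(
        f"{name}: H(p)={entropy(p):.3f} TV={total_variation(p, q):.3f} "
        f"rejection={accept_rejection_sampling(p, q):.3f} "
        f"greedy={accept_greedy_match(p, q):.3f}"
    )
```

In this toy case both pairs have the same total variation distance, so rejection sampling accepts about 95% of drafts in both, while the assumed greedy scheme drops from 0.9 to 0.4 as the target distribution flattens. That is consistent with the summary's claim that rejection sampling is less sensitive to rising entropy, though the paper's actual setup may differ.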

A third contribution addresses online MTP training strategies during RL. The authors find that pre-RL MTP training with the end-to-end TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL process, eliminating the need for costly online MTP updating. Extensive experiments validate these findings, with the full method achieving up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
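To get a rough feel for how acceptance rate translates into rollout throughput, the sketch below uses the standard speculative-decoding estimate: with k draft tokens per verification step and an independent per-token acceptance probability alpha, the expected number of tokens emitted per target forward pass is (1 - alpha^(k+1)) / (1 - alpha). The draft depth `K = 3` and the two alpha values are illustrative assumptions, not figures from the paper, and the estimate ignores draft cost and the non-rollout parts of the RL pipeline, so it is not a reconstruction of the reported 1.8x.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes k draft tokens per step and an i.i.d. per-token acceptance
    probability alpha. Equals sum_{i=0}^{k} alpha**i: the accepted draft
    prefix plus the one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


K = 3  # hypothetical draft depth; the summary does not state one

for alpha in (0.85, 0.95):
    tokens = expected_tokens_per_step(alpha, K)
    print(f"alpha={alpha:.2f}, k={K}: {tokens:.3f} tokens per verification step")
```

Because the estimate grows steeply as alpha approaches 1, keeping acceptance high for the whole RL run matters more than a one-off gain at the start, which is the motivation the summary gives for the pre-RL recipe.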

Key facts

01 Bebop is a systematic framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs.
02 MTP acceptance rates degrade during RL training as model entropy rises, following a clear negative linear relationship between entropy and acceptance rate.
03 Probabilistic rejection sampling outperforms greedy draft sampling in mitigating entropy-induced acceptance rate degradation.
04 Conventional MTP training objectives (cross-entropy or KL divergence) are identified as suboptimal for RL settings.
05 A novel end-to-end TV loss directly optimizes multi-step rejection sampling acceptance rates, yielding ~10% acceptance rate improvements.
06 The method achieves up to 95% acceptance rates and up to 25% extra inference throughput gains across math reasoning, code generation, and agentic tasks.
07 Pre-RL MTP training with the end-to-end TV loss achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.

Topics

#reinforcement-learning #speculative-decoding #multi-token-prediction #inference-optimization #llm-training

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 11, 2026 · 08:34 UTC.