InterleaveThinker brings interleaved text-image generation to any image model
InterleaveThinker is a multi-agent pipeline that adds interleaved text-image generation capabilities to existing image generators using a planner agent, a critic agent, and reinforcement learning via GRPO.
Score breakdown
InterleaveThinker removes the architectural barrier that has prevented existing image generators from producing interleaved text-image sequences, extending a capability previously limited to frontier models like GPT-5 to any image generator via a plug-in multi-agent pipeline.
InterleaveThinker addresses a fundamental architectural limitation of current image generators: their inability to produce interleaved sequences of text and images, which the paper identifies as critical for visual narratives, guidance, and embodied manipulation. Even the most capable open-source Unified Multimodal Models (UMMs) perform poorly on this task. The proposed solution is a two-agent pipeline. A planner agent organizes the image-text input sequence and issues step-by-step instructions to the image generator. A critic agent evaluates each output, identifies samples that deviate from the plan, and refines the instructions to trigger regeneration.
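The planner-critic loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `run_interleaved`, `Verdict`, `toy_generate`, and `toy_critique`, the retry limit, and the string-based stand-ins for images are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical critic output: accept the image, or propose a refined instruction."""
    accepted: bool
    refined_instruction: str


def run_interleaved(
    steps: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 2,  # assumed limit; the source does not state one
) -> Tuple[List[Tuple[str, str]], int]:
    """Run each planned step, regenerating when the critic rejects an output.

    Returns the (instruction used, image) pairs and the total generator calls.
    """
    outputs: List[Tuple[str, str]] = []
    calls = 0
    for planned in steps:
        current = planned
        for _ in range(max_retries + 1):
            used = current
            image = generate(used)
            calls += 1
            verdict = critique(planned, image)
            if verdict.accepted:
                break
            # The critic refines the instruction and triggers regeneration.
            current = verdict.refined_instruction
        outputs.append((used, image))
    return outputs, calls


def toy_generate(instruction: str) -> str:
    """Stand-in for an image generator: returns a string tag, not a real image."""
    return f"img<{instruction}>"


def toy_critique(planned: str, image: str) -> Verdict:
    """Stand-in critic: accepts only outputs whose instruction was 'detailed'."""
    if "detailed" in image:
        return Verdict(True, planned)
    return Verdict(False, f"detailed: {planned}")


plan = ["draw a cat", "detailed: cat sleeping"]  # stand-in for planner output
results, total_calls = run_interleaved(plan, toy_generate, toy_critique)
```

In this toy run the first step is rejected once and regenerated, so two planned steps cost three generator calls. The same mechanism is what pushes real trajectories to many calls.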
Training the pipeline involves a format cold-start phase using two supervised fine-tuning datasets (Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k), followed by reinforcement learning on Interleave-Critic-RL-13k. Because a single interleaved generation trajectory can involve more than 25 generator calls, optimizing the full trajectory end-to-end is computationally impractical. The paper addresses this with accuracy reward and step-wise reward formulations that allow single-step RL, specifically GRPO, to effectively guide the entire trajectory.
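To make the single-step RL idea concrete, the sketch below shows the group-relative advantage that GRPO computes over several sampled outputs for the same input: each reward is normalized by the group's mean and standard deviation. How the paper combines its accuracy and step-wise rewards is not stated in this summary, so `combined_reward` and its `step_weight` parameter are purely illustrative assumptions, as is the choice of population standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, stepwise: float, step_weight: float = 0.5) -> float:
    """Hypothetical weighted sum of the two reward terms (weighting is an assumption)."""
    return accuracy + step_weight * stepwise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its sampling group, as in GRPO.

    Uses the population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic outputs for one step: two judged correct, two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Samples scoring above the group mean get positive advantages and the rest negative, so no separate value model is needed. Because the group is sampled for one step at a time, the policy update never requires rolling out the full trajectory of generator calls.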
Benchmark results show InterleaveThinker achieves performance comparable to Nano Banana and GPT-5 on interleaved generation benchmarks. Notably, the pipeline also produces substantial gains on reasoning-based benchmarks WISE and RISE when applied to 4-step FLUX.2-klein, a result the authors describe as surprising given the pipeline was not explicitly designed for reasoning tasks.
Key facts
- 01 InterleaveThinker is described as the first multi-agent pipeline to add interleaved text-image generation to any existing image generator.
- 02 The pipeline uses two agents: a planner agent to organize and instruct, and a critic agent to evaluate outputs and refine instructions for regeneration.
- 03 Three training datasets are introduced: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k.
- 04 Reinforcement learning uses GRPO with accuracy reward and step-wise reward to make single-step RL practical across long trajectories.
- 05 A single interleaved generation trajectory can involve over 25 generator calls, making full-trajectory optimization computationally impractical.
- 06 On interleaved generation benchmarks, InterleaveThinker achieves performance comparable to Nano Banana and GPT-5.
- 07 Applied to 4-step FLUX.2-klein, the pipeline yields substantial gains on reasoning-based benchmarks WISE and RISE.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.