Jun 11, 2026 · 1 min read · Research Papers

InterleaveThinker brings interleaved text-image generation to any image model

InterleaveThinker is a multi-agent pipeline that adds interleaved text-image generation capabilities to existing image generators using a planner agent, a critic agent, and reinforcement learning via GRPO.

HuggingFace Papers

Composite score: 5.4 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

InterleaveThinker targets the architectural barrier that keeps existing image generators from producing interleaved text-image sequences. Instead of requiring a new model, it adds the capability to any image generator through a plug-in multi-agent pipeline, and the paper reports results comparable to frontier systems such as GPT-5.

Summary: our read of the original

InterleaveThinker addresses a fundamental architectural limitation of current image generators: their inability to produce interleaved sequences of text and images, which the paper identifies as critical for visual narratives, guidance, and embodied manipulation. Even the most capable open-source Unified Multimodal Models (UMMs) perform poorly on this task. The proposed solution is a two-agent pipeline — a planner agent that organizes the image-text input sequence and issues step-by-step instructions to the image generator, and a critic agent that evaluates each output, identifies samples deviating from the plan, and refines instructions to trigger regeneration.
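The planner-critic loop described above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: `run_interleaved`, `toy_generate`, `toy_critique`, and the `max_retries` cap are all hypothetical names and choices, and the generator and critic are plain string stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # final instruction sent to the generator
    image: str        # stand-in for the generated image
    attempts: int     # number of generator calls used for this step


def run_interleaved(
    plan: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    max_retries: int = 2,  # assumed cap; the source does not state one
) -> List[Step]:
    """Run each planned instruction, regenerating when the critic rejects."""
    steps = []
    for instruction in plan:
        current, attempts = instruction, 0
        while True:
            attempts += 1
            image = generate(current)
            accepted, refined = critique(current, image)
            if accepted or attempts > max_retries:
                break
            current = refined  # critic's refined instruction drives the retry
        steps.append(Step(current, image, attempts))
    return steps


def toy_generate(instruction: str) -> str:
    # Hypothetical generator: echoes the instruction as a fake image label.
    return f"img({instruction})"


def toy_critique(instruction: str, image: str) -> Tuple[bool, str]:
    # Hypothetical critic: accepts only outputs that mention "detailed".
    if "detailed" in image:
        return True, instruction
    return False, instruction + " (detailed)"


steps = run_interleaved(["draw a cat", "draw a detailed dog"], toy_generate, toy_critique)
for s in steps:
    print(s.attempts, s.instruction)
```

In this toy run the first step is rejected once and regenerated with a refined instruction, while the second passes on the first attempt. The retry loop is also where generator calls accumulate over a long plan.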

Training the pipeline involves a format cold-start phase using two supervised fine-tuning datasets (Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k), followed by reinforcement learning on Interleave-Critic-RL-13k. Because a single interleaved generation trajectory can involve more than 25 generator calls, optimizing the full trajectory end-to-end is computationally impractical. The paper addresses this with accuracy reward and step-wise reward formulations that allow single-step RL — specifically GRPO — to effectively guide the entire trajectory.
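To make the single-step idea concrete, here is a minimal sketch of the group-relative advantage computation that GRPO is generally known for, applied to one critic step. The way the accuracy and step-wise rewards are combined (a weighted sum with hypothetical weights `w_acc` and `w_step`) is an assumption for illustration; this summary does not give the paper's exact reward formulas.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, stepwise: float,
                    w_acc: float = 1.0, w_step: float = 1.0) -> float:
    # Assumed combination: a weighted sum of the two reward terms.
    return w_acc * accuracy + w_step * stepwise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled response's reward against its own group.

    GRPO samples several responses for the same prompt and uses the
    group mean and standard deviation in place of a learned value baseline.
    Population std is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic responses for one step: (accuracy, step-wise) rewards.
samples = [(1.0, 0.5), (0.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
rewards = [combined_reward(a, s) for a, s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because each update only needs rewards for a group of responses at a single step, no gradient has to flow through the 25 or more generator calls of a full trajectory.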

Benchmark results show InterleaveThinker achieves performance comparable to Nano Banana and GPT-5 on interleaved generation benchmarks. Notably, the pipeline also produces substantial gains on reasoning-based benchmarks WISE and RISE when applied to 4-step FLUX.2-klein, a result the authors describe as surprising given the pipeline was not explicitly designed for reasoning tasks.

Key facts

01 InterleaveThinker is described as the first multi-agent pipeline to add interleaved text-image generation to any existing image generator.
02 The pipeline uses two agents: a planner agent to organize and instruct, and a critic agent to evaluate outputs and refine instructions for regeneration.
03 Three training datasets are introduced: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k.
04 Reinforcement learning uses GRPO with accuracy reward and step-wise reward to make single-step RL practical across long trajectories.
05 A single interleaved generation trajectory can involve over 25 generator calls, making full-trajectory optimization computationally impractical.
06 On interleaved generation benchmarks, InterleaveThinker achieves performance comparable to Nano Banana and GPT-5.
07 Applied to 4-step FLUX.2-klein, the pipeline yields substantial gains on reasoning-based benchmarks WISE and RISE.

Topics

#multi-agent #agent-framework #reinforcement-learning #benchmarks #reasoning

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 12, 2026 · 10:05 UTC.
