Apr 20, 2026 · 1 min read · Research Papers

ManimTrainer pipeline achieves 94% render success for LLM animation generation

Researchers introduce ManimTrainer and ManimAgent, a combined training and agentic inference system for generating Manim animations from text. The best model achieves a 94% Render Success Rate and 85.7% Visual Similarity, 3 percentage points above the GPT-4.1 baseline on Visual Similarity.

ArXiv · Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle

Composite score: 5.8 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Developers building AI-powered educational or presentation tools can use the ManimTrainer/ManimAgent framework as a blueprint for combining fine-tuning and agentic inference to reliably generate high-quality programmatic animations from text prompts.

Summary: our read of the original

Ravidu Suien Rammuni Silva, Ahmad Lotfi, and Isibor Kennedy Ihianle address a gap in LLM research: the systematic study of how training and inference strategies interact when generating programmatic animations with Manim. The task is particularly demanding because it requires spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. To tackle this, the authors introduce two complementary systems: ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning-based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals; and ManimAgent, an inference pipeline offering Renderer-in-the-loop (RITL) and API documentation-augmented RITL-DOC strategies.
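To make the training signal concrete, here is a minimal sketch of how a fused reward could feed GRPO's group-relative advantage. The standardisation step (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation; the linear blend in `fused_reward`, its 0.5 default weight, and the sample scores are illustrative assumptions, not the paper's actual reward design.

```python
from statistics import mean, pstdev


def fused_reward(code_score: float, visual_score: float,
                 code_weight: float = 0.5) -> float:
    """Blend a code score and a visual score, both assumed to lie in [0, 1].

    The linear blend and the default weight are assumptions for illustration.
    """
    return code_weight * code_score + (1.0 - code_weight) * visual_score


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Standardise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt: (code_score, visual_score).
group = [(0.9, 0.8), (0.7, 0.4), (0.0, 0.0), (0.8, 0.9)]
rewards = [fused_reward(c, v) for c, v in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to rank samples for the same prompt.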

The study evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench, a scope that makes it the first unified benchmark of its kind for text-to-code-to-video transformation. The top-performing configuration, Qwen 3 Coder 30B trained with GRPO and deployed with RITL-DOC, achieved a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the GPT-4.1 baseline by 3 percentage points in VS.
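A renderer-in-the-loop strategy and the Render Success Rate can be sketched roughly as follows. This is a hedged illustration of the general idea (generate, render, feed the error back, retry), not ManimAgent's implementation: the `Generator` and `Renderer` interfaces, the `max_attempts` limit, and both stub functions are hypothetical stand-ins for an LLM call and a real Manim render. A RITL-DOC-style variant would presumably add relevant API documentation to the feedback, which is not shown here.

```python
from typing import Callable, Optional

# Hypothetical interfaces: the generator maps (prompt, feedback) to code; the
# renderer returns None on success or an error message on failure.
Generator = Callable[[str, Optional[str]], str]
Renderer = Callable[[str], Optional[str]]


def renderer_in_the_loop(prompt: str, generate: Generator, render: Renderer,
                         max_attempts: int = 3) -> tuple[str, bool]:
    """Regenerate code, passing back the renderer's error, until it renders."""
    feedback: Optional[str] = None
    code = ""
    for _ in range(max_attempts):
        code = generate(prompt, feedback)
        feedback = render(code)
        if feedback is None:
            return code, True
    return code, False


def render_success_rate(outcomes: list[bool]) -> float:
    """Fraction of prompts whose final code rendered without error."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM: the first draft has a typo, the retry fixes it.
    if feedback is None:
        return "self.play(Create(Circl()))"
    return "self.play(Create(Circle()))"


def fake_render(code: str) -> Optional[str]:
    # Stand-in for a Manim render: reports the typo as an error message.
    if "Circl(" in code:
        return "NameError: name 'Circl' is not defined"
    return None


code, ok = renderer_in_the_loop("Draw a circle", fake_generate, fake_render)
```

With the stubs above, the first draft fails, the error message is passed back, and the second attempt renders, so `ok` is `True`. Calling `render_success_rate` on the per-prompt outcomes gives the RSR for a whole benchmark run.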

A key analytical finding is that SFT generally improves code quality metrics, while GRPO enhances visual outputs and increases model responsiveness to extrinsic correction signals at inference time. Notably, the correlation between code and visual metrics strengthens with SFT and GRPO but weakens when inference-time enhancements are applied, revealing that training and agentic inference strategies play complementary rather than redundant roles in this domain.
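The relationship between code and visual metrics can be quantified with a correlation coefficient computed per configuration. The sketch below assumes Pearson's r, since this summary does not say which coefficient the authors use, and the score lists are placeholder values, not results from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder per-sample scores for one configuration (not from the paper).
code_scores = [0.2, 0.5, 0.6, 0.9]
visual_scores = [0.3, 0.4, 0.7, 0.8]
r = pearson(code_scores, visual_scores)
```

Comparing `r` across configurations (base, SFT, GRPO, with and without inference-time correction) is one way to see whether code quality and visual quality move together or drift apart.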

Key facts

01 The paper introduces ManimTrainer, combining Supervised Fine-tuning (SFT) with Reinforcement Learning via Group Relative Policy Optimisation (GRPO) and a unified code-and-visual reward signal.
02 ManimAgent provides two inference strategies: Renderer-in-the-loop (RITL) and API documentation-augmented RITL-DOC.
03 17 open-source sub-30B LLMs were evaluated across nine training and inference strategy combinations using ManimBench.
04 The best model, Qwen 3 Coder 30B with GRPO and RITL-DOC, achieved a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS).
05 This configuration surpassed the baseline GPT-4.1 model by 3 percentage points in Visual Similarity.
06 SFT improves code quality, while GRPO enhances visual outputs and model responsiveness to self-correction signals.
07 The correlation between code and visual metrics strengthens with SFT/GRPO but weakens with inference-time enhancements, showing the two approaches are complementary.

Topics

#code-generation #fine-tuning #reinforcement-learning #benchmarks #agentic-inference

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 21, 2026 · 18:16 UTC.
