ManimTrainer pipeline achieves 94% render success for LLM animation generation
Researchers introduce ManimTrainer and ManimAgent, a combined training and agentic inference system for generating Manim animations from text. The best model achieves a 94% Render Success Rate and 85.7% Visual Similarity, outperforming GPT-4.1 on Visual Similarity.
Developers building AI-powered educational or presentation tools can use the ManimTrainer/ManimAgent framework as a blueprint for combining fine-tuning and agentic inference to reliably generate high-quality programmatic animations from text prompts.
Ravidu Suien Rammuni Silva, Ahmad Lotfi, and Isibor Kennedy Ihianle address a gap in LLM research: the systematic study of how training and inference strategies interact when generating programmatic animations with Manim. The task is particularly demanding because it requires spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. To tackle this, the authors introduce two complementary systems. ManimTrainer is a training pipeline that combines Supervised Fine-tuning (SFT) with reinforcement learning via Group Relative Policy Optimisation (GRPO), driven by a unified reward that fuses code and visual assessment signals. ManimAgent is an inference pipeline offering two strategies: Renderer-in-the-loop (RITL) and its API documentation-augmented variant, RITL-DOC.
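To make the training signal more concrete, here is a minimal sketch of how a fused reward could feed GRPO's group-relative advantage. It is an illustration under stated assumptions, not the authors' implementation: the weighted-average fusion, the `w_code` weight, the function names, and the sample scores are all hypothetical, since this summary does not give the paper's reward formula. The normalisation step (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation.

```python
from statistics import mean, pstdev


def fused_reward(code_score: float, visual_score: float, w_code: float = 0.5) -> float:
    """Hypothetical fusion: weighted average of two scores in [0, 1].

    The actual ManimTrainer reward is not specified in this summary.
    """
    return w_code * code_score + (1.0 - w_code) * visual_score


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for four completions sampled for one prompt:
# (code_score, visual_score)
samples = [(1.0, 0.8), (1.0, 0.4), (0.0, 0.0), (0.5, 0.6)]
rewards = [fused_reward(c, v) for c, v in samples]
advantages = group_relative_advantages(rewards)

for (c, v), r, a in zip(samples, rewards, advantages):
    print(f"code={c:.1f} visual={v:.1f} reward={r:.2f} advantage={a:+.2f}")
```

Because advantages are computed relative to the group, a completion is rewarded for being better than its siblings rather than for clearing an absolute bar, which is why no separate value model is needed.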
The study evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench, making this the first unified benchmark of its kind for text-to-code-to-video transformation. The top-performing configuration, Qwen 3 Coder 30B trained with GRPO and deployed with RITL-DOC, achieved a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the GPT-4.1 baseline by 3 percentage points in VS.
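The renderer-in-the-loop idea can be sketched as a retry loop that feeds render errors back to the model. This is a rough illustration under assumptions rather than ManimAgent's actual code: `ritl_generate`, `RenderResult`, the `docs_lookup` hook (standing in for the RITL-DOC variant), and the stub generator and renderer are all hypothetical names, and a real setup would call an LLM and the Manim renderer instead of the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RenderResult:
    ok: bool
    error: str = ""


def ritl_generate(
    prompt: str,
    generate: Callable[[str, str], str],        # (prompt, feedback) -> code
    render: Callable[[str], RenderResult],      # code -> render outcome
    max_attempts: int = 3,
    docs_lookup: Optional[Callable[[str], str]] = None,  # RITL-DOC style hook
) -> tuple[str, bool, int]:
    """Return (code, success, attempts_used)."""
    feedback = ""
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt, feedback)
        result = render(code)
        if result.ok:
            return code, True, attempt
        # Feed the renderer's error back into the next generation step.
        feedback = result.error
        if docs_lookup is not None:
            feedback += "\n\nRelevant API docs:\n" + docs_lookup(result.error)
    return code, False, max_attempts


# Stubs so the sketch runs without an LLM or Manim installed.
def fake_generate(prompt: str, feedback: str) -> str:
    # The first attempt contains a typo; the retry "fixes" it once told about it.
    return "Circle()" if "Cirle" in feedback else "Cirle()"


def fake_render(code: str) -> RenderResult:
    if "Cirle" in code:
        return RenderResult(False, "NameError: name 'Cirle' is not defined")
    return RenderResult(True)


code, ok, attempts = ritl_generate("Draw a circle", fake_generate, fake_render)
print(code, ok, attempts)
```

In this sketch the documentation-augmented variant differs only in what is appended to the feedback, which keeps the two strategies comparable under the same attempt budget.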
A key analytical finding is that SFT generally improves code quality metrics, while GRPO enhances visual outputs and increases model responsiveness to extrinsic correction signals at inference time. Notably, the correlation between code and visual metrics strengthens with SFT and GRPO but weakens when inference-time enhancements are applied, revealing that training and agentic inference strategies play complementary rather than redundant roles in this domain.
Key facts
- 01 The paper introduces ManimTrainer, combining Supervised Fine-tuning (SFT) with Reinforcement Learning via Group Relative Policy Optimisation (GRPO) and a unified code-and-visual reward signal.
- 02 ManimAgent provides two inference strategies: Renderer-in-the-loop (RITL) and API documentation-augmented RITL-DOC.
- 03 17 open-source sub-30B LLMs were evaluated across nine training and inference strategy combinations using ManimBench.
- 04 The best model, Qwen 3 Coder 30B with GRPO and RITL-DOC, achieved a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS).