Agent pipeline formalizes numerical analysis, exposing flaws kernel acceptance misses
A paper by Theodore Meek, Siyuan Ge, and Di Qiu Xiang applies a coding agent to formalize a numerical analysis textbook in Lean 4 and introduces a three-dimensional quality framework that reveals systematic flaws hidden by standard compilation-based metrics.
Compilation-based metrics, the current standard for judging autoformalization agents, are shown to substantially overstate quality by missing semantic errors that a reproducible three-dimensional audit framework can detect.
Theodore Meek, Siyuan Ge, and Di Qiu Xiang address two gaps in the autoformalization literature: existing coding-agent efforts concentrate on branches of mathematics already well represented in mathlib, and they measure success solely through kernel acceptance (i.e., whether Lean 4's kernel type-checks the proof, loosely described as "compilation"). The paper applies a coding agent to formalize *Numerical Methods for Ordinary Differential Equations*, a numerical analysis textbook whose material is largely absent from mathlib. The domain is chosen deliberately: the agent must develop new theory from scratch rather than reuse existing library infrastructure.
To evaluate quality beyond compilation, the authors introduce a systematic, reproducible three-dimensional framework covering semantic correctness, Mathlib reuse, and cross-file reuse, all assessed via LLM-as-judge methods. They apply this framework both to their own formalization and to the publicly released outputs of RepoProver and M2F. The audit uncovers three recurring patterns of unfaithful formalization, none of which kernel acceptance alone can catch: incomplete multi-part statements, added hypotheses that weaken the result, and parameter restrictions.
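To make the three patterns concrete, here is a minimal Lean 4 sketch. The informal statement, the namespace, and all theorem names are hypothetical illustrations chosen for brevity; none of them come from the paper or the textbook. Every declaration below is accepted by the kernel, yet only the first matches the informal statement.

```lean
-- Toy informal statement (hypothetical, not from the paper):
--   "For every natural number n, n + 0 = n and 0 + n = n."
namespace AuditSketch

-- Faithful: both parts, every n, no extra assumptions.
theorem faithful (n : Nat) : n + 0 = n ∧ 0 + n = n :=
  ⟨Nat.add_zero n, Nat.zero_add n⟩

-- Incomplete multi-part statement: the second conjunct is silently dropped.
theorem incomplete (n : Nat) : n + 0 = n :=
  Nat.add_zero n

-- Added hypothesis: the unneeded assumption `0 < n` excludes the case n = 0.
theorem extraHypothesis (n : Nat) (_h : 0 < n) : n + 0 = n ∧ 0 + n = n :=
  ⟨Nat.add_zero n, Nat.zero_add n⟩

-- Parameter restriction: stated only for the single value n = 1.
theorem restricted : (1 : Nat) + 0 = 1 ∧ 0 + (1 : Nat) = 1 :=
  ⟨Nat.add_zero 1, Nat.zero_add 1⟩

end AuditSketch
```

Because all four theorems type-check, a metric based purely on kernel acceptance would count each of them as a success; telling them apart requires comparing the formal statement against the informal one.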
The central finding is that compilation-based metrics substantially overstate formalization quality. The authors provide their audit methodology as a reproducible tool intended to support more rigorous evaluation of future autoformalization systems, shifting the field's standard of success beyond the binary pass/fail of kernel acceptance.
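The gap between the two kinds of metric can be sketched in a few lines of Python. This is an illustrative sketch only: the `AuditRecord` type, its field names, and the sample records are assumptions made up for the example, not the authors' tool or their results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """One formalized statement with two hypothetical audit flags."""
    name: str
    kernel_accepted: bool        # did the Lean kernel accept the proof?
    semantically_faithful: bool  # did the audit judge it faithful to the source?


def acceptance_rate(records):
    """Fraction of statements the kernel accepts (the compilation-based metric)."""
    return sum(r.kernel_accepted for r in records) / len(records)


def faithful_rate(records):
    """Fraction of statements that are both kernel-accepted and judged faithful."""
    return sum(r.kernel_accepted and r.semantically_faithful for r in records) / len(records)


# Placeholder records, invented purely for illustration.
records = [
    AuditRecord("thm_a", True, True),
    AuditRecord("thm_b", True, False),   # e.g. a dropped conjunct
    AuditRecord("thm_c", True, False),   # e.g. an added hypothesis
    AuditRecord("thm_d", False, False),
]

print(acceptance_rate(records), faithful_rate(records))
```

In this toy data set the acceptance rate is higher than the faithful rate, which is the shape of the discrepancy the paper reports; the actual magnitudes are in the paper, not here.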
Key facts
- Authors: Theodore Meek, Siyuan Ge, Di Qiu Xiang (arXiv, 2026-06-12)
- The coding agent formalizes *Numerical Methods for Ordinary Differential Equations* in Lean 4, a textbook largely absent from mathlib
- Existing autoformalization efforts measure success solely through kernel acceptance (compilation)
- The paper introduces a three-dimensional quality framework: semantic correctness, Mathlib reuse, and cross-file reuse
- Quality evaluation uses LLM-as-judge methods and is designed to be reproducible
- The framework is applied to the authors' own formalization and to released outputs of RepoProver and M2F
- Recurring unfaithful patterns found include incomplete multi-part statements, added weakening hypotheses, and parameter restrictions