Agentic judging pipeline boosts code LLM architectural reasoning by up to 540%
Researchers Kirill Vasilevski, Ximing Dong, and Benjamin Rombaut propose an agentic judging pipeline that uses a strong LLM to evaluate architectural quality in code. Qwen3 models fine-tuned on the data it curates reach resolved rates of up to 27.2% on SWE-bench Verified, an improvement of up to 540% over the base model.
The pipeline replaces prohibitively expensive manual architectural labeling with a scalable LLM-based judge. Models fine-tuned on the curated data reach higher SWE-bench Verified resolved rates than both the base model and models fine-tuned on unfiltered data.
A paper by Kirill Vasilevski, Ximing Dong, and Benjamin Rombaut identifies a gap in how code LLMs are trained and evaluated. These models have become substantially better at software engineering tasks, but they are optimized for correctness rather than architectural understanding, the kind of judgment needed to write patches that conform to a repository's design conventions. Manual labeling of architectural quality is prohibitively expensive, and test-based verification cannot capture it, so curating training data for this skill does not scale.
To close this gap, the authors introduce an agentic judging pipeline that uses a strong LLM as a scalable proxy for expert architectural evaluation. The pipeline comprises two judges: the Architecture Complexity Judge (ACJ) estimates how much codebase-specific architectural understanding a task demands, while the Architecture Quality Judge (AQJ) evaluates patch conformance to repository-specific architectural conventions through source-grounded rubrics. Using this pipeline, the authors curate 3,360 training instances and fine-tune Qwen3 models at three scales — 8B, 14B, and 32B parameters.
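The curation step can be pictured as a two-score filter. The sketch below is illustrative only and is not the authors' implementation: the summary does not say how ACJ and AQJ outputs are combined, so the score ranges, field names (`acj_complexity`, `aqj_quality`), thresholds, and the `curate` function are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedInstance:
    """A candidate training instance with hypothetical judge outputs."""
    instance_id: str
    acj_complexity: float  # assumed ACJ score in [0, 1]: architectural demand of the task
    aqj_quality: float     # assumed AQJ score in [0, 1]: patch conformance to conventions


def curate(instances, min_complexity=0.5, min_quality=0.7):
    """Keep instances that meet both (hypothetical) judge thresholds."""
    return [
        inst
        for inst in instances
        if inst.acj_complexity >= min_complexity
        and inst.aqj_quality >= min_quality
    ]


# Made-up candidates, used only to exercise the filter.
candidates = [
    JudgedInstance("task-a", acj_complexity=0.8, aqj_quality=0.9),
    JudgedInstance("task-b", acj_complexity=0.2, aqj_quality=0.95),
    JudgedInstance("task-c", acj_complexity=0.9, aqj_quality=0.4),
]

kept = curate(candidates)
print([inst.instance_id for inst in kept])  # ['task-a']
```

Under these assumed thresholds, `task-b` is dropped because the task demands little architectural understanding, and `task-c` is dropped because its patch conforms poorly to repository conventions.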
The resulting models achieve resolved rates of up to 27.2% on SWE-bench Verified, representing up to 540% improvement over the base model and 256% over unfiltered fine-tuning. The paper also reports strong cross-language generalization and consistent improvements in architectural patch quality across the trained models.
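To make the percentages concrete, here is a minimal sketch of the arithmetic, assuming "improvement" means relative gain, (new - base) / base. The function names are hypothetical. The implied baseline shown at the end holds only if the 27.2% and 540% figures describe the same model, which the summary does not state, so treat it as a back-of-the-envelope reading and not a reported result.

```python
def relative_improvement(new_rate, base_rate):
    """Percentage gain of new_rate over base_rate."""
    if base_rate <= 0:
        raise ValueError("base_rate must be positive")
    return (new_rate - base_rate) / base_rate * 100.0


def implied_baseline(new_rate, improvement_pct):
    """Baseline rate implied by a new rate and a relative gain in percent."""
    return new_rate / (1.0 + improvement_pct / 100.0)


# Hypothetical numbers: doubling a 5.0% rate to 10.0% is a 100% improvement.
print(round(relative_improvement(10.0, 5.0), 1))  # 100.0

# Assumption: both "up to" figures refer to the same model.
print(round(implied_baseline(27.2, 540.0), 2))  # 4.25
```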
Key facts
- Authors: Kirill Vasilevski, Ximing Dong, and Benjamin Rombaut (published 2026-06-12 on arXiv).
- The pipeline uses two LLM judges: the Architecture Complexity Judge (ACJ) and the Architecture Quality Judge (AQJ).
- ACJ estimates how much architectural understanding a task demands; AQJ evaluates patch conformance to repository-specific conventions via source-grounded rubrics.
- Fine-tuning used 3,360 curated instances on Qwen3 models at 8B, 14B, and 32B parameters.
- Best resolved rate on SWE-bench Verified: 27.2%.
- That result is an improvement of up to 540% over the base model and 256% over unfiltered fine-tuning.
- Trained models show strong cross-language generalization and consistent improvements in architectural patch quality.