Study coins "Instruction-Tuning Tax" for code LLMs
A new empirical study by Chang, Ho, and Li finds that instruction tuning improves LLMs' ability to follow natural-language commands but degrades their code infilling performance — a trade-off the authors term the "Instruction-Tuning Tax."
Score breakdown
The study reveals that a single instruction-tuned model cannot optimally serve both Flow and Command coding modes simultaneously, highlighting a concrete design tension that the authors argue must be carefully balanced in AI-powered coding assistant development.
- 01The paper introduces the term 'Instruction-Tuning Tax' to describe the trade-off caused by instruction tuning across programming modes.
- 02Developers interact with code in two modes: Flow (code completion/infilling) and Command (natural-language instructions to code).
- 03Instruction-tuned models are better at following instructions and leveraging structured guidance.
Shi Ying Chang, Chiok Yew Ho, and Yichen Li introduce the concept of the "Instruction-Tuning Tax" to describe a key trade-off they uncover in code-focused large language models. Their paper frames developer interaction with AI coding assistants around two cognitive modes: **Flow mode**, where developers need tools that directly complete or infill code in unfinished programs, and **Command mode**, where developers express intent in natural language and expect executable code in return. While instruction-tuned LLMs have come to dominate many application scenarios due to their intent-inference capabilities, the authors argue it was previously unclear whether this paradigm is equally suitable for both modes.
To fill that gap, the paper conducts what it describes as the first empirical study on this trade-off.
To fill that gap, the paper conducts what it describes as the first empirical study on this trade-off. The methodology combines qualitative and quantitative approaches: manual failure categorization, behavioral metrics that capture generation fidelity, and evaluation of intermediate checkpoints throughout the tuning process. The results show that instruction-tuned models gain stronger instruction-following and structured-guidance abilities, but these gains come at the cost of weaker infilling performance. The study synthesizes its findings into seven discrete findings and four implications, offering guidance on how to balance instruction-following ability with effective code generation assistance in the development of AI-powered coding tools.
Key facts
- 01The paper introduces the term 'Instruction-Tuning Tax' to describe the trade-off caused by instruction tuning across programming modes.
- 02Developers interact with code in two modes: Flow (code completion/infilling) and Command (natural-language instructions to code).
- 03Instruction-tuned models are better at following instructions and leveraging structured guidance.
- 04Those same instruction-tuned models show weaker infilling performance compared to their base counterparts.
- 05The study is described as the first empirical investigation of this trade-off in code LLMs.
- 06Methodology includes manual failure categorization, behavioral metrics for generation fidelity, and intermediate-checkpoint evaluation.
- 07Results are summarized into seven findings and four implications for AI coding tool development.
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC. How this works →