Three-axis uncertainty framework lifts code LLM error detection by 8.1 AUROC points
Researchers Yuling Shi, Caiqi Zhang, and Yuexian Li propose a code-specific uncertainty estimation framework with three orthogonal axes — lexical, algorithmic, and functional — that improves average AUROC from 0.696 to 0.776 over the strongest natural-language-derived baseline across five code LLMs.
Score breakdown
The work demonstrates that code-specific uncertainty estimation — rather than methods borrowed from natural language — meaningfully improves the ability to detect silently wrong programs, which is directly relevant to safe deployment of LLMs in agentic coding pipelines.
Yuling Shi, Caiqi Zhang, and Yuexian Li present a framework that challenges the common practice of applying natural language (NL) uncertainty estimation (UE) methods directly to code generation. The paper argues that code differs from NL in three fundamental ways: token fragility (a single wrong token can break an entire program), the intent-code gap (algorithmic intent and concrete implementation can disagree independently), and executability (programs can be run to observe behavior). Silently wrong programs generated by LLMs pose real safety and reliability risks, making reliable UE essential for selective prediction, human-in-the-loop review, and downstream agentic decisions.
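Token fragility and executability can be illustrated with a small sketch. This is not the authors' implementation: the candidate programs, the `solve` entry point, the probe inputs, and the helper names `behavior_signature` and `behavioral_entropy` are all hypothetical, chosen only to show how a one-token change alters observable behavior and how disagreement among sampled programs might be turned into a score.

```python
import math
from collections import Counter

# Hypothetical samples for the task "return the largest element of xs".
# The second candidate differs from the first by a single token (`<` for `>`).
CANDIDATES = [
    """
def solve(xs):
    best = xs[0]
    for x in xs:
        if x > best:
            best = x
    return best
""",
    """
def solve(xs):
    best = xs[0]
    for x in xs:
        if x < best:
            best = x
    return best
""",
    """
def solve(xs):
    return max(xs)
""",
]

# Hypothetical probe inputs; no reference outputs are needed.
PROBE_INPUTS = [[3, 1, 2], [5], [-1, -7, 0]]


def behavior_signature(source, inputs):
    """Run one candidate on every probe input and record what it does.

    NOTE: exec() on model-generated code is unsafe outside a sandbox;
    it is used here only to keep the sketch self-contained.
    """
    namespace = {}
    try:
        exec(source, namespace)
    except Exception as exc:
        return ("load-error", type(exc).__name__)
    outputs = []
    for args in inputs:
        try:
            outputs.append(repr(namespace["solve"](list(args))))
        except Exception as exc:
            outputs.append("error:" + type(exc).__name__)
    return tuple(outputs)


def behavioral_entropy(sources, inputs):
    """Entropy (in nats) over groups of candidates with identical behavior."""
    counts = Counter(behavior_signature(s, inputs) for s in sources)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

With these inputs, the first and third candidates produce the same outputs while the one-token variant does not, so `behavioral_entropy(CANDIDATES, PROBE_INPUTS)` is about 0.64 nats, and it drops to zero when every sample behaves identically.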
The authors translate these three properties into three orthogonal uncertainty axes: lexical uncertainty measured via Top-K token entropy, algorithmic uncertainty measured via pseudo-code consistency, and functional uncertainty measured via behavioral consistency. Evaluated across five code LLMs, their three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776, a gain of +8.1 points. Notably, on Qwen3-14B, the single-pass Top-K token entropy method alone matches the strongest multi-pass baseline while being over 3x cheaper to compute, and it remains a competitive low-cost signal across models. The results support the paper's central thesis that code UE deserves code-specific design rather than direct NL ports.
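To make the lexical axis concrete, here is a minimal sketch of a single-pass Top-K token entropy score. The summary does not give the paper's exact formula, so two choices below are assumptions: the top-K probabilities are renormalized to sum to one at each position, and per-token entropies are averaged over the sequence. The function names are hypothetical.

```python
import math


def topk_entropy(top_logprobs):
    """Entropy (in nats) of one token position from its K highest log-probs.

    Assumption: the K probabilities are renormalized to sum to 1.
    """
    # Subtract the max before exponentiating for numerical stability.
    peak = max(top_logprobs)
    weights = [math.exp(lp - peak) for lp in top_logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_uncertainty(per_token_top_logprobs):
    """Mean Top-K entropy over all generated tokens (assumed aggregation)."""
    if not per_token_top_logprobs:
        raise ValueError("need at least one token position")
    entropies = [topk_entropy(pos) for pos in per_token_top_logprobs]
    return sum(entropies) / len(entropies)
```

Because the log-probabilities come from the same forward pass that generated the program, a score like this needs no extra sampling, which is consistent with the reported cost advantage over multi-pass baselines.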
Key facts
- 01 The paper identifies three code-specific properties: token fragility, the intent-code gap, and executability.
- 02 These properties map to three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency).
- 03 The three-axis ensemble improves average AUROC from 0.696 to 0.776 (+8.1 points) over the strongest NL-derived baseline across five code LLMs.
- 04 On Qwen3-14B, single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper.
- 05 The authors argue that existing code UE methods are largely inherited from NL generation and ignore code-specific properties.
- 06 Reliable UE is framed as essential for selective prediction, human-in-the-loop review, and downstream agentic decisions.
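For readers unfamiliar with the metric, the sketch below shows what AUROC measures in this setting and how an uncertainty score could drive selective prediction. It is a generic illustration, not the paper's evaluation code: the scores, labels, threshold, and function names are made up for the example.

```python
def auroc(uncertainty, is_wrong):
    """Probability that a wrong program gets a higher uncertainty score
    than a correct one (ties count half)."""
    wrong = [u for u, w in zip(uncertainty, is_wrong) if w]
    right = [u for u, w in zip(uncertainty, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct program")
    wins = sum(
        1.0 if w > r else 0.5 if w == r else 0.0
        for w in wrong
        for r in right
    )
    return wins / (len(wrong) * len(right))


def select_for_review(uncertainty, threshold):
    """Indices of programs to route to a human instead of auto-accepting."""
    return [i for i, u in enumerate(uncertainty) if u >= threshold]


# Hypothetical scores for five generated programs and whether each failed.
scores = [0.1, 0.8, 0.3, 0.9, 0.2]
failed = [False, True, False, True, True]
```

Here `auroc(scores, failed)` is 5/6, because five of the six wrong/correct pairs are ranked in the right order, and `select_for_review(scores, 0.5)` flags programs 1 and 3. An AUROC of 0.5 corresponds to a score that ranks wrong programs no better than chance.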
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.