OpenHands webinar explains concrete scoring for critic agents
An OpenHands webinar clip explains how replacing vague LLM-as-judge scoring criteria with concrete, violation-based deductions makes critic agent scores consistent, actionable, and trustworthy.
Teams building agentic code-review or migration pipelines can adopt violation-based deduction scoring to get stable, auditable critic signals that reliably guide agents toward correct, style-compliant output.
The OpenHands webinar clip addresses a widespread problem in agentic workflows: when critic agents are asked to score outputs using vague categories such as "style," "functionality," or "cleanliness" on a 0–100 scale without explicit definitions, the scores vary significantly from run to run. Because the agent is free to interpret those criteria however it sees fit in a given context, a second evaluation of the same code can produce an entirely different result, making it impossible to trust whether progress is real.
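One way to make this variability visible is to re-run the same critic on unchanged code and summarize how far the scores spread. The sketch below assumes the scores have already been collected; the function name `score_spread` and the sample numbers are hypothetical illustrations, not figures from the webinar.

```python
from statistics import mean, pstdev


def score_spread(scores: list[int]) -> dict[str, float]:
    """Summarize how much repeated critic scores for the same code disagree."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation across runs
        "range": max(scores) - min(scores),
    }


# Hypothetical scores from re-running one critic on the same, unchanged code.
example_runs = [72, 85, 64, 90, 78]
summary = score_spread(example_runs)
print(summary)
```

A wide range on identical input suggests the criteria leave too much room for interpretation, since nothing in the code changed between runs.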
The proposed fix is concrete scoring: define precisely what good code and bad code look like, then instruct the critic agent to deduct one point for every violation it identifies. This ties each score deduction to a specific, locatable issue in the code, which is necessary for generating actionable feedback for downstream steps. The clip also highlights two dimensions worth scoring explicitly: completion (whether a Java migration fully covers all business logic from the COBOL source) and correctness (whether it faithfully reproduces that logic). It notes that this scoring is not a replacement for traditional testing.
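The deduction rule can be sketched in a few lines. This is a minimal illustration, not the webinar's implementation: the `Violation` fields, the rule names, the file name, and the choice to start from 100 points with a floor of zero are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    """One concrete, locatable problem reported by the critic (hypothetical shape)."""
    file: str
    line: int
    rule: str  # e.g. a completion or correctness rule identifier
    note: str  # short explanation of what is wrong


def score_from_violations(violations: list[Violation], max_score: int = 100) -> int:
    """Deduct one point per violation; starting at max_score is an assumption."""
    return max(0, max_score - len(violations))


# Hypothetical findings for a migrated Java file.
found = [
    Violation("PayrollService.java", 42, "completion",
              "overtime branch from the source program is not migrated"),
    Violation("PayrollService.java", 87, "correctness",
              "rounding differs from the source logic"),
]
score = score_from_violations(found)
print(score)
for v in found:
    print(f"{v.file}:{v.line} [{v.rule}] {v.note}")
```

Because the score is derived from the list rather than assigned directly, every lost point can be traced back to one entry with a file and line.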
Beyond business logic, the same framework can enforce company style guides and architectural patterns: any deviation from a provided style guide counts as a violation and triggers a point deduction. Because every score drop is traceable to a concrete bad pattern, the critic can also be used for counterfactual prompting, that is, generating feedback that, if followed, would directly increase the score. The result is a measurable, explainable improvement loop.
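The counterfactual idea can be illustrated as turning each recorded violation into a fix instruction together with the score it would yield. The sketch below is an assumption-laden example: the dictionary keys (`file`, `line`, `rule`, `fix`), the sample findings, and the 100-point starting value are hypothetical, and no floor is applied so that each fix is worth exactly one point.

```python
def counterfactual_feedback(violations: list[dict], max_score: int = 100) -> dict:
    """Turn violations into feedback that states the score gain of each fix."""
    current = max_score - len(violations)  # one point deducted per violation
    items = [
        f"{v['file']}:{v['line']} [{v['rule']}] {v['fix']} (+1 point)"
        for v in violations
    ]
    return {
        "current_score": current,
        "score_if_all_fixed": current + len(violations),
        "feedback": items,
    }


# Hypothetical style-guide findings from a critic run.
findings = [
    {"file": "OrderDao.java", "line": 15, "rule": "style.naming",
     "fix": "rename field to camelCase"},
    {"file": "OrderDao.java", "line": 60, "rule": "architecture.layering",
     "fix": "move the SQL call out of the controller layer"},
    {"file": "Invoice.java", "line": 9, "rule": "style.imports",
     "fix": "remove the wildcard import"},
]
report = counterfactual_feedback(findings)
print(report["current_score"], "->", report["score_if_all_fixed"])
for item in report["feedback"]:
    print(item)
```

Feeding the `feedback` lines to the next agent step and re-scoring afterwards gives a direct check on whether the suggested changes actually removed the violations.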
Key facts
- 01. Vague scoring categories (e.g., 'style', 'cleanliness') cause high run-to-run variability in LLM-as-judge workflows.
- 02. Concrete scoring instructs the critic agent to deduct one point per identifiable violation rather than assign holistic scores.
- 03. Tying score deductions to specific code locations makes feedback actionable for the next step in the workflow.
- 04. Completion is defined as a migration fully covering all business logic from the source; correctness means faithfully reproducing that logic.
- 05. Concrete scoring is described as complementary to, not a replacement for, traditional testing.
- 06. Company style guides and architectural patterns can be provided as documentation, with any deviation counting as a scoreable violation.
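The points above could be combined into a critic prompt and a reply parser along the following lines. This is a sketch under stated assumptions: the prompt wording, the JSON reply format, the field names, and the helper names `build_critic_prompt` and `parse_violations` are hypothetical and do not come from OpenHands or the webinar.

```python
import json

# Fields every reported violation must carry so it stays locatable (assumed format).
REQUIRED_FIELDS = ("file", "line", "rule", "explanation")


def build_critic_prompt(style_guide: str, code: str) -> str:
    """Assemble a prompt that asks for per-violation deductions, not a holistic score."""
    return (
        "You are a code critic. Start from 100 points and deduct one point "
        "for every violation of the rules below. Do not give a holistic score.\n"
        "For each violation, report file, line, rule, and explanation.\n"
        'Reply with JSON only: {"violations": [...]}\n\n'
        f"Rules (style guide and architectural patterns):\n{style_guide}\n\n"
        f"Code to review:\n{code}\n"
    )


def parse_violations(reply: str) -> list[dict]:
    """Keep only violations that include every required field."""
    data = json.loads(reply)
    return [
        item for item in data.get("violations", [])
        if all(field in item for field in REQUIRED_FIELDS)
    ]


prompt = build_critic_prompt("Use camelCase for fields.", "int Order_total;")

# Hypothetical critic reply: the second entry has no location and is dropped.
sample_reply = json.dumps({"violations": [
    {"file": "Order.java", "line": 1, "rule": "style.naming",
     "explanation": "field is not camelCase"},
    {"rule": "style.general", "explanation": "code feels messy"},
]})
kept = parse_violations(sample_reply)
print(len(kept))
```

Dropping entries without a file and line is a design choice in this sketch: it keeps vague, unlocatable complaints from affecting the score.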