Apr 17, 2026 · 1 min read · Tutorials & How-To

OpenHands webinar explains concrete scoring for critic agents

An OpenHands webinar clip explains how replacing vague LLM-as-judge scoring criteria with concrete, violation-based deductions makes critic agent scores consistent, actionable, and trustworthy.

YouTube: OpenHands·OpenHands


Composite score: 5.0 out of 10

Weights applied: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%.

Why it matters

Teams building agentic code-review or migration pipelines can adopt violation-based deduction scoring to get stable, auditable critic signals that reliably guide agents toward correct, style-compliant output.


Summary: our read of the original

The OpenHands webinar clip addresses a widespread problem in agentic workflows: when critic agents are asked to score outputs using vague categories such as "style," "functionality," or "cleanliness" on a 0–100 scale without explicit definitions, the scores vary significantly from run to run. Because the agent is free to interpret those criteria however it sees fit in a given context, a second evaluation of the same code can produce an entirely different result, making it impossible to trust whether progress is real.
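One way to make the run-to-run problem visible is to score the same, unchanged code several times and look at the spread. The sketch below is only an illustration: the helper `score_spread` is a hypothetical name, and both score lists are invented numbers, not measurements from the webinar.

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Summarize repeated critic scores for the same, unchanged code."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "range": max(scores) - min(scores),
    }


# Invented example data: five runs of a critic asked for a vague "style" score.
vague_runs = [62, 81, 70, 55, 88]
# Invented example data: five runs of a critic that counts listed violations.
concrete_runs = [93, 93, 92, 93, 93]

vague = score_spread(vague_runs)
concrete = score_spread(concrete_runs)

print(f"vague critic range:    {vague['range']}")
print(f"concrete critic range: {concrete['range']}")
```

A wide range on identical input means a later score change cannot be attributed to the code itself, which is the trust problem the clip describes.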

The proposed fix is concrete scoring: define precisely what good code and bad code look like, then instruct the critic agent to deduct one point for every violation it identifies. Each deduction is therefore tied to a specific, locatable issue in the code, which is what makes the feedback actionable for downstream steps. The clip also highlights two dimensions worth scoring explicitly: completion (whether a Java migration covers all business logic from the COBOL source) and correctness (whether it faithfully reproduces that logic). It notes that this scoring does not replace traditional testing.
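The deduction rule described above can be sketched in a few lines. This is a minimal illustration, not the OpenHands implementation: the `Violation` record, the `score` function, the starting value of 100, and the example file names and rule names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    rule: str   # the written rule that was broken (hypothetical names below)
    file: str   # where the problem is located
    line: int
    note: str   # what the critic observed


def score(violations, max_score=100):
    """Deduct one point per violation, never dropping below zero.

    The starting value of 100 is an assumption for this sketch.
    """
    return max(0, max_score - len(violations))


# Hypothetical findings for a migrated Java file.
found = [
    Violation("completion", "BillingService.java", 42,
              "late-fee branch from the source program is missing"),
    Violation("correctness", "BillingService.java", 77,
              "rounding differs from the source logic"),
]

print(score(found))
for v in found:
    print(f"-1 {v.rule} at {v.file}:{v.line}: {v.note}")
```

Because the score is derived from the list rather than asserted directly, every lost point can be traced to one entry with a file and line.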

Beyond business logic, the same framework can enforce company style guides and architectural patterns: any deviation from a provided style guide counts as a violation and triggers a point deduction. Because every score drop is traceable to a concrete bad pattern, the critic can also be used for counterfactual prompting — generating feedback that, if followed, would directly increase the score — creating a measurable, explainable improvement loop.
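The improvement loop described here can be illustrated by turning a violation list into feedback that states what fixing each item would do to the score. The function name `counterfactual_feedback`, the dictionary keys, the starting score of 100, and the sample violations are hypothetical choices for this sketch.

```python
def counterfactual_feedback(violations, max_score=100):
    """Build feedback where each item names a location and the point it would recover."""
    current = max(0, max_score - len(violations))
    lines = [f"Current score: {current}/{max_score}"]
    for v in violations:
        lines.append(
            f"- {v['file']}:{v['line']} breaks '{v['rule']}'. "
            "Fixing it recovers 1 point."
        )
    # Fixing every listed item removes every deduction.
    fixed = min(max_score, current + len(violations))
    lines.append(f"Score if all items are fixed: {fixed}/{max_score}")
    return "\n".join(lines)


# Hypothetical style-guide violations reported by a critic.
sample = [
    {"rule": "no-wildcard-imports", "file": "Invoice.java", "line": 3},
    {"rule": "constructor-injection", "file": "InvoiceController.java", "line": 18},
]

feedback = counterfactual_feedback(sample)
print(feedback)
```

The next agent step receives a list of concrete fixes plus the score it should see after applying them, so a rerun of the critic can confirm whether the change actually helped.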

Key facts

01 Vague scoring categories (e.g., 'style', 'cleanliness') cause high run-to-run variability in LLM-as-judge workflows.
02 Concrete scoring instructs the critic agent to deduct one point per identifiable violation rather than assign holistic scores.
03 Tying score deductions to specific code locations makes feedback actionable for the next step in the workflow.
04 Completion is defined as a migration fully covering all business logic from the source; correctness means faithfully reproducing that logic.
05 Concrete scoring is described as complementary to, not a replacement for, traditional testing.
06 Company style guides and architectural patterns can be provided as documentation, with any deviation counting as a scoreable violation.
07 Consistent, low-variability scores enable counterfactual prompting: generating feedback that, if followed, would raise the score.
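As a rough sketch of how a style guide might be handed to a critic, the code below builds a prompt from explicit rules and keeps only reported violations that cite a known rule and a concrete location. The prompt wording, the JSON reply format, the rule ids, and both function names are assumptions for illustration; none of them come from the webinar or from an OpenHands API.

```python
import json


def build_critic_prompt(rules, code):
    """Compose a critic prompt that asks for located violations, not a holistic score."""
    numbered = "\n".join(f"{rule_id}: {text}" for rule_id, text in rules.items())
    return (
        "You are a code critic. Check the code against the rules below.\n"
        "List every violation as a JSON array of objects with the keys\n"
        '"rule", "file", and "line". Do not give an overall score.\n\n'
        f"Rules:\n{numbered}\n\nCode:\n{code}\n"
    )


def parse_critic_reply(reply, rules):
    """Keep only violations that cite a known rule and a concrete location."""
    kept = []
    for item in json.loads(reply):
        has_location = bool(item.get("file")) and isinstance(item.get("line"), int)
        if item.get("rule") in rules and has_location:
            kept.append(item)
    return kept


# Hypothetical style-guide rules and a hypothetical critic reply.
rules = {
    "R1": "No wildcard imports.",
    "R2": "Services receive dependencies through the constructor.",
}
reply = json.dumps([
    {"rule": "R1", "file": "Invoice.java", "line": 3},
    {"rule": "R9", "file": "Invoice.java", "line": 10},  # unknown rule, dropped
    {"rule": "R2", "file": "InvoiceService.java"},       # no line number, dropped
])

prompt = build_critic_prompt(rules, "import java.util.*;")
violations = parse_critic_reply(reply, rules)
print(100 - len(violations))  # starting score of 100 is an assumption
```

Filtering out unlocated or unrecognized findings keeps each deduction traceable to a written rule, which is the property the key facts above depend on.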

Topics

#agent-framework #prompt-engineering #multi-agent #developer-tools #reasoning

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
