Apr 23, 2026 · 1 min read · Tutorials & How-To

Python module lets devs A/B test prompts inside Claude Code

Frank Brsrk built an open-source Python module that forks any prompt through two `gpt-4o` agents (one baseline, one augmented with an Ejentum reasoning scaffold), then uses a Gemini Flash judge to score the outputs and return a verdict. The whole comparison can be run inside Claude Code in about 5 minutes.

Dev.to #ai · Frank Brsrk

Composite score: 5.2 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Developers iterating on system prompts inside Claude Code or similar IDE agents can use this module to get an objective, reproducible verdict on whether a prompt change actually improves reasoning — rather than relying on subjective impression.

Summary: our read of the original

Frank Brsrk built this module to solve a personal problem: as a solo founder dogfooding Claude Code, he had no reliable way to verify whether changes to a system prompt were producing genuinely better reasoning or just differently formatted output. The tool forks any prompt through two `gpt-4o` agents at temperature 0 — a baseline agent and one augmented with an Ejentum reasoning scaffold, delivered via a function-call tool that retrieves a structured reasoning constraint set at runtime. A blind Gemini Flash judge then evaluates both responses on five dimensions, seeing only neutral A/B labels, and returns a structured JSON verdict with a quoted rationale.
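The flow described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: `run_ab_test`, `Trial`, and the `Agent` and `Judge` callables are hypothetical names, and the model calls are injected as plain functions so the sketch runs without API keys. Shuffling which agent sits behind label A is one common way to keep a judge blind; the summary does not say how the module itself assigns labels.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical types: each agent maps a prompt to a response string.
Agent = Callable[[str], str]
# The judge sees only the prompt and the two neutrally labeled responses.
Judge = Callable[[str, Dict[str, str]], Dict[str, str]]


@dataclass
class Trial:
    verdict: str               # "A", "B", or "tie", as returned by the judge
    winner: str                # "baseline", "augmented", or "tie"
    rationale: str
    label_map: Dict[str, str]  # which agent was behind each neutral label


def run_ab_test(prompt: str, baseline: Agent, augmented: Agent, judge: Judge,
                rng: Optional[random.Random] = None) -> Trial:
    rng = rng or random.Random()
    responses = {"baseline": baseline(prompt), "augmented": augmented(prompt)}

    # Assumption: randomize label assignment so position reveals nothing.
    names = list(responses)
    rng.shuffle(names)
    label_map = dict(zip(("A", "B"), names))

    blind = {label: responses[name] for label, name in label_map.items()}
    result = judge(prompt, blind)

    verdict = result["verdict"]
    if verdict not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    # Map the neutral label back to the agent only after the judge has answered.
    winner = label_map.get(verdict, "tie")
    return Trial(verdict, winner, result.get("rationale", ""), label_map)
```

In a real run, `baseline`, `augmented`, and `judge` would wrap the respective model API calls; keeping them as injected callables also makes the comparison logic easy to test offline.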

The article illustrates the approach with a medical second-opinion prompt (a 45-year-old patient with pre-diabetic markers, dyslipidemia, and vitamin D deficiency). The baseline agent gave a general dietary recommendation; the scaffolded agent identified the patient's reported symptom of sluggishness and suggested a thyroid panel. The Gemini Flash judge scored the scaffolded response 20 to 16 and cited its superior diagnostic reasoning. The module is designed around three transparency principles: full trace visibility of every step, auditability via published markdown system prompts in the repo, and verifiability by anyone with API keys who can clone and re-run the same script.

The author also addresses ties explicitly: on low-complexity, single-turn prompts where `gpt-4o` already performs well, both agents will produce similar responses and the judge will tie them — which the author frames as a meaningful signal rather than a tool failure. The scaffold's advantage surfaces specifically on prompts where baseline `gpt-4o` exhibits failure modes such as sycophancy, shallow single-cause framing, generic templated responses, or missed differential diagnosis on ambiguous data.
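One way to act on this is to tally outcomes over a batch of prompts instead of reading too much into a single run. A small sketch follows; `summarize` and the outcome strings are hypothetical names, and the sample list is made up for illustration, not data reported in the article.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(winners: Iterable[str]) -> Dict[str, float]:
    """Return the share of runs won by each side, counting ties separately."""
    counts = Counter(winners)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no results to summarize")
    return {k: counts.get(k, 0) / total for k in ("baseline", "augmented", "tie")}


# Illustrative outcomes only (not results from the article).
sample = ["tie", "tie", "augmented", "tie"]
print(summarize(sample))
```

A high tie share would suggest the prompt set does not exercise the failure modes the scaffold targets, which fits the author's framing of ties as signal rather than tool failure.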

Key facts

01 The module forks any prompt through two `gpt-4o` agents at temperature 0: one baseline, and one augmented with an Ejentum reasoning scaffold injected via a function-call tool at runtime.
02 A blind Gemini Flash judge scores both responses on five dimensions and returns a verdict of 'A', 'B', or 'tie', along with a quoted rationale.
03 In a medical second-opinion demo, the scaffolded agent scored 20 vs. 16 for the baseline, with the judge citing its identification of sluggishness as a symptom and recommendation of thyroid testing.
04 All three system prompts (baseline, augmented, evaluator) are published as readable markdown in the repo for full auditability.
05 The judge intentionally runs on a different model family (Gemini) than the producers (OpenAI) to reduce bias.
06 The module can be run directly via `python orchestrator.py` or delegated to an IDE agent like Claude Code, Cursor, or Antigravity.
07 Ties are treated as a real signal: they indicate the prompt doesn't stress the failure modes the scaffold is designed to prevent.
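For readers wiring a judge like this into their own scripts, the structured verdict could be validated before it is trusted. The sketch below is an assumption-heavy illustration: the field names `verdict`, `scores`, and `rationale` are hypothetical (this summary does not give the exact JSON schema), the rationale is left as a placeholder, and placing the scaffolded agent behind label "B" is assumed. Only the 20 vs. 16 totals come from the demo.

```python
import json
from typing import Any, Dict

ALLOWED_VERDICTS = {"A", "B", "tie"}


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Parse and sanity-check a judge reply (hypothetical schema)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    if not str(data.get("rationale", "")).strip():
        raise ValueError("rationale is missing or empty")
    scores = data.get("scores", {})
    if not isinstance(scores, dict) or set(scores) != {"A", "B"}:
        raise ValueError("scores must contain exactly the labels A and B")
    return data


# Totals from the demo (20 vs. 16); the label assignment is assumed.
reply = '{"verdict": "B", "scores": {"A": 16, "B": 20}, "rationale": "..."}'
print(parse_verdict(reply)["verdict"])
```

Rejecting malformed replies early keeps a judge that drifts from the expected format from silently skewing the comparison.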

Topics

#prompt-engineering #agentic-coding #ide-integration #benchmarks #tool-use

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.