Python module lets devs A/B test prompts inside Claude Code
Frank Brsrk built an open-source Python module that forks any prompt through two `gpt-4o` agents, one baseline and one augmented with an Ejentum reasoning scaffold, then uses a Gemini Flash judge to score the outputs and return a verdict, all runnable inside Claude Code in about 5 minutes.
Developers iterating on system prompts inside Claude Code or similar IDE agents can use this module to get an objective, reproducible verdict on whether a prompt change actually improves reasoning — rather than relying on subjective impression.
Frank Brsrk built this module to solve a personal problem: as a solo founder dogfooding Claude Code, he had no reliable way to verify whether changes to a system prompt were producing genuinely better reasoning or just differently formatted output. The tool forks any prompt through two `gpt-4o` agents at temperature 0 — a baseline agent and one augmented with an Ejentum reasoning scaffold, delivered via a function-call tool that retrieves a structured reasoning constraint set at runtime. A blind Gemini Flash judge then evaluates both responses on five dimensions, seeing only neutral A/B labels, and returns a structured JSON verdict with a quoted rationale.
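The fork-and-judge flow described above can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: `ab_test`, the `Agent` and `Judge` callable types, and the JSON field name `verdict` are all hypothetical, and the random shuffling of which agent receives label A is an assumption added here to show one way of keeping the judge blind. The model calls themselves are left as injected callables so the sketch runs without API keys.

```python
import json
import random
from typing import Callable

# Hypothetical callable shapes; real agents would wrap model API calls.
Agent = Callable[[str], str]            # prompt -> response text
Judge = Callable[[str, str, str], str]  # prompt, response A, response B -> JSON string


def ab_test(prompt: str, baseline: Agent, augmented: Agent, judge: Judge,
            rng: random.Random) -> dict:
    """Fork one prompt through two agents and let a blind judge compare them."""
    responses = {"baseline": baseline(prompt), "augmented": augmented(prompt)}

    # Assumption: shuffle which agent is shown as "A" so position carries no signal.
    order = ["baseline", "augmented"]
    rng.shuffle(order)
    labels = dict(zip(("A", "B"), order))

    # The judge only ever sees the neutral labels, never the agent names.
    raw = judge(prompt, responses[labels["A"]], responses[labels["B"]])
    verdict = json.loads(raw)

    # Map the label back to an agent name only after judging; "tie" passes through.
    winner = labels.get(verdict["verdict"], "tie")
    return {"labels": labels, "verdict": verdict, "winner": winner}
```

Passing stub callables (for example, a judge that always returns `{"verdict": "tie"}`) is enough to exercise the control flow before wiring in real model clients.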
The article illustrates the approach with a medical second-opinion prompt (a 45-year-old patient with pre-diabetic markers, dyslipidemia, and vitamin D deficiency). The baseline agent gave a general dietary recommendation; the scaffolded agent identified the patient's reported symptom of sluggishness and suggested a thyroid panel. The Gemini Flash judge scored the scaffolded response 20 to 16 and cited its superior diagnostic reasoning. The module is designed around three transparency principles: full trace visibility of every step, auditability via published markdown system prompts in the repo, and verifiability by anyone with API keys who can clone and re-run the same script.
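To make the scoring step concrete, here is a hedged sketch of how a structured verdict might be validated and totalled. The article does not name the five dimensions or give the JSON schema, so the field names (`verdict`, `scores`, `rationale`), the placeholder dimension keys, and the numbers in the sample payload are all illustrative assumptions, not values from the demo.

```python
import json

ALLOWED_VERDICTS = {"A", "B", "tie"}


def parse_verdict(raw: str, n_dimensions: int = 5) -> dict:
    """Validate a judge's JSON reply and sum the per-dimension scores."""
    data = json.loads(raw)
    if data["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data['verdict']!r}")

    totals = {}
    for label in ("A", "B"):
        scores = data["scores"][label]  # assumed shape: {dimension_name: number}
        if len(scores) != n_dimensions:
            raise ValueError(f"expected {n_dimensions} dimensions for {label}")
        totals[label] = sum(scores.values())

    return {"verdict": data["verdict"], "totals": totals,
            "rationale": data["rationale"]}


# Illustrative payload only; dimension names and scores are placeholders.
sample = json.dumps({
    "verdict": "B",
    "scores": {
        "A": {"dim_1": 3, "dim_2": 3, "dim_3": 3, "dim_4": 3, "dim_5": 3},
        "B": {"dim_1": 4, "dim_2": 3, "dim_3": 4, "dim_4": 3, "dim_5": 4},
    },
    "rationale": "placeholder quote from response B",
})
parsed = parse_verdict(sample)
```

Rejecting malformed replies up front keeps a judge that drifts from the expected schema from silently producing a misleading total.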
The author also addresses ties explicitly: on low-complexity, single-turn prompts where `gpt-4o` already performs well, both agents will produce similar responses and the judge will call a tie, which the author frames as a meaningful signal rather than a tool failure. The scaffold's advantage surfaces specifically on prompts where baseline `gpt-4o` exhibits failure modes such as sycophancy, shallow single-cause framing, generic templated responses, or missed differential diagnoses on ambiguous data.
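One way to read ties as signal is to tally outcomes over a small set of prompts and look at the share of each result. The helper below is a sketch under that assumption; the article does not describe a batch mode, and `summarize` and its outcome names (`baseline`, `augmented`, `tie`) are hypothetical.

```python
from collections import Counter
from typing import Iterable


def summarize(winners: Iterable[str]) -> dict:
    """Return the share of baseline wins, augmented wins, and ties."""
    counts = Counter(winners)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to summarize")
    # Counter returns 0 for outcomes that never occurred.
    return {name: counts[name] / total
            for name in ("baseline", "augmented", "tie")}
```

A high tie share on a prompt set would suggest, in the author's framing, that the prompts are ones the baseline already handles well.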
Key facts
- The module forks any prompt through two `gpt-4o` agents at temperature 0: one baseline, one augmented with an Ejentum reasoning scaffold injected via a function-call tool at runtime.
- A blind Gemini Flash judge scores both responses on five dimensions and returns a verdict of 'A', 'B', or 'tie', along with a quoted rationale.
- In a medical second-opinion demo, the scaffolded agent scored 20 vs. 16 for the baseline, with the judge citing its identification of sluggishness as a symptom and recommendation of thyroid testing.
- All three system prompts (baseline, augmented, evaluator) are published as readable markdown in the repo for full auditability.