Apr 23, 2026 · 1 min read · Research Papers

Blind A/B test finds only 7 of 40 Claude prompt codes shift reasoning

Samarth Bhamare blind-tested 40 popular Claude prompt codes across coding, analysis, and writing tasks and found only 7 measurably changed Claude's reasoning — the rest just altered its tone.

Dev.to #claude·Samarth Bhamare

Composite score: 6.3 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Practitioners can stop wasting time on hyped prompt codes like `GODMODE` and `BEASTMODE`, and instead focus on the 7 empirically validated codes — especially `/skeptic` and `L99` — to meaningfully change Claude's reasoning behavior rather than just its tone.

Summary: our read of the original

Samarth Bhamare conducted a three-month blind A/B testing study of 40 Claude prompt codes — the `L99`, `/skeptic`, `GODMODE`, `ULTRATHINK`-style strings that circulate widely on Reddit and Twitter. Each code was tested with a fresh context window, across fixed task batteries spanning coding, analysis, and writing, with 12–20 runs per code and blind-rated outputs. The full findings are written up in a 31-page PDF available free without an email wall.
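The blinding step described above can be sketched as follows. This is a minimal illustration, not the author's harness: `generate` is a hypothetical callable standing in for one model call in a fresh context window, and prepending the code to the task text is an assumption about how the codes were applied. The rater sees only shuffled items with numeric ids; the arm labels stay in a separate key until scoring.

```python
import random
from dataclasses import dataclass


@dataclass
class BlindItem:
    item_id: int   # the only identifier the rater sees
    task: str
    output: str


def build_blind_batch(tasks, code, generate, runs_per_arm=12, seed=0):
    """Return (items_for_rater, answer_key) with arm labels hidden from the rater."""
    rng = random.Random(seed)
    labelled = []
    for task in tasks:
        for _ in range(runs_per_arm):
            # Each generate() call stands in for a fresh context window (assumption).
            labelled.append(("baseline", task, generate(task)))
            labelled.append(("treated", task, generate(f"{code} {task}")))
    rng.shuffle(labelled)  # mix both arms so position reveals nothing
    items = [BlindItem(i, task, out) for i, (_, task, out) in enumerate(labelled)]
    key = {i: arm for i, (arm, _, _) in enumerate(labelled)}
    return items, key


def hit_rates(ratings, key):
    """ratings maps item_id -> bool (the rater's judgement); returns the rate per arm."""
    totals = {"baseline": 0, "treated": 0}
    hits = {"baseline": 0, "treated": 0}
    for item_id, arm in key.items():
        totals[arm] += 1
        hits[arm] += bool(ratings[item_id])
    return {arm: hits[arm] / totals[arm] for arm in totals}
```

In use, the rater fills in `ratings` from `items` alone, and `hit_rates(ratings, key)` is computed only after every item has been judged.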

The central finding is that only 7 of the 40 codes produced measurable changes in Claude's reasoning. The remaining 33 altered tone and presentation — less hedging, different voice — but left the underlying reasoning unchanged. The 7 codes with real signal are `/skeptic`, `L99`, `/blindspots`, `/crit`, `ULTRATHINK`, `/deep`, and `/premortem`. The most striking effect sizes: `/skeptic` caught wrong-premise questions 79% of the time versus a 14% baseline, and `L99` produced a committed answer 11 out of 12 times versus 2 out of 12 at baseline. Codes that sounded powerful but measured as noise include `GODMODE`, `BEASTMODE`, most jailbreak variants, and most "you are an expert in X" prefixes.

Bhamare acknowledges the study's limitations: small sample sizes, single rater, and model drift over time. All tests were conducted on Opus 4.6, Sonnet 4.5, and Haiku 4.5 as of March 2026. He notes that effect sizes on the 7 real codes are large enough to be meaningful despite the small n — a 79% vs. 14% difference is not ambiguous — but recommends treating borderline cases as "indistinguishable from baseline at this n" rather than definitively fake.
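To see why a gap this large survives a small n, here is a rough check on the one result reported as raw counts: `L99` at 11 of 12 committed answers versus 2 of 12 at baseline. The sketch below runs a one-sided Fisher's exact test using only the standard library. The choice of test is ours, not something the summary attributes to the author, and it says nothing about the single-rater or model-drift limitations.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(top-left cell >= a) under the null, for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, row1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    extreme = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return extreme / total


# Rows: L99, baseline. Columns: committed answer, no committed answer.
p = fisher_one_sided(11, 1, 2, 10)
print(f"one-sided p = {p:.6f}")  # prints "one-sided p = 0.000322"
```

The same check cannot be run on the `/skeptic` figures from this summary alone, because the 79% and 14% rates are given without their run counts.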

Key facts

01 Samarth Bhamare blind A/B tested 40 Claude prompt codes over three months, with 12–20 runs per code and blind-rated outputs.
02 Only 7 of 40 codes measurably shifted Claude's reasoning; the other 33 changed tone or voice only.
03 The 7 codes with real signal: `/skeptic`, `L99`, `/blindspots`, `/crit`, `ULTRATHINK`, `/deep`, and `/premortem`.
04 `/skeptic` caught wrong-premise questions 79% of the time vs. a 14% baseline.
05 `L99` produced a committed answer 11 out of 12 times vs. 2 out of 12 at baseline.
06 Codes measured as noise include `GODMODE`, `BEASTMODE`, most jailbreak variants, and most "you are an expert in X" prefixes.
07 All tests were run on Opus 4.6, Sonnet 4.5, and Haiku 4.5 as of March 2026; the 31-page report is free with no email wall.

Topics

#prompt-engineering #benchmarks #claude #reasoning #empirical-study

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 23, 2026 · 11:04 UTC.
