Apr 23, 2026 · 1 min read · Research Papers

Blind benchmark of 14 LLMs on WordPress plugin dev reveals gaps

A developer blind-tested 14 LLMs on a zero-shot WordPress plugin task, finding Claude 4.7 Opus the top scorer at 68/100 while every model failed to hook into native Gravity Forms UI elements.

Hacker News · guilamu


Composite score: 6.7 out of 10

Weights applied: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%.

Why it matters

WordPress plugin developers replacing Copilot Pro's Opus access should explicitly prompt for native DOM integration and UX edge cases — no current LLM handles these implicitly, even the top-scoring Claude 4.7 Opus.


Summary: our read of the original

After GitHub Copilot silently removed Claude Opus access for Pro accounts, developer guilamu set out to find a reliable replacement for WordPress plugin development. The benchmark covered 14 models — including frontier and local LLMs — each given a fresh IDE, zero prior context, and a minimal zero-shot prompt to build a "Gravity Forms Live Search" plugin. To eliminate personal bias, Gemini 3.1 Pro graded all anonymized outputs against a 100-point rubric benchmarked against a human reference implementation.

The results surfaced several consistent blind spots across all models. Not one of the 14 successfully hooked into the native Gravity Forms search input (`#form_list_search`); every model injected a redundant new `<input>` element instead of analyzing the existing DOM. No model proactively added keyboard shortcuts (Ctrl+F), updated the native item counter as rows were hidden, or implemented background-fetching for paginated results. Most models also used a naive `.toLowerCase()` for text filtering, breaking on accented characters — only a few applied `.normalize('NFD')` for robust diacritic handling. Local models fared worst: Gemma4-26b running on CPU generated a fatal PHP error and scored just 18/100.
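The diacritic problem described above can be illustrated with a minimal sketch. The helper names `foldText` and `matchesQuery` are hypothetical and are not taken from any of the benchmarked plugins; the only calls used are the standard `String.prototype.normalize`, `replace`, `toLowerCase`, and `includes`.

```javascript
// Hypothetical helpers illustrating diacritic-insensitive matching.
// NFD decomposition splits "é" into "e" plus a combining accent (U+0301),
// which the regex then removes before lowercasing.
function foldText(text) {
  return String(text)
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}

// True when the folded row text contains the folded query.
function matchesQuery(rowText, query) {
  return foldText(rowText).includes(foldText(query));
}

// A lowercase-only comparison misses the accented title.
const title = 'Inscription Été';
console.log(title.toLowerCase().includes('ete')); // false
console.log(matchesQuery(title, 'ete'));          // true
```

Folding both the row text and the query means a user can type with or without accents and still find the same form.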

Claude 4.7 Opus led the leaderboard at 68/100, with GLM 5.1 a notable second at 61/100 — ahead of Claude 4.6 Opus (59/100), Mimo v2.5 pro (58/100), and Qwen 3.6+ (55/100). GPT 5.4 xHigh placed 9th at 49/100. GLM 5.1 was highlighted as a "value king" given its OpenRouter pricing of $1.05 in / $3.50 out, roughly 3–4× cheaper than Sonnet 4.6 and 5–7× cheaper than Opus 4.6/4.7. The author's key takeaway: even top models default to the path of least resistance, and native-feeling UX integration requires explicit prompting rather than relying on implicit model knowledge. A follow-up Level 2 test feeding models a WordPress + Gravity Forms reference file is planned.
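The price comparison reduces to simple ratios. The sketch below assumes the quoted figures are US dollars per million tokens, which the summary does not state, and it uses a purely hypothetical comparison model priced at $3.00 in / $15.00 out. Substitute real list prices before drawing any conclusions.

```javascript
// GLM 5.1 prices as quoted in the summary (unit assumed: USD per 1M tokens).
const glm = { input: 1.05, output: 3.5 };
// Hypothetical comparison prices, for illustration only.
const other = { input: 3.0, output: 15.0 };

// Cost of one request given token counts and per-million-token prices.
function requestCost(price, inputTokens, outputTokens) {
  return (price.input * inputTokens + price.output * outputTokens) / 1e6;
}

// How many times more expensive `b` is than `a`, per token direction.
function priceRatios(a, b) {
  return { input: b.input / a.input, output: b.output / a.output };
}

const ratios = priceRatios(glm, other);
console.log(ratios.input.toFixed(2), ratios.output.toFixed(2)); // 2.86 4.29
```

Because input and output prices differ by different factors, the effective saving on a real workload depends on its mix of prompt and completion tokens, which `requestCost` makes explicit.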

Key facts

01 Claude 4.7 Opus scored highest at 68/100 in a 14-model blind benchmark for WordPress plugin development.
02 Zero out of 14 models hooked into the native Gravity Forms search input (`#form_list_search`); all injected a redundant new input element instead.
03 No model proactively added keyboard shortcuts (Ctrl+F), updated the native item counter, or implemented background-fetching for paginated results.
04 Most models used `.toLowerCase()` for filtering, breaking on accented characters; only a few applied `.normalize('NFD')` for diacritic handling.
05 GLM 5.1 placed 2nd at 61/100 and is priced at $1.05 in / $3.50 out on OpenRouter, roughly 3–4× cheaper than Sonnet 4.6.
06 Gemma4-26b running locally on CPU generated a fatal PHP error and scored 18/100, the lowest in the benchmark.
07 GPT 5.4 xHigh placed 9th at 49/100; Gemini 3.1 Pro placed 7th at 53/100.
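Facts 02 and 03 describe integration behaviour that can be sketched in a few lines. In the sketch below only `#form_list_search` comes from the article; the row selector `#the-list tr` and counter selector `.displaying-num` are assumptions based on typical WordPress list-table markup, and the function names are hypothetical. Plain lowercase matching is used for brevity, so diacritic folding would need to be layered on top.

```javascript
// Sketch only: '#form_list_search' comes from the article; the row and
// counter selectors are assumptions based on WordPress list-table markup.
const SEARCH_SELECTOR = '#form_list_search';
const ROW_SELECTOR = '#the-list tr';        // assumption
const COUNTER_SELECTOR = '.displaying-num'; // assumption

// Hide rows that do not contain the query; return how many stay visible.
function filterRows(rows, query) {
  const needle = query.trim().toLowerCase();
  let visible = 0;
  for (const row of rows) {
    const match = row.textContent.toLowerCase().includes(needle);
    row.style.display = match ? '' : 'none';
    if (match) visible += 1;
  }
  return visible;
}

// Wire the existing input instead of injecting a new one.
function attachLiveSearch(doc) {
  const input = doc.querySelector(SEARCH_SELECTOR);
  if (!input) return false; // native field not found: do nothing

  input.addEventListener('input', () => {
    const rows = Array.from(doc.querySelectorAll(ROW_SELECTOR));
    const visible = filterRows(rows, input.value);
    const counter = doc.querySelector(COUNTER_SELECTOR);
    if (counter) {
      counter.textContent = `${visible} ${visible === 1 ? 'item' : 'items'}`;
    }
  });

  // Ctrl+F (or Cmd+F) focuses the native field instead of the browser find bar.
  doc.addEventListener('keydown', (event) => {
    if ((event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'f') {
      event.preventDefault();
      input.focus();
    }
  });
  return true;
}

// In a browser admin page: attachLiveSearch(document);
```

Returning `false` when the native field is missing keeps the script inert on pages where the assumed markup does not exist, rather than falling back to injecting a second input.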

Topics

#benchmarks #code-generation #llm-comparison #wordpress-development #prompt-engineering

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 24, 2026 · 17:11 UTC.
