Apr 22, 2026 · 1 min read · Applications & Use Cases

Qwen3.5 27B and Gemma4 31B ace local coding eval, but 31B runs 10+ hours

u/Lowkey_LokiSN's follow-up personal eval finds that Qwen3.5-27B Q4 and Gemma4-31B Q4 both fixed all 37 baseline failures with zero regressions. Gemma4-26B Q8 performed worse than its Q4 counterpart, and Gemma4-31B needed over 10 hours of wall-clock time.

r/LocalLLaMA·u/Lowkey_LokiSN

Composite score: 5.0 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Both Qwen3.5-27B and Gemma4-31B reached a perfect score, so practitioners running local agentic coding workloads should choose between them on cost rather than accuracy: Qwen3.5-27B was the most token-efficient model, while Gemma4-31B took over 10 hours of runtime and used 70GB of DRAM. That gap matters when picking a model for automated fix pipelines.

Summary: our read of the original

u/Lowkey_LokiSN published a follow-up personal eval on r/LocalLLaMA comparing five locally run models on an agentic coding benchmark with 37 baseline failures. Three models are new since the previous comparison: Gemma4-26B at Q8_K_XL quantization (to test whether the earlier poor result was a quantization tax), Qwen3.5-27B Q4_K_XL (a dense model many were curious about), and Gemma4-31B Q4_K_XL (another dense model). Both dense models, Qwen3.5-27B and Gemma4-31B, fixed all 37 failures with zero regressions, for a perfect net score of 37. By contrast, Qwen3.6-35B Q4 fixed 32 (net score 32), Gemma4-26B Q4 fixed 28 but introduced 8 regressions (net score 20), and Gemma4-26B Q8 fixed only 17 with zero regressions (net score 17). The author concludes the dense models are "in a different league" from the MoE ones.
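The net scores quoted above are consistent with a simple "fixes minus regressions" rule. The sketch below reproduces them in Python under that assumption. The formula is inferred from the reported numbers and is not a definition given by the author. The zero-regression count for Qwen3.6-35B is likewise inferred from its net score matching its fix count.

```python
# Figures as reported in the summary: (fixes, regressions) out of 37 baseline failures.
# ASSUMPTION: net score = fixes - regressions. This is inferred from the reported
# numbers (e.g. 28 fixes, 8 regressions, net 20), not stated as the author's formula.
REPORTED = {
    "Qwen3.5-27B Q4": (37, 0),
    "Gemma4-31B Q4": (37, 0),
    "Qwen3.6-35B Q4": (32, 0),  # regressions inferred from net score 32
    "Gemma4-26B Q4": (28, 8),
    "Gemma4-26B Q8": (17, 0),
}


def net_score(fixes: int, regressions: int) -> int:
    """Fixes minus regressions (assumed scoring rule)."""
    return fixes - regressions


def ranked(results):
    """Return (model, net score) pairs, highest net score first."""
    scored = [(name, net_score(*counts)) for name, counts in results.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in ranked(REPORTED):
        print(f"{name:16s} net={score}")
```

Under this rule the Q4 build of Gemma4-26B still edges out the Q8 build on net score (20 versus 17), even though its 8 regressions cost it more than a quarter of its raw fixes.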

On efficiency, Qwen3.5-27B stood out as the most token-efficient, spending ~16K tokens per fix and a grand total of 595,320 tokens (I+O), the lowest of all five models. Qwen3.5-27B also made the most tool calls (181 total) and read files the most thoroughly (4.0 reads per unique file).

Gemma4-31B was the cleanest tool caller, with 100 calls that all succeeded, but its wall-clock time of 37,748 seconds (about 629 minutes, or roughly 10.5 hours) was drastically slower than every other model, with an average step duration of 82.2 seconds versus Qwen3.6-35B's 10.0 seconds. The author also noted that DRAM usage ballooned to 70GB on Gemma4-31B even with the `-cram` and `-ctkcp` flags set, and flagged this as a potential issue. The Q8 upgrade for Gemma4-26B was deemed not worth it: performance was similar to or slightly worse than Q4, and the author plans to stick with Q4_K_XL for that model.
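The efficiency figures above can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes tokens per fix, converts the wall-clock times, and derives slowdown ratios from the reported numbers only. The function names are illustrative and are not taken from the original post.

```python
# All inputs below are figures reported in the summary; nothing is measured here.
# Function names are hypothetical helpers for illustration only.


def tokens_per_fix(total_tokens: int, fixes: int) -> float:
    """Average input+output tokens spent per fixed failure."""
    return total_tokens / fixes


def seconds_to_hm(seconds: int) -> tuple:
    """Convert whole seconds to (hours, minutes), dropping leftover seconds."""
    minutes = seconds // 60
    return minutes // 60, minutes % 60


def ratio(slower: float, faster: float) -> float:
    """How many times larger the first value is than the second."""
    return slower / faster


if __name__ == "__main__":
    # Qwen3.5-27B: 595,320 total tokens across 37 fixes.
    print(f"tokens per fix: {tokens_per_fix(595_320, 37):,.0f}")

    # Gemma4-31B versus Qwen3.6-35B wall-clock times in seconds.
    for name, secs in (("Gemma4-31B", 37_748), ("Qwen3.6-35B", 2_950)):
        hours, minutes = seconds_to_hm(secs)
        print(f"{name}: {hours}h {minutes}m")

    print(f"wall-clock ratio: {ratio(37_748, 2_950):.1f}x")
    print(f"per-step ratio:   {ratio(82.2, 10.0):.1f}x")
```

The wall-clock ratio comes out larger than the per-step ratio, which would be consistent with Gemma4-31B also taking more steps overall, although the summary does not report step counts directly.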

Key facts

01 Both Qwen3.5-27B Q4 and Gemma4-31B Q4 fixed all 37 baseline failures with zero regressions, achieving a perfect net score of 37.
02 Gemma4-26B Q8 fixed only 17 of 37 failures, fewer than its Q4 counterpart (28 fixes, net score 20 after 8 regressions), which suggests quantization tax was not the cause of the earlier poor results.
03 Qwen3.5-27B was the most token-efficient model at ~16K tokens per fix and a grand total of 595,320 input+output tokens.
04 Gemma4-31B ran for 37,748 seconds (about 629 minutes, over 10 hours) with an average step duration of 82.2 seconds.
05 Gemma4-31B's DRAM usage ballooned to 70GB even with `-cram` and `-ctkcp` flags enabled.
06 Gemma4-31B made 100 tool calls with a 100% success rate, the cleanest tool-call record of all five models.
07 Qwen3.6-35B Q4 was the fastest model at 2,950 seconds (about 49 minutes) wall-clock time with a 10.0s average step duration.

Topics

#benchmarks #code-generation #open-source #model-release #quantization

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC.
