Qwen3.5 27B and Gemma4 31B ace local coding eval, but 31B runs 10+ hours
u/Lowkey_LokiSN's follow-up personal eval finds both Qwen3.5-27B Q4 and Gemma4-31B Q4 fixed all 37 baseline failures with zero regressions, while Gemma4-26B Q8 actually performed worse than its Q4 counterpart, and Gemma4-31B suffered extreme slowness at over 10 hours of wall-clock time.
Score breakdown
Practitioners running local agentic coding workloads should weigh Qwen3.5-27B's token efficiency and speed against Gemma4-31B's perfect accuracy but extreme resource demands — over 10 hours of runtime and 70GB DRAM — before choosing a model for automated fix pipelines.
- 01Both Qwen3.5-27B Q4 and Gemma4-31B Q4 fixed all 37 baseline failures with zero regressions, achieving a perfect net score of 37.
- 02Gemma4-26B Q8 fixed only 17 of 37 failures — worse than its Q4 counterpart (28 fixes), ruling out quantization tax as the cause of earlier poor results.
- 03Qwen3.5-27B was the most token-efficient model at ~16K tokens per fix and a grand total of 595,320 input+output tokens.
u/Lowkey_LokiSN published a follow-up personal eval on r/LocalLLaMA comparing five locally-run models on an agentic coding benchmark with 37 baseline failures. The three new additions to the previous comparison were Gemma4-26B at Q8_K_XL quantization (to test whether the earlier poor result was a quantization tax), Qwen3.5-27B Q4_K_XL (a dense model many were curious about), and Gemma4-31B Q4_K_XL (another dense model). Both dense models — Qwen3.5-27B and Gemma4-31B — fixed all 37 failures with zero regressions, achieving a perfect net score of 37. By contrast, Qwen3.6-35B Q4 fixed 32 (net score 32), Gemma4-26B Q4 fixed 28 but introduced 8 regressions (net score 20), and Gemma4-26B Q8 fixed only 17 with zero regressions (net score 17). The author concludes the dense models are "in a different league" from the MoE ones.
On efficiency, Qwen3.5-27B stood out as the most token-efficient, spending ~16K tokens per fix and a grand total of 595,320 tokens (I+O), the lowest of all five models.
On efficiency, Qwen3.5-27B stood out as the most token-efficient, spending ~16K tokens per fix and a grand total of 595,320 tokens (I+O), the lowest of all five models. It also made the most tool calls (181 total) and read files the most thoroughly (4.0 reads per unique file). Gemma4-31B was the cleanest tool caller — 100 calls, all successful — but its wall-clock time of 37,748 seconds (629 minutes, or over 10 hours) was drastically slower than all other models, with an average step duration of 82.2 seconds versus Qwen3.6-35B's 10.0 seconds. The author also noted DRAM usage ballooned to 70GB on Gemma4-31B even with `-cram` and `-ctkcp` flags set, flagging this as a potential issue. The Q8 upgrade for Gemma4-26B was deemed not worth it — performance was similar to or slightly worse than Q4, and the author plans to stick with Q4_K_XL for that model.
Key facts
- 01Both Qwen3.5-27B Q4 and Gemma4-31B Q4 fixed all 37 baseline failures with zero regressions, achieving a perfect net score of 37.
- 02Gemma4-26B Q8 fixed only 17 of 37 failures — worse than its Q4 counterpart (28 fixes), ruling out quantization tax as the cause of earlier poor results.
- 03Qwen3.5-27B was the most token-efficient model at ~16K tokens per fix and a grand total of 595,320 input+output tokens.
- 04Gemma4-31B ran for 37,748 seconds (629 minutes, over 10 hours) with an average step duration of 82.2 seconds.
- 05Gemma4-31B's DRAM usage ballooned to 70GB even with `-cram` and `-ctkcp` flags enabled.
- 06Gemma4-31B made 100 tool calls with a 100% success rate, the cleanest tool-call record of all five models.