Qwen3.6-35B hits 78.7% on Polyglot with little-coder scaffold
u/Creative-Regular6799 reports that pairing the local Qwen3.6 35B model with the `little-coder` agent scaffold achieves a 78.7% success rate on the public Polyglot benchmark, placing it in the top 10 and making it competitive with leading cloud models.
Developers running local models should check whether their agent scaffold, and not just the model itself, is the bottleneck: `little-coder` suggests that the right harness can close much of the gap between local and cloud model coding performance.
u/Creative-Regular6799 posted a follow-up on r/LocalLLaMA showing that the `little-coder` agent scaffold substantially improves the coding benchmark performance of local models. In a prior experiment, changing only the scaffold around a 9B Qwen model pushed its score from 19.11% to 45.56%. Applying the same approach to the larger Qwen3.6 35B model produced an even more striking result: a 78.7% success rate on the public Polyglot benchmark, placing it in the top 10 overall and making it competitive with leading cloud-hosted models.
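To put the 9B scaffold result in perspective, the arithmetic behind the two reported scores can be sketched as follows. This is only an illustration: the `ScaffoldGain` type and `scaffold_gain` helper are hypothetical names introduced here, not part of `little-coder`, and the only inputs are the 19.11% and 45.56% figures quoted in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaffoldGain:
    """Hypothetical container for comparing two benchmark scores."""
    points: float  # absolute gain, in percentage points
    ratio: float   # new score divided by old score


def scaffold_gain(before_pct: float, after_pct: float) -> ScaffoldGain:
    """Compare two success rates given as percentages (e.g. 19.11 for 19.11%)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return ScaffoldGain(points=after_pct - before_pct,
                        ratio=after_pct / before_pct)


# Scores reported for the same 9B Qwen model before and after the scaffold change.
gain = scaffold_gain(19.11, 45.56)
print(f"+{gain.points:.2f} percentage points, {gain.ratio:.2f}x the original score")
# prints: +26.45 percentage points, 2.38x the original score
```

In other words, the scaffold change alone accounts for a gain of about 26 percentage points, or roughly 2.4 times the original score, with the model weights held fixed.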
The author argues this points to a systemic issue of "harness mismatch": the hypothesis that local coding models have been evaluated inside scaffolds designed for a different class of model, artificially suppressing their apparent performance. Beyond Polyglot, Qwen3.6 35B also achieved a 40% success rate on Terminal Bench 1 (`0.1.1`), with the author noting that no other model of comparable size performs in that range. Terminal Bench 2 results are pending, and GAIA is next on the evaluation roadmap. The `little-coder` scaffold is open-sourced on GitHub, a full write-up is available on Substack, and a `pi.dev` adaptation was added following community requests.
Key facts
- Qwen3.6 35B paired with the `little-coder` scaffold achieves a 78.7% success rate on the public Polyglot benchmark.
- This places Qwen3.6 35B in the Polyglot public top 10, competitive with leading cloud models.
- A prior experiment showed that changing the scaffold around a 9B Qwen model moved its score from 19.11% to 45.56%.
- The author attributes part of the cloud-vs-local performance gap to "harness mismatch": local models tested in scaffolds built for a different model class.
- Qwen3.6 35B scored 40% on Terminal Bench 1 (version `0.1.1`); the author reports that no other model of comparable size performs in that range.
- Terminal Bench 2 and GAIA (for research capabilities) are the next planned benchmarks.
- The `little-coder` scaffold is open-sourced on GitHub, with a full write-up published on Substack.