Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth, KV cache quants, and long context
u/OsmanthusBloom ran 144 benchmark runs of Qwen3.6-35B-A3B GGUFs with tool-eval-bench 2.0.4, comparing ByteShape and Unsloth quants across three KV cache quantization levels and two context lengths. The findings: no clear winner between the two providers, q8_0 KV cache quantization is essentially free, q4_0 is noticeably worse, and long context significantly degrades tool calling in every configuration.
Score breakdown
This benchmark directly addresses a gap the post identifies — the lack of tool-calling quality evaluations for popular local GGUF quants — and provides concrete, reproducible evidence that KV cache quantization level and context length have measurable effects on tool-calling accuracy for Qwen3.6-35B-A3B.
u/OsmanthusBloom on r/LocalLLaMA conducted a structured benchmark of tool-calling quality for Qwen3.6-35B-A3B, motivated by a community discussion about the lack of tool-calling evaluations for local models. The eight GGUFs tested spanned a range of sizes and quantization types: ByteShape IQ3_S-3.48bpw (15.1 GB), ByteShape IQ4_XS-4.15bpw (18.0 GB), ByteShape Q4_K_S-4.22bpw (18.3 GB), Unsloth UD-IQ3_XXS (13.2 GB), Unsloth UD-Q3_K_XL (16.8 GB), Unsloth UD-IQ4_XS (17.7 GB), Unsloth UD-Q4_K_M (22.1 GB), and Unsloth UD-Q6_K (29.3 GB). The software stack was llama.cpp version 9529 (96fbe0039) with CUDA support and tool-eval-bench 2.0.4 in `--hardmode`, which runs 84 separate tests scored at 2 points for a fully correct tool call, 1 for a partially correct one, and 0 for a failure, for a theoretical maximum of 168 points. Three KV cache configurations were tested (f16, q8_0/q8_0, and q4_0/q4_0), and context pressure was set to either 0.0 (an approximately 5k-token system prompt) or 0.5 (an additional 122k tokens of filler text simulating a half-filled context window). Each of the 48 combinations (8 GGUFs × 3 KV cache settings × 2 context settings) was run three times with different random seeds, for 144 runs in total.
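The run matrix described above can be sketched in a few lines of Python. This is only an illustration of how the counts add up, not the author's actual harness: the seed values and the `kv_cache_args` helper are hypothetical, and the only llama.cpp options used are the existing `--cache-type-k`, `--cache-type-v`, and `--ubatch-size` flags.

```python
from itertools import product

# Quant labels as listed in the post.
GGUFS = [
    "ByteShape IQ3_S-3.48bpw",
    "ByteShape IQ4_XS-4.15bpw",
    "ByteShape Q4_K_S-4.22bpw",
    "Unsloth UD-IQ3_XXS",
    "Unsloth UD-Q3_K_XL",
    "Unsloth UD-IQ4_XS",
    "Unsloth UD-Q4_K_M",
    "Unsloth UD-Q6_K",
]
KV_CACHE_TYPES = ["f16", "q8_0", "q4_0"]  # same type for K and V
CONTEXT_PRESSURES = [0.0, 0.5]
SEEDS = [0, 1, 2]  # hypothetical values; the post only says "three different seeds"


def kv_cache_args(cache_type: str) -> list[str]:
    """Hypothetical helper: llama.cpp flags selecting the K and V cache types."""
    return [
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "--ubatch-size", "2048",  # value mentioned in the post
    ]


# One configuration = (GGUF, KV cache type, context pressure).
configs = list(product(GGUFS, KV_CACHE_TYPES, CONTEXT_PRESSURES))
# One run = one configuration with one seed.
runs = list(product(configs, SEEDS))

print(len(configs), len(runs))  # 48 144
```

Keeping the seed as the innermost dimension makes it easy to average the three repeats of each configuration afterwards.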
The short-context runs took approximately 15 minutes each, while the long-context runs took around 4 hours each. The bottleneck was prompt processing (PP, i.e. prefill) speed, and `--ubatch-size` was increased to 2048 to partially address this. The total compute consumed was approximately 300 GPU-hours including experimental and failed runs.
The top-level conclusions are that ByteShape and Unsloth quants of comparable size show no clear quality winner, q8_0 KV cache quantization costs essentially nothing in tool-calling accuracy, q4_0 KV cache quantization is noticeably worse, and switching from short to long context (the 0.5 context-pressure setting) significantly degrades tool-calling performance across all quants and KV cache configurations tested. The source text is truncated before detailed per-quant score tables are presented.
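As a rough sanity check, the per-run timings are consistent with the reported compute budget. The sketch below assumes each run occupies a single GPU and that the 144 runs split evenly between the two context settings. Both are assumptions, and the durations are the approximate figures quoted above.

```python
# Half of the 144 runs use context pressure 0.0, the other half 0.5.
SHORT_RUNS = 8 * 3 * 3  # GGUFs x KV cache settings x seeds = 72
LONG_RUNS = 8 * 3 * 3   # same grid at context pressure 0.5

SHORT_HOURS = 15 / 60   # ~15 minutes per short-context run
LONG_HOURS = 4.0        # ~4 hours per long-context run

short_total = SHORT_RUNS * SHORT_HOURS
long_total = LONG_RUNS * LONG_HOURS
total_gpu_hours = short_total + long_total

print(short_total, long_total, total_gpu_hours)  # 18.0 288.0 306.0
```

The long-context runs account for nearly all of the estimate, which matches the post's point that prefill speed was the bottleneck.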
Key facts
- 01 Eight Qwen3.6-35B-A3B GGUFs were tested: three ByteShape and five Unsloth, ranging from 13.2 GB to 29.3 GB.
- 02 Benchmarking used tool-eval-bench 2.0.4 in `--hardmode`: 84 tests, maximum 168 points (2 points for a full pass, 1 for a partial pass, 0 for a failure).
- 03 Three KV cache settings were compared: f16 (default), q8_0/q8_0, and q4_0/q4_0.
- 04 Context pressure 0.0 = ~5k-token short prompt; context pressure 0.5 = an additional 122k tokens of filler text.
- 05 144 total runs (8 GGUFs × 3 KV quants × 2 context lengths × 3 random seeds); ~300 GPU-hours on 32GB V100s.
- 06 Key findings: no clear winner between ByteShape and Unsloth; q8_0 KV cache quantization is essentially free; q4_0 KV cache quantization is noticeably worse; long context significantly degrades tool calling across all configurations.
- 07 llama.cpp version 9529 (96fbe0039) with CUDA support was used as the inference backend.
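The scoring scheme in the key facts can be illustrated with a minimal sketch. The function name and the example outcomes below are hypothetical and are not results from the benchmark. Only the 2/1/0 point values and the 84-test count come from the post.

```python
FULL, PARTIAL, FAIL = 2, 1, 0   # points per test outcome
NUM_TESTS = 84
MAX_SCORE = NUM_TESTS * FULL    # 168


def total_score(outcomes: list[int]) -> int:
    """Sum per-test points for one run, checking the run is complete."""
    if len(outcomes) != NUM_TESTS:
        raise ValueError(f"expected {NUM_TESTS} outcomes, got {len(outcomes)}")
    if any(o not in (FULL, PARTIAL, FAIL) for o in outcomes):
        raise ValueError("each outcome must be 0, 1, or 2")
    return sum(outcomes)


# Made-up outcomes for illustration only: 70 full, 10 partial, 4 failed.
example = [FULL] * 70 + [PARTIAL] * 10 + [FAIL] * 4
score = total_score(example)

print(score, MAX_SCORE)  # 150 168
print(f"{score / MAX_SCORE:.1%}")
```

With three seeds per configuration, the per-run totals would typically be averaged before comparing quants or KV cache settings.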
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.