Serve Qwen3.6-35B-A3B locally and build a coding agent with tool calling
Lavelle Hatcher Jr walks through serving Qwen3.6-35B-A3B — a 35B sparse MoE model scoring 73.4% on SWE-bench Verified — locally with vLLM and wiring it up as a tool-calling coding agent via the OpenAI SDK.
Score breakdown
Lavelle Hatcher Jr published a hands-on guide to running Qwen3.6-35B-A3B, released April 16, 2026 under Apache 2.0, as a local coding agent. The sparse mixture-of-experts model has 35 billion total parameters but only ~3 billion active per token, and scores 73.4% on SWE-bench Verified and 37.0 on MCPMark — more than double Gemma 4-31B's MCPMark score of 18.1. The tutorial covers installing `vllm>=0.19.0`, launching the OpenAI-compatible server with the critical `--tool-call-parser qwen3_coder` flag, and building a multi-step agentic loop where the model calls tools like `search_files`, `read_file`, and `write_file` to autonomously complete coding tasks.
Published on Dev.to by Lavelle Hatcher Jr, this tutorial covers the full stack for running Qwen3.6-35B-A3B as a self-hosted coding agent. The model, released April 16, 2026 under Apache 2.0, is a sparse MoE architecture with 35 billion total parameters and only ~3 billion active per token. It achieves 73.4% on SWE-bench Verified and 37.0 on MCPMark, making it one of the strongest open-weight models for agentic coding currently available. The guide requires `vllm>=0.19.0` (older versions lack support for the `Qwen3MoeSparseMoeBlock` architecture), Python 3.12, and at minimum a single NVIDIA RTX 4090 24GB GPU. On that hardware, `--max-model-len` must be capped at 32768 or 65536 — attempting the full 262,144-token context will OOM. For full context, a 4-GPU setup with `--tensor-parallel-size 4` is recommended.
The article highlights two flags as non-negotiable for agentic use: `--reasoning-parser qwen3` enables the model's internal thinking mode, improving accuracy on coding tasks, and `--tool-call-parser qwen3_coder` is described as the single most common setup mistake — without it, the model emits valid tool-call JSON but vLLM silently fails to parse it into structured `tool_calls` objects. Once the server is running at `http://localhost:8000/v1`, the standard OpenAI Python SDK connects with a one-line `base_url` swap. The tutorial then defines three tools — `search_files`, `read_file`, and `write_file` — and implements a multi-step `run_agent` loop capped at 10 steps, where the model iteratively calls tools, receives results, and decides its next action until the coding task is complete.
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 19, 2026 · 11:33 UTC. How this works →