Run a local coding agent on macOS with llama.cpp and MTP speculative decoding
Kyle Howells details how to set up a fully local coding agent on macOS using Gemma 4 26B-A4B and llama.cpp with MTP speculative decoding, achieving 72.2 tokens/second on an Apple M1 Max with 64 GB unified memory.
Score breakdown
The guide demonstrates that a fully local, offline-capable coding agent running on consumer Apple Silicon hardware can reach usable generation speeds through llama.cpp MTP speculative decoding, outperforming the Mac-native MLX runtime for this workload.
- 01Tested on an Apple M1 Max with 64 GB unified memory running macOS 15.7.7
- 02Main model is gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf (~16 GB); full model folder with MTP draft head and multimodal projector is ~17 GB
- 03Baseline llama.cpp + Metal generated at 58.2 tokens/second
Kyle Howells describes building a local coding agent on macOS motivated by internet outages that cut off access to cloud AI tools. The final setup runs on an Apple M1 Max with 64 GB unified memory under macOS 15.7.7, using `llama.cpp` compiled with Metal acceleration, the `gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` quantized model (approximately 16 GB on disk, ~17 GB with the MTP draft head and multimodal projector), the `gemma-4-26B-A4B-it-Q8_0-MTP.gguf` draft model for Multi-Token Prediction speculative decoding, and Pi as the terminal-based coding agent. The setup exposes an OpenAI-compatible API so it can be used with other tools, and includes the Gemma 4 multimodal projector to support feeding screenshots to the model.
Benchmarking with a fixed 128-token prompt showed the baseline `llama.cpp` + Metal configuration generating at 58.2 tokens/second.
Benchmarking with a fixed 128-token prompt showed the baseline `llama.cpp` + Metal configuration generating at 58.2 tokens/second. Adding the MTP draft model and sweeping `--spec-draft-n-max` values from 1 to 6 found that 3 draft tokens was optimal on the M1 Max, yielding 72.2 tokens/second — a 1.24x improvement over the baseline. A comparison against `mlx-lm` showed `llama.cpp` with MTP was the clear winner: the best MLX result was 45.8 tokens/second with an Unsloth UD MLX 4-bit model, and other MLX variants came in lower still. The article also notes a configuration detail for Pi: the local model entry must declare `"input": ["text", ...]` with image support enabled, otherwise Pi does not pass image tool output to the model correctly. The source text is truncated before the image-support configuration is fully described.
Key facts
- 01Tested on an Apple M1 Max with 64 GB unified memory running macOS 15.7.7
- 02Main model is gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf (~16 GB); full model folder with MTP draft head and multimodal projector is ~17 GB
- 03Baseline llama.cpp + Metal generated at 58.2 tokens/second
- 04Adding the Q8 MTP draft model with --spec-draft-n-max 3 raised generation to 72.2 tokens/second (1.24x speedup)
- 05Sweeping --spec-draft-n-max from 1–6 found 3 draft tokens optimal on M1 Max; values above 4 degraded performance
- 06llama.cpp + MTP outperformed all tested MLX runtimes; best MLX result was 45.8 tok/s via mlx-lm with Unsloth UD MLX 4-bit
- 07Pi is used as the terminal coding agent, with the model entry requiring image input to be declared for multimodal support
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 13, 2026 · 08:58 UTC. How this works →