Lemonade v10.8 adds auto memory management and MCP gateway for local models
Lemonade v10.8 ships dynamic VRAM management, a provider-agnostic cloud offload backend, expanded LMX-Omni image generation controls, and an MCP gateway that exposes local models as callable tools.
Score breakdown
The MCP gateway turns a local Lemonade server into a set of callable tools for any MCP-aware host, removing the need to route those requests to a cloud API.
- 01v10.8 was a 20-contributor release shipped in 7 days
- 02Dynamic VRAM management auto-unloads idle models and downsizes KV-cache to reclaim GPU memory on the fly
- 03Model pinning prevents chosen models from being evicted from GPU memory
Lemonade v10.8 lands a set of memory and context management improvements designed to reduce manual tuning. Dynamic VRAM management now auto-unloads idle models and downsizes their KV-cache to reclaim GPU memory on the fly, while a model-pinning feature ensures frequently used models are never evicted. Automatic context sizing removes the need to hand-tune context length by deriving it from available memory and the model's architecture.
LMX-Omni gains expanded image generation controls (size, steps, etc.) and the ability to pull and share custom omni models from Hugging Face.
A new provider-agnostic cloud offload backend allows chat completions to be served from any OpenAI-compatible provider — Fireworks, OpenRouter, Together, or OpenAI — right alongside local models, with switching available from the CLI or UI. The post describes the design philosophy as "local-first, with cloud as an option, not a default," with a stated goal of eventually enabling applications to route between client and cloud based on their own routing policies. LMX-Omni gains expanded image generation controls (size, steps, etc.) and the ability to pull and share custom omni models from Hugging Face.
The MCP gateway (`POST /mcp`) exposes five tools — model listing, chat, audio transcription, image generation, and multimodal omni — allowing any MCP-aware host to call local Lemonade models as tools instead of reaching for a cloud API. Platform expansion in this release covers NVIDIA GB10 (Blackwell) arm64 CUDA, TheRock ROCm on Windows for Radeon RX GPUs, ROCm for Radeon 840M/860M iGPUs, `whisper.cpp` moved to ROCm on Windows and Linux, a dedicated Debian 13 build, and a CDNA datacenter GPU detection fix.
Key facts
- 01v10.8 was a 20-contributor release shipped in 7 days
- 02Dynamic VRAM management auto-unloads idle models and downsizes KV-cache to reclaim GPU memory on the fly
- 03Model pinning prevents chosen models from being evicted from GPU memory
- 04Automatic context sizing derives context length from available memory and model architecture, removing manual tuning
- 05A provider-agnostic cloud offload backend supports Fireworks, OpenRouter, Together, and OpenAI alongside local models
- 06An MCP gateway (POST /mcp) exposes five tools: model listing, chat, audio transcription, image generation, and multimodal omni
- 07Platform support added for NVIDIA GB10 (Blackwell) arm64 CUDA, TheRock ROCm on Windows, and ROCm for Radeon 840M/860M iGPUs
Topics
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 18, 2026 · 10:40 UTC. How this works →