Semantic memory MCP server cures Claude Code's context bloat
Kunal Jaiswal built a 767-line Python MCP memory server that replaces flat `.md` documentation files with a semantic knowledge base, eliminating the context-window drain that was degrading Claude Code's performance across a 30+ service home automation stack.
Developers managing large, multi-service codebases with Claude Code can adopt this MCP-based semantic memory pattern to dramatically reduce context-window overhead and prevent the model from re-exploring already-documented knowledge.
Kunal Jaiswal's home automation stack spans five machines running 30+ services — camera monitors, WhatsApp agents, LLM inference pipelines, job scrapers, and diet trackers. He maintained one `.md` file per service, each averaging 200–400 lines and containing architecture decisions, port numbers, credentials, and bug histories. At five files the system worked well; at thirty, Claude Code began re-exploring already-documented code, recommending ports already in use, and missing cross-service dependencies. The root cause was context-window exhaustion: loading ten documentation files for a cross-system task consumed 3,000–4,000 tokens before Claude wrote a single line of code, and Claude sometimes read five files before finding an answer in the sixth.
Jaiswal argues that standard RAG over documentation files fails because the retrieval unit is the wrong size — a full `.md` file wastes context with irrelevant sections, while chunking it breaks structural relationships between sections. His solution is a 767-line Python server that exposes both an MCP/SSE interface (for Claude Code) and a REST interface (for other agents) on port 8042. The server uses `sentence-transformers` with the `all-MiniLM-L6-v2` model to generate 384-dimensional embeddings, compresses them to 4-bit with TurboQuant for a smaller in-memory footprint, and stores content, tags, category, agent ID, and timestamps in MySQL. Search is cosine similarity over the compressed vectors.
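The search path described above (embed, compress to 4 bits, rank by cosine similarity) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the server's actual code: the uniform scalar quantizer below is a simple stand-in and is not TurboQuant's algorithm, the 4-dimensional toy vectors stand in for the 384-dimensional `all-MiniLM-L6-v2` embeddings, and the names `quantize_4bit`, `dequantize`, `cosine`, and `search` are hypothetical.

```python
import math

def quantize_4bit(vec):
    """Uniform scalar quantization to 16 levels (codes 0..15).

    Assumption: a simple stand-in for illustration, NOT TurboQuant's method.
    Returns the codes plus the offset and step needed to decode them.
    """
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Approximate reconstruction of the original vector from 4-bit codes."""
    return [lo + c * step for c in codes]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, store, top_k=3):
    """Rank stored memories by cosine similarity to the query vector."""
    scored = [
        (cosine(query_vec, dequantize(*entry["q"])), entry["content"])
        for entry in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy data: hypothetical 4-dimensional "embeddings" for two memories.
store = [
    {"content": "camera monitor notes", "q": quantize_4bit([0.9, 0.1, 0.0, 0.2])},
    {"content": "diet tracker notes", "q": quantize_4bit([0.0, 0.8, 0.3, 0.1])},
]
results = search([1.0, 0.0, 0.0, 0.1], store, top_k=1)
```

In a real deployment the query vector would come from the same embedding model used at save time, so that query and memory vectors live in the same space.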
The critical behavioral change came from a "Memory-First Rule" added to `CLAUDE.md`. It mandates that Claude always call `memory_search` before reading any file, exploring code, or launching an Explore agent, with local `.md` files as a fallback only when the memory server is unreachable. Jaiswal imported his 30 existing documentation files by splitting them on `##` headers and converting each section into a self-contained memory entry: 30 files became 200+ discrete facts, each tagged with its source file and category. As a result, a lookup that previously required reading multiple files now completes in under 100 ms via a single semantic search, keeping the context window free for actual coding work.
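The import step, splitting each documentation file on `##` headers into tagged entries, might look roughly like the sketch below. The function name `split_into_memories`, the entry field names, and the choice to skip any preamble before the first `##` header are assumptions made for illustration; the source does not describe the actual import script.

```python
import re

def split_into_memories(markdown_text, source_file, category):
    """Split a Markdown document on level-2 (`## `) headers.

    Each section becomes one entry tagged with its source file and category.
    Assumption: text before the first `## ` header (title, intro) is skipped.
    """
    entries = []
    # Zero-width split at every line that starts with "## ", so each header
    # stays attached to its own section. "### " subsections are not split.
    for section in re.split(r"(?m)^(?=## )", markdown_text):
        section = section.strip()
        if not section.startswith("## "):
            continue  # preamble or empty fragment
        header, _, body = section.partition("\n")
        title = header[3:].strip()
        entries.append({
            "content": f"{title}: {body.strip()}",
            "tags": [source_file],
            "category": category,
        })
    return entries

# Hypothetical sample document for demonstration.
doc = """# Camera monitor
Intro text.

## Architecture
Runs as a single process.
### Details
Subsections stay with their parent section.

## Known bugs
Restart loop on network loss.
"""
entries = split_into_memories(doc, "camera.md", "infrastructure")
```

Prefixing each entry's content with its section title is one way to keep an entry self-contained once it is separated from the rest of its file.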
Key facts
- The home automation codebase grew to 30+ services across 5 machines, each with its own `.md` documentation file averaging 200–400 lines.
- Loading 10 documentation files for a cross-system task consumed 3,000–4,000 tokens before Claude wrote any code.
- The MCP memory server is 767 lines of Python, exposing 6 tools: `memory_search`, `memory_save`, `memory_list`, `memory_update`, `memory_delete`, and `memory_stats`.