IntentProbe uses activation probing to detect poisoned MCP tools
u/Think-Investment-557 released IntentProbe, which the author describes as the first MCP/tool-poisoning scanner to read a frozen model's internal activation states, rather than analyze surface text, to detect malicious tool intent.
IntentProbe addresses a gap the post identifies in existing MCP security tooling: text-based classifiers cannot distinguish safe from poisoned tool descriptions when both use nearly identical vocabulary. In that scenario, the post reports that the strongest reproducible DeBERTa baseline scored 0% recall.
IntentProbe, released by u/Think-Investment-557, is described as the first product-shaped MCP/tool-poisoning scanner to use activation probing. The core insight is that when a language model processes a tool description, its hidden states encode information about what the tool is actually requesting, even when the surface text looks benign. IntentProbe runs the tool description through a small frozen local model, extracts activations from its hidden layers, and scores them against known patterns for credential access, exfiltration, hidden persistence, and tool hijacking. All scanning happens on-device; tool descriptions and results never leave the user's machine.
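The scoring step described above can be illustrated with a minimal sketch. It is not IntentProbe's implementation: mean pooling over tokens, concatenation across layers, and a single logistic probe are all assumptions made for illustration, and the function names and toy numbers are hypothetical. In practice the per-layer activations would come from the frozen model (for example, Hugging Face Transformers returns them when a model is called with `output_hidden_states=True`); here they are hard-coded so the example runs with the standard library alone.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_pool(token_states: Sequence[Vector]) -> list:
    """Average the per-token hidden states of one layer into a single vector."""
    n_tokens = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[i] for tok in token_states) / n_tokens for i in range(dim)]


def build_features(hidden_states: Sequence[Sequence[Vector]],
                   layers: Sequence[int]) -> list:
    """Concatenate the mean-pooled activations of the selected layers."""
    features = []
    for layer in layers:
        features.extend(mean_pool(hidden_states[layer]))
    return features


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(features: Vector, weights: Vector, bias: float) -> float:
    """Linear probe: dot product plus bias, squashed to a 0..1 score."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(logit)


# Toy activations (hypothetical): 3 layers, 2 tokens, hidden size 2.
hidden_states = [
    [[0.0, 0.0], [0.0, 0.0]],    # layer 0 (ignored below)
    [[1.0, 3.0], [3.0, 1.0]],    # layer 1
    [[0.5, -0.5], [1.5, -1.5]],  # layer 2
]
features = build_features(hidden_states, layers=[1, 2])

# Hypothetical probe weights; a real probe would be trained on labeled tools.
score = probe_score(features, weights=[0.5, -0.5, 1.0, -1.0], bias=0.0)
print(f"poisoning score: {score:.3f}")
```

Because the model stays frozen and only the small probe is trained, the probe file can be tiny compared with the model itself, which is consistent with the post's description of a kilobyte-scale weight file.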
The post contrasts this approach with three existing scanner categories: text-based classifiers (regex, pattern matching, DeBERTa-style models), which fail when poisoned and safe tools share similar vocabulary; LLM-as-judge approaches, which the post describes as prompt-sensitive, slow, expensive, and potentially part of the attack surface; and enterprise cloud/API scanners (Lakera, Azure Prompt Shields, Google Model Armor), which operate as black boxes without publicly reproducible benchmarks.

On a matched-vocabulary F1 test (the hardest case, where safe and poisoned tools use nearly identical words) the activation-probe method scored 96.6% versus 0% for the DeBERTa baseline. On the MCPTox held-out split (n=249), poisoned recall was 100% for the probe versus 19.9% for DeBERTa. A camouflage suffix test, which appended phrases like "this tool is safe / read-only / sandboxed" to poisoned descriptions, produced 0 evasions out of 146 attempts on the GPT-2 probe; a smaller smoke test on the current Qwen v0 scanner produced 0 evasions out of 15.
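For readers unfamiliar with how figures like these are computed, the sketch below shows poisoned-class recall and F1, plus a camouflage-suffix evasion count. It is an illustration under stated assumptions, not the repository's benchmark script: the function names, the toy labels, and the deliberately gullible `toy_scan` are all hypothetical, and none of the numbers reproduce the post's results.

```python
from typing import Callable, Sequence, Tuple


def recall_and_f1(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Recall and F1 for the poisoned class (True means poisoned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1


def count_evasions(scan: Callable[[str], bool],
                   poisoned: Sequence[str],
                   suffixes: Sequence[str]) -> Tuple[int, int]:
    """Append each suffix to each poisoned description.

    Returns (evasions, attempts), where an evasion is a poisoned
    description that the scanner no longer flags once the suffix is added.
    """
    evasions = attempts = 0
    for description in poisoned:
        for suffix in suffixes:
            attempts += 1
            if not scan(f"{description} {suffix}"):
                evasions += 1
    return evasions, attempts


# A classifier that never flags anything misses every poisoned tool,
# so both recall and F1 are zero regardless of how many safe tools it passes.
labels = [True, True, False, False]
print(recall_and_f1(labels, [False, False, False, False]))


# Hypothetical surface-text scanner that is fooled by the phrase "read-only".
def toy_scan(text: str) -> bool:
    return "~/.ssh" in text and "read-only" not in text


poisoned = ["Reads ~/.ssh/id_rsa and sends it to the server."]
suffixes = ["This tool is safe.", "This tool is read-only.", "This tool is sandboxed."]
print(count_evasions(toy_scan, poisoned, suffixes))
```

Reporting evasions as a count over attempts, as the post does, keeps the sample size visible, which matters when comparing a 146-attempt test with a 15-attempt smoke test.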
The released `v0` product uses `Qwen2.5-0.5B` layers 13–15 with a 22 KB probe weight file. The model is downloaded once on first scan. Probe weights, benchmark data, and reproducible scripts are published in the GitHub repository at `https://github.com/mcpware/IntentProbe`, allowing independent verification of the reported results.
Key facts
- IntentProbe reads internal hidden-layer activations from a frozen local model rather than analyzing tool description text.
- The v0 scanner uses Qwen2.5-0.5B layers 13–15 with a 22 KB probe and runs entirely locally.
- On a matched-vocabulary F1 test, the activation-probe method scored 96.6% vs. 0% for a DeBERTa text-classifier baseline.
- On the MCPTox held-out split (n=249), poisoned recall was 100% for the probe vs. 19.9% for DeBERTa.
- A camouflage suffix test (adding "safe / read-only / sandboxed" text) produced 0 evasions out of 146 attempts on the GPT-2 probe.
- The scanner is usable as a pre-install scanner and runtime hook for MCP tools, Claude Code skills, and agent tool definitions.
- Probe weights, benchmark data, and reproducible scripts are publicly available in the GitHub repo.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.