CrabTrap LLM-as-a-judge proxy tested in production
Juan Torchia ran the CrabTrap Rust HTTP proxy — which uses an LLM to judge another LLM's responses before execution — in production for 72 hours and found that the judge itself can be manipulated by adversarial content, undermining its core security premise.
Developers building production agents should treat LLM-as-a-judge proxies like CrabTrap as observability and logging tools rather than security boundaries, and must account for judge timeouts, missing conversation context, and adversarial manipulation before relying on them to block harmful actions.
- CrabTrap is an HTTP proxy written in Rust that routes every agent response through a 'judge' LLM before allowing execution, blocking responses the judge deems unsafe.
- Torchia ran CrabTrap for 72 hours in front of a Next.js/TypeScript agent on Railway calling Claude via the Anthropic API, using `claude-haiku-3` as the judge model.
- A prompt injection payload hidden in an HTML comment received a safety score of 0.71 from the judge and was not blocked, demonstrating the judge can be manipulated by the content it evaluates.
Juan Torchia describes deploying CrabTrap — a Rust HTTP proxy that sits between an agent and its upstream API, routing every response through a separate "judge" LLM to evaluate safety before execution — after his production agent nearly executed an `rm -rf` command embedded in a markdown code block. His setup is a Next.js/TypeScript agent on Railway, backed by PostgreSQL, making calls to Claude through the Anthropic API. He configured CrabTrap as a Docker sidecar with `claude-haiku-3` as the judge model (chosen to control costs), a 3,000 ms judge timeout, block-only mode, and full request/response logging enabled.
After 72 hours, Torchia's central finding is that using an LLM to judge another LLM's output reproduces the exact trust problem it is meant to solve. In a direct test, he sent a prompt injection payload concealed inside an HTML comment; the Haiku judge returned a safety score of 0.71 and did not block it. He frames this as a structural limitation — the judging model can be manipulated by the content it is judging — and contrasts it with Emacs's `safe-local-variables` whitelist, which is built and audited by humans and cannot be overwritten at runtime by the content it processes.
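The contrast with a human-audited whitelist can be illustrated with a deterministic pre-check that runs before any judge model sees the content. The following is a minimal sketch, not part of CrabTrap and not Torchia's method: the names `precheckHiddenComments` and `PrecheckResult` are hypothetical, and a regular expression over HTML comments covers only the one concealment channel described above.

```typescript
// Hypothetical pre-check, not CrabTrap functionality.
// Matches HTML comments, including ones that span multiple lines.
const HTML_COMMENT = /<!--[\s\S]*?-->/g;

interface PrecheckResult {
  hiddenComments: string[]; // every HTML comment found in the content
  visibleText: string;      // the content with those comments removed
}

// Extracts HTML comments so they can be logged or rejected by a fixed rule,
// and returns the remaining text for any downstream evaluation.
function precheckHiddenComments(content: string): PrecheckResult {
  const hiddenComments: string[] = content.match(HTML_COMMENT) ?? [];
  const visibleText = content.replace(HTML_COMMENT, "");
  return { hiddenComments, visibleText };
}
```

Because the rule is fixed in code, the content being inspected cannot argue it into a different verdict. The trade-off is that it catches nothing beyond the pattern it was written for.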
Beyond the core trust issue, Torchia catalogued four concrete production problems. First, when the Anthropic API rate-limited the judge at 2am, three consecutive timeouts caused his agent to be fully blocked for three minutes, exposing the need for a circuit breaker or non-blocking fallback policy. Second, because the judge evaluates each response in isolation with no conversation context, it blocked legitimate steps in multi-step tasks — he recorded five such false positives in the 72-hour window. Third, `log_all = true` writes the full plaintext content of every request and response to disk, creating an unencrypted audit log that poses a data exposure risk for agents handling sensitive information. Fourth, false positives in an agent with memory can corrupt the context of an entire session, causing cascading failures. He concludes that CrabTrap has legitimate uses — structured logging, pattern detection, traffic visibility — but that framing it as a security guarantee overstates what the mechanism can deliver against an adversary who understands the system.
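The circuit breaker called for in the first problem could look roughly like the sketch below. It is an illustration under assumptions, not CrabTrap functionality: the class name `JudgeCircuitBreaker`, the failure threshold, and the cooldown are all hypothetical, and the caller still has to decide what happens to traffic while the judge is skipped (for example, pass it through and log it).

```typescript
// Hypothetical sketch of a circuit breaker around judge calls.
// Not CrabTrap's API; names and thresholds are assumptions.
class JudgeCircuitBreaker {
  private readonly failureThreshold: number; // consecutive timeouts before opening
  private readonly cooldownMs: number;       // how long the judge is skipped once open
  private consecutiveTimeouts = 0;
  private openedAt: number | null = null;    // timestamp (ms) when the breaker opened

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true if the judge should be called at time `now` (milliseconds).
  allowJudgeCall(now: number): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and try the judge again.
      this.openedAt = null;
      this.consecutiveTimeouts = 0;
      return true;
    }
    return false; // still open: caller applies its fallback policy
  }

  // A successful judge response resets the timeout streak.
  recordSuccess(): void {
    this.consecutiveTimeouts = 0;
  }

  // A judge timeout extends the streak and opens the breaker at the threshold.
  recordTimeout(now: number): void {
    this.consecutiveTimeouts += 1;
    if (this.consecutiveTimeouts >= this.failureThreshold) {
      this.openedAt = now;
    }
  }
}
```

With a threshold of 3, the third consecutive timeout opens the breaker, and `allowJudgeCall` returns false until the cooldown has passed. Passing `now` in explicitly, rather than reading the clock inside the class, keeps the logic easy to test.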