Apr 22, 2026 · 2 min read · Applications & Use Cases

CrabTrap LLM-as-a-judge proxy tested in production

Juan Torchia ran the CrabTrap Rust HTTP proxy — which uses an LLM to judge another LLM's responses before execution — in production for 72 hours and found that the judge itself can be manipulated by adversarial content, undermining its core security premise.

Dev.to #llm · Juan Torchia

Composite score: 5.0 out of 10 (weights: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%)

Why it matters

Developers building production agents should treat LLM-as-a-judge proxies like CrabTrap as observability and logging tools rather than security boundaries, and must account for judge timeouts, missing conversation context, and adversarial manipulation before relying on them to block harmful actions.

Summary: our read of the original

Juan Torchia describes deploying CrabTrap — a Rust HTTP proxy that sits between an agent and its upstream API, routing every response through a separate "judge" LLM to evaluate safety before execution — after his production agent nearly executed an `rm -rf` command embedded in a markdown code block. His setup is a Next.js/TypeScript agent on Railway, backed by PostgreSQL, making calls to Claude through the Anthropic API. He configured CrabTrap as a Docker sidecar with `claude-haiku-3` as the judge model (chosen to control costs), a 3,000 ms judge timeout, block-only mode, and full request/response logging enabled.

After 72 hours, Torchia's central finding is that using an LLM to judge another LLM's output reproduces the exact trust problem it is meant to solve. In a direct test, he sent a prompt injection payload concealed inside an HTML comment; the Haiku judge returned a safety score of 0.71 and did not block it. He frames this as a structural limitation — the judging model can be manipulated by the content it is judging — and contrasts it with Emacs's `safe-local-variables` whitelist, which is built and audited by humans and cannot be overwritten at runtime by the content it processes.
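The contrast with a human-audited whitelist can be illustrated with a minimal sketch. It is not taken from Torchia's article or from CrabTrap: the names `ALLOWED_COMMANDS`, `SHELL_METACHARS`, and `is_allowed` are hypothetical, and the allowlist contents are placeholders. The point is only that a static check of this kind is fixed at deploy time, so text inside the response (an HTML comment, for instance) has no way to change what it permits.

```python
import shlex

# Hypothetical allowlist, reviewed by humans and fixed at deploy time.
# The content being checked cannot add entries to it.
ALLOWED_COMMANDS = frozenset({"ls", "cat", "pwd"})

# Characters that chain, redirect, or substitute commands in a shell.
SHELL_METACHARS = frozenset(";|&`$<>\n")


def is_allowed(command_line: str) -> bool:
    """Return True only for a single allowlisted program with no shell metacharacters."""
    if any(ch in SHELL_METACHARS for ch in command_line):
        return False
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess at the intent.
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS
```

A check like this is far less flexible than a judge model and will reject many harmless commands, which is the trade-off: its behavior depends only on the list, not on how persuasive the input is.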

Beyond the core trust issue, Torchia catalogued four concrete production problems. First, when the Anthropic API rate-limited the judge at 2am, three consecutive timeouts caused his agent to be fully blocked for three minutes, exposing the need for a circuit breaker or non-blocking fallback policy. Second, because the judge evaluates each response in isolation with no conversation context, it blocked legitimate steps in multi-step tasks — he recorded five such false positives in the 72-hour window. Third, `log_all = true` writes the full plaintext content of every request and response to disk, creating an unencrypted audit log that poses a data exposure risk for agents handling sensitive information. Fourth, false positives in an agent with memory can corrupt the context of an entire session, causing cascading failures. He concludes that CrabTrap has legitimate uses — structured logging, pattern detection, traffic visibility — but that framing it as a security guarantee overstates what the mechanism can deliver against an adversary who understands the system.
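The circuit breaker mentioned for the timeout problem could look roughly like the sketch below. It is an illustration under assumptions, not CrabTrap's behavior or Torchia's code: the class name `JudgeCircuitBreaker`, the 60-second cooldown, and the `judge` callable (assumed to return `True` for "allow" and raise `TimeoutError` on a timeout) are all hypothetical. After a configurable number of consecutive timeouts, the breaker stops calling the judge for a cooldown period and returns a caller-chosen fallback verdict instead.

```python
import time
from typing import Callable, Optional


class JudgeCircuitBreaker:
    """Skip the judge for a cooldown period after repeated consecutive timeouts."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while the breaker is tripped and the cooldown has not elapsed."""
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._cooldown:
            # Cooldown elapsed: reset so the next call tries the judge again.
            self._opened_at = None
            self._consecutive_failures = 0
            return False
        return True

    def call(self, judge: Callable[[str], bool], response: str, fallback: bool) -> bool:
        """Return the judge's verdict, or `fallback` if it times out or the breaker is open."""
        if self.is_open():
            return fallback
        try:
            verdict = judge(response)
        except TimeoutError:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback
        self._consecutive_failures = 0
        return verdict
```

Whether `fallback` should allow or block is a policy decision. Passing `fallback=True` keeps the agent running when the judge is unavailable, which fits treating the judge as an observability layer. Passing `fallback=False` preserves blocking at the cost of the outage described above.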

Key facts

01 CrabTrap is an HTTP proxy written in Rust that routes every agent response through a 'judge' LLM before allowing execution, blocking responses the judge deems unsafe.
02 Torchia ran CrabTrap for 72 hours in front of a Next.js/TypeScript agent on Railway calling Claude via the Anthropic API, using `claude-haiku-3` as the judge model.
03 A prompt injection payload hidden in an HTML comment received a safety score of 0.71 from the judge and was not blocked, demonstrating the judge can be manipulated by the content it evaluates.
04 Three consecutive judge timeouts caused by an Anthropic rate limit at 2am blocked the agent entirely for three minutes.
05 The judge evaluates each response in isolation with no conversation context, producing false positives on multi-step tasks; five occurred in 72 hours.
06 `log_all = true` saves the full plaintext content of every request and response to disk, creating an unencrypted audit log.
07 Torchia argues the approach has legitimate uses for logging and visibility but cannot provide the security guarantees implied by its 'judge' framing.
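One way to reduce the plaintext-logging exposure listed above is to redact obvious secrets before anything reaches disk. The sketch below is an assumption-laden illustration, not a CrabTrap feature: `REDACTION_PATTERNS` and `redact` are hypothetical names, and the two patterns (strings shaped like `sk-...` keys, and email addresses) are placeholders that a real deployment would need to replace with its own list.

```python
import re

# Illustrative patterns only; these do not cover every kind of sensitive data.
REDACTION_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9_-]{8,}"), "[REDACTED_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder before logging."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Pattern-based redaction only catches what it has been told to look for, so it complements encrypting or access-controlling the log rather than replacing it.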

Topics

#agent-safety #production-agents #llm-judge #proxy-architecture #code-execution

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC.
