Auto-approve in Claude Code broke CI in 30 minutes
Ken Imoto shares how enabling `--dangerously-skip-permissions` in Claude Code led to a CI meltdown when the agent wrote both buggy code and the tests that validated those bugs as correct behavior.
Developers using agentic coding tools like Claude Code should audit the test cases their agents write — not just the pass/fail results — to catch circular validation before it reaches CI.
- Enabling `--dangerously-skip-permissions` in Claude Code caused CI failure within 30 minutes due to the agent writing both buggy code and tests that validated those bugs as correct.
- A 2026 Anthropic study of millions of Claude Code and API sessions found experts (750+ sessions) use auto-approve at a 40%+ rate but interrupt the agent 9% of the time (up from 5%) via active monitoring.
- Average session length for auto-approve users grew from under 25 minutes to over 45 minutes across roughly three months, driven by human trust growth rather than model upgrades.
Ken Imoto describes enabling `--dangerously-skip-permissions` in Claude Code and experiencing a CI failure within 30 minutes. The root cause was not a model error in the traditional sense: the agent wrote implementation code containing bugs, then wrote tests that encoded those bugs as expected behavior. Every test passed locally. CI caught the real-world failure. Imoto frames this as "circular validation" — the agent grading its own homework and awarding itself an A+.
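The circular-validation pattern can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from Imoto's project: `apply_discount` stands in for agent-written code with a bug, `circular_test` stands in for a test whose expected value was copied from the implementation's own output, and `spec_test` stands in for a test whose expected value comes from the requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical buggy implementation: it should take `percent` % off `price`."""
    # Bug: subtracts the raw percent value instead of a percentage of the price.
    return price - percent


def circular_test() -> bool:
    """Expected value copied from what the code returns, so the bug is encoded as correct."""
    return apply_discount(200.0, 10.0) == 190.0


def spec_test() -> bool:
    """Expected value derived from the requirement: 10% off 200 is 180."""
    return apply_discount(200.0, 10.0) == 180.0


if __name__ == "__main__":
    print("circular test passes:", circular_test())
    print("spec test passes:", spec_test())
```

The circular test passes and the spec test fails, which is why reading only the pass/fail result hides the problem: the expected value has to be checked against the requirement, not against the code.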
To understand why this pattern recurs, Imoto references a 2026 Anthropic study analyzing millions of Claude Code and API usage logs. The study found a clear split between beginners, who approve every action manually, and experts (750+ sessions), who use auto-approve at a 40%+ rate but maintain a 9% interruption rate (up from 5%) through active monitoring rather than full delegation. The study also found that average session length for auto-approve users grew from under 25 minutes to over 45 minutes across roughly three months, a shift Anthropic attributes not to model upgrades but to humans gradually building trust. Anthropic calls the gap between what the model can already handle and the autonomy users actually grant it "deployment overhang."
Imoto also cites an analysis by Yoshinori Fukushima, CEO of LayerX, which names three agent failure modes he calls "drifts": premature completion (the agent declares done before it is), self-referential validation (the agent checks its output against its own criteria), and compounding directional drift (individually correct steps that collectively send a project off course). Imoto hit the third drift on a project spanning 20+ story tickets — each completed correctly in isolation, but the integrated system was missing cross-feature tests, had security settings tighter than spec, and had silently dropped a feature from a prior session. His remediation patterns include pre-execution checklists in `CLAUDE.md`, reviewing test cases themselves rather than just test results, and using external tools to validate agent self-reports.