The eval separates two effects of the Self-Inspect MCP: it reliably makes silent agent assumptions visible mid-task, but it does not improve correctness when the task is already well-specified. This clarifies where the tool does and does not add value.
At scale (20+ tools), description verbosity costs roughly 4x more context tokens than extra parameters, making description trimming the highest-leverage optimization for large MCP servers.
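One way to check this kind of split on your own server is to estimate how many context tokens go to descriptions versus parameter schemas. The sketch below is illustrative only: the tool definitions are hypothetical, the 4-characters-per-token heuristic is an assumption rather than a real tokenizer, and it does not reproduce the 4x figure above.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~4 characters per token (a heuristic, not a tokenizer)."""
    return (len(text) + 3) // 4  # ceiling division


def context_cost(tools: list[dict]) -> dict:
    """Split the estimated token cost of tool definitions into descriptions and parameters."""
    description_tokens = sum(estimate_tokens(t.get("description", "")) for t in tools)
    parameter_tokens = sum(
        estimate_tokens(json.dumps(t.get("inputSchema", {}), separators=(",", ":")))
        for t in tools
    )
    return {
        "description_tokens": description_tokens,
        "parameter_tokens": parameter_tokens,
    }


# Hypothetical server with 20 tools, each with a 400-character description
# and a single string parameter. Replace with your real tool list.
tools = [
    {
        "name": f"tool_{i}",
        "description": "x" * 400,
        "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
    }
    for i in range(20)
]

cost = context_cost(tools)
print(cost)
```

Running this against a real tool list (with a real tokenizer swapped in for `estimate_tokens`) shows whether descriptions or parameter schemas dominate the budget before you decide what to trim.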