The experiment shows that on adversarial judgment tasks with real stakes and no answer key, model capability gaps are concrete and specific — particularly around whether a model treats the open web as part of its audit scope — rather than abstract or benchmark-only differences.