Pi agent observability dashboard compares Markdown, HTML, and visual specs
IndyDevDan runs three Gemini 3.5 Flash Pi coding agents against the same prompt using three different spec formats — Markdown, HTML, and enhanced visual HTML — and measures the performance, speed, and cost trade-offs live through a Pi agent observability dashboard.
Score breakdown
Measure spec format impact concretely — this experiment shows that switching between Markdown, HTML, and visual HTML specs produces measurable token-cost differences that only an observability layer can surface.
IndyDevDan's video centers on a live experiment using the Pi coding agent platform, running three Gemini 3.5 Flash agents against an identical prompt but with three different specification formats: Markdown, HTML, and enhanced visual HTML. The core argument is that without observability tooling, agent cost and quality are unknowable — "you're gambling with tokens." The accompanying code is published at the pi-agent-observability GitHub repository.
The experiment is motivated by two recent industry developments: Anthropic's viral post on the unreasonable effectiveness of HTML for agent specs, and OpenAI's release of GPT Image 2, a benchmark-gapping image model. Rather than accepting either as gospel, the video tests all three spec formats head-to-head and watches the results through a Pi agent observability dashboard. A key data point surfaced during the run: the Markdown agent consumed more tokens than the HTML agent in one trial, which the video presents as evidence that "more useful tokens beat fewer useful tokens" — and that variance like this is only visible through measurement.
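To make the comparison concrete, here is a minimal sketch of how per-run metrics could be tabulated across the three spec formats. It is not taken from the pi-agent-observability repository: `AgentRun`, `cost_usd`, `summarize`, the field names, and the per-million-token prices are all hypothetical, and using a pass rate as the "performance" corner of the triangle is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AgentRun:
    """Metrics for one agent run (hypothetical schema, not the dashboard's)."""
    spec_format: str      # e.g. "markdown", "html", "visual_html"
    input_tokens: int
    output_tokens: int
    duration_s: float     # wall-clock time: the "speed" corner
    checks_passed: int    # assumed stand-in for the "performance" corner
    checks_total: int


def cost_usd(run: AgentRun, in_price_per_m: float, out_price_per_m: float) -> float:
    """Token cost in USD; the caller supplies real per-million-token prices."""
    weighted = run.input_tokens * in_price_per_m + run.output_tokens * out_price_per_m
    return weighted / 1_000_000


def summarize(runs: List[AgentRun], in_price_per_m: float,
              out_price_per_m: float) -> List[Dict[str, object]]:
    """Return one row per run, cheapest first, covering cost, speed, and performance."""
    rows = []
    for run in runs:
        rows.append({
            "spec_format": run.spec_format,
            "total_tokens": run.input_tokens + run.output_tokens,
            "cost_usd": cost_usd(run, in_price_per_m, out_price_per_m),
            "duration_s": run.duration_s,
            # Guard against division by zero when no checks were defined.
            "pass_rate": run.checks_passed / run.checks_total if run.checks_total else 0.0,
        })
    return sorted(rows, key=lambda row: row["cost_usd"])
```

Feeding `summarize` one `AgentRun` per spec format puts all three corners of the triangle in a single table, so a result like the Markdown agent out-spending the HTML agent shows up in the `cost_usd` ordering rather than going unnoticed.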
Key facts
- 01 Three Gemini 3.5 Flash Pi coding agents are run against the same prompt with three spec formats: Markdown, HTML, and enhanced visual HTML.
- 02 Results are observed live through a Pi agent observability dashboard.
- 03 The experiment measures a 'trade-off triangle' of performance, speed, and cost.
- 04 In one run, the Markdown agent burned more tokens than the HTML agent.
- 05 The video references Anthropic's post on the 'unreasonable effectiveness of HTML' and OpenAI's GPT Image 2 as motivating context.
- 06 Source code is published at the pi-agent-observability GitHub repository.
- 07 The video's central thesis is that agent observability is the control surface for engineering: without it, token spend is ungoverned.
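One way to read "control surface" is as a budget check over measured usage. The sketch below is an illustration under that assumption, not something shown in the video or the repository: `BudgetVerdict`, `check_budget`, and the idea of a single shared token budget are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BudgetVerdict:
    """Outcome of comparing one agent's measured token usage to a budget."""
    agent: str
    tokens_used: int
    budget: int
    over_budget: bool
    fraction_used: float


def check_budget(usage_by_agent: Dict[str, int], budget: int) -> List[BudgetVerdict]:
    """Flag agents whose measured usage exceeds the budget, heaviest spender first."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    verdicts = [
        BudgetVerdict(
            agent=agent,
            tokens_used=used,
            budget=budget,
            over_budget=used > budget,
            fraction_used=used / budget,
        )
        for agent, used in usage_by_agent.items()
    ]
    return sorted(verdicts, key=lambda v: v.tokens_used, reverse=True)
```

The check only works on numbers that were actually recorded, which is the point of the thesis: without a measurement layer supplying `usage_by_agent`, there is nothing to govern.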
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 7, 2026 · 12:45 UTC.