HTML token tax costs agents 7x more than plain text
Alex Spinov measured that raw HTML fed to AI agents carries a ~7x token overhead versus plain text, with 85–86% of tokens being markup the model never needs, and shares a roughly 40-line Python fix built on the standard library's HTMLParser (with tiktoken used for token counting).
Strip HTML to plain text before passing web content to agents to cut token costs by ~7x and reclaim context window space for content the model actually reasons over.
Alex Spinov benchmarked the token cost of passing raw HTML directly into an AI agent's context, finding a consistent ~7x overhead across three pages of very different sizes. The Wikipedia "Web scraping" page (165 KB) consumed 48,703 tokens raw but only 7,280 tokens as stripped text; the Wikipedia "Large language model" page (686 KB) went from 221,622 tokens down to 30,988; and even the minimal `example.com` (528 bytes) dropped from 152 to 22 tokens. All counts used the `o200k_base` tokenizer (the one GPT-4o uses) via `tiktoken`. The markup overhead was roughly proportional across all three sizes, suggesting the tax is structural rather than page-specific.
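The ratios quoted above can be re-derived from the reported counts. The following is a minimal sketch, not the author's script: the `REPORTED` table and the `overhead` and `count_tokens` helpers are names introduced here for illustration. `count_tokens` assumes the third-party `tiktoken` package is installed and shows how a count with the `o200k_base` encoding would typically be obtained.

```python
# Token counts as reported in the summary: (raw HTML, stripped text).
REPORTED = {
    "example.com (528 bytes)": (152, 22),
    "Wikipedia 'Web scraping' (165 KB)": (48_703, 7_280),
    "Wikipedia 'Large language model' (686 KB)": (221_622, 30_988),
}


def overhead(raw_tokens: int, clean_tokens: int) -> tuple[float, float]:
    """Return (raw/clean ratio, fraction of raw tokens removed by stripping)."""
    ratio = raw_tokens / clean_tokens
    saved = 1 - clean_tokens / raw_tokens
    return ratio, saved


def count_tokens(text: str) -> int:
    """Count tokens with o200k_base. Requires the third-party tiktoken package."""
    import tiktoken  # imported lazily so the arithmetic above runs without it

    return len(tiktoken.get_encoding("o200k_base").encode(text))


if __name__ == "__main__":
    for page, (raw, clean) in REPORTED.items():
        ratio, saved = overhead(raw, clean)
        print(f"{page}: {ratio:.1f}x, {saved:.0%} fewer tokens")
```

All three pages land between 6.7x and 7.2x, which is the basis for the "roughly proportional" observation.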
At GPT-4o input pricing ($2.50/1M tokens as of June 2026), one read of the largest page tested (the 686 KB "Large language model" article) costs about $0.55 raw versus about $0.078 clean. That difference compounds quickly for agents crawling many pages in a loop, and the raw markup also fills the context window with noise that displaces useful tokens.
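The per-page cost follows directly from the token counts and the quoted price. The sketch below reproduces that arithmetic for the 686 KB page; `read_cost_usd` and the 1,000-page crawl size are hypothetical, chosen only to show how the gap scales. The clean figure computes to roughly $0.0775, which the summary reports as $0.078.

```python
PRICE_PER_MILLION_USD = 2.50  # GPT-4o input price quoted in the summary


def read_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input cost in USD of sending `tokens` tokens to the model once."""
    return tokens * price_per_million / 1_000_000


RAW_TOKENS = 221_622    # reported raw-HTML count for the 686 KB page
CLEAN_TOKENS = 30_988   # reported stripped-text count for the same page

raw_cost = read_cost_usd(RAW_TOKENS)
clean_cost = read_cost_usd(CLEAN_TOKENS)

PAGES = 1_000  # hypothetical crawl size, not a figure from the article
saved_over_crawl = (raw_cost - clean_cost) * PAGES

if __name__ == "__main__":
    print(f"raw:   ${raw_cost:.4f} per read")
    print(f"clean: ${clean_cost:.4f} per read")
    print(f"saved over {PAGES} reads of pages this size: ${saved_over_crawl:.2f}")
```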
The proposed fix is a self-contained Python script using `HTMLParser` from the standard library plus `tiktoken`. It skips tags in a `SKIP` set (`script`, `style`, `head`, `noscript`, `svg`, `template`) and joins the remaining text. The post also covers a TLS edge case encountered behind a VPN, where `urllib` wraps `ssl.SSLError` inside `urllib.error.URLError`, and the script fails closed rather than silently disabling certificate verification. Spinov is candid about the approach's limits: it loses table structure and link targets, doesn't handle JS-rendered SPAs, and retains nav/footer text — a readability pass would cut further but adds a dependency.
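As an illustration of the approach described above, here is a minimal standard-library sketch. It is not Spinov's script: the names `TextExtractor`, `html_to_text`, `is_tls_failure`, and `fetch` are assumptions, as are the choices to join text chunks with single spaces and to report a TLS failure as a `RuntimeError`. Only the `SKIP` set and the fail-closed behavior come from the summary.

```python
import ssl
import urllib.error
import urllib.request
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (the set named in the summary).
SKIP = {"script", "style", "head", "noscript", "svg", "template"}


class TextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a SKIP element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # > 0 while inside any SKIP element
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def is_tls_failure(err: urllib.error.URLError) -> bool:
    """urllib wraps ssl.SSLError in URLError; the original error is in .reason."""
    return isinstance(err.reason, ssl.SSLError)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page, leaving certificate verification at its default (on)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except urllib.error.URLError as err:
        if is_tls_failure(err):
            # Fail closed: report the problem instead of retrying unverified.
            raise RuntimeError(f"TLS verification failed for {url}") from err
        raise
```

A call such as `html_to_text(fetch("https://example.com"))` would then yield the text to pass to the agent. Because the depth counter relies on matching end tags, a page that omits the closing tag of a skipped element would lose the text that follows it, which is one more limit to add to those listed above.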
Key facts
- 01 The Wikipedia 'Web scraping' page (165 KB) cost 48,703 tokens as raw HTML vs. 7,280 tokens as stripped text, a 6.7x reduction (85% less).
- 02 The Wikipedia 'Large language model' page (686 KB) measured 221,622 tokens raw vs. 30,988 tokens clean, a 7.2x reduction (86% less).
- 03 The markup overhead ratio held steady across all three tested pages, from 528 bytes to 686 KB.
- 04 At GPT-4o input pricing of $2.50/1M tokens (June 2026), one read of the 686 KB page costs about $0.55 raw vs. about $0.078 clean.
- 05 All token counts used the o200k_base tokenizer via tiktoken, the same tokenizer GPT-4o uses.
- 06 The fix is ~40 lines of Python using only HTMLParser from the standard library plus tiktoken, with no external APIs.
- 07 The approach has known limits: it loses table/link structure, fails on JS-rendered SPAs, and retains nav/footer text.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 09:19 UTC.