GCF wire format beats JSON on LLM comprehension across 10 models
u/blackwell-systems built a wire format called GCF and benchmarked 10 LLMs across 3 providers on reading and writing it with no prior training, finding GCF achieved 90.7% average comprehension accuracy vs. 53.6% for JSON.
Score breakdown
The benchmark demonstrates that a novel wire format can be read and written by frontier LLMs with zero training and a minimal primer, while substantially outperforming JSON on both comprehension accuracy and token efficiency at scale.
u/blackwell-systems designed a wire format called GCF and evaluated whether LLMs could read and write it without any prior training. The benchmark sent 10 models an identical raw payload of 500 symbols and 200 edges, then posed 13 extraction questions with no format instructions and no system prompt — just the raw data and a question. The evaluation covered 23 comprehension runs and 28 generation runs, totaling 1,300+ evaluations across Anthropic, OpenAI, and Google.
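The protocol described above (raw payload plus one question, no system prompt, no format instructions) can be sketched as a small scoring loop. This is a hypothetical illustration, not the author's published harness: `ask_model`, `stub_model`, the demo questions, and exact-match grading are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical shape for a question set: (question, expected answer) pairs.
Question = tuple[str, str]


def build_prompt(payload: str, question: str) -> str:
    """Raw data followed by the question, with no format instructions."""
    return f"{payload}\n\n{question}"


def score_run(
    payload: str,
    questions: list[Question],
    ask_model: Callable[[str], str],
) -> float:
    """Return the fraction of questions answered correctly in one run.

    Grading here is exact string match after trimming whitespace, which is
    an assumption; the real benchmark may grade differently.
    """
    correct = 0
    for question, expected in questions:
        answer = ask_model(build_prompt(payload, question))
        if answer.strip() == expected:
            correct += 1
    return correct / len(questions)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers '500'."""
    return "500"


demo_questions: list[Question] = [
    ("How many symbols are defined?", "500"),
    ("How many edges are defined?", "200"),
]
demo_accuracy = score_run("<raw payload>", demo_questions, stub_model)
```

With the stub above, one of the two demo questions matches, so `demo_accuracy` is 0.5. Averaging that per-run fraction over runs and models is one plausible way to arrive at headline figures like those reported here.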
On comprehension, GCF averaged 90.7% accuracy across all 10 models, compared to 68.5% for TOON and 53.6% for JSON. Among Claude models, Sonnet 4.6 scored 100% on every run, while Opus 4.6 and Haiku 4.5 each averaged 96.2%. JSON averaged under 62% across the Claude family at 500 records. The failure modes also differed qualitatively: GCF errors were typically off by 1–2 (misread section header counts), whereas on JSON, Opus spent 143 lines manually enumerating symbols and still returned the wrong answer.
On generation, all three Claude models produced valid, decoder-parseable GCF output with only a 3-line primer: Opus 4.6 scored 5/5 with zero variance across 2 runs, and Sonnet 4.6 and Haiku 4.5 also scored 5/5.
GCF also used far fewer input tokens than JSON at the 500-symbol scale (11,090 vs. 53,341). The full methodology, raw logs, spec, and six implementations are published openly.
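To put the token counts in proportion, the arithmetic can be worked out directly from the two figures reported above. The helper name `token_savings` is introduced here for illustration only; the inputs are the reported 11,090 and 53,341 input tokens.

```python
GCF_TOKENS = 11_090   # reported input tokens for GCF at 500 symbols
JSON_TOKENS = 53_341  # reported input tokens for JSON at 500 symbols


def token_savings(compact: int, baseline: int) -> tuple[float, float]:
    """Return (baseline/compact ratio, fractional reduction vs. baseline)."""
    ratio = baseline / compact
    reduction = 1 - compact / baseline
    return ratio, reduction


ratio, reduction = token_savings(GCF_TOKENS, JSON_TOKENS)
print(f"JSON uses {ratio:.1f}x the tokens of GCF; GCF saves {reduction:.1%}")
# prints: JSON uses 4.8x the tokens of GCF; GCF saves 79.2%
```

In other words, on this payload the JSON encoding costs roughly 4.8 times as many input tokens as GCF, a reduction of about 79%.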
Key facts
- GCF achieved 90.7% average comprehension accuracy vs. 68.5% for TOON and 53.6% for JSON across 10 models
- Claude Sonnet 4.6 scored 100% GCF comprehension on every run; Opus 4.6 and Haiku 4.5 each averaged 96.2%
- JSON averaged under 62% comprehension accuracy across the Claude family at 500 records
- All three Claude models generated valid, decoder-parseable GCF output using only a 3-line primer with zero prior training
- GCF used 11,090 input tokens at 500 symbols vs. 53,341 for JSON
- The benchmark comprised 1,300+ total evaluations: 23 comprehension runs and 28 generation runs across Anthropic, OpenAI, and Google
- Full methodology, raw logs, spec, and 6 implementations are publicly available; whitepaper DOI: 10.5281/zenodo.20579817
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.