Jun 9, 2026 · 1 min read · Research Papers

GCF wire format beats JSON on LLM comprehension across 10 models

u/blackwell-systems built a wire format called GCF and benchmarked 10 LLMs across 3 providers on reading and writing it with no prior training, finding GCF achieved 90.7% average comprehension accuracy vs. 53.6% for JSON.

r/ClaudeAI · u/blackwell-systems

Composite score: 7.6 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark demonstrates that a novel wire format can be read and written by frontier LLMs with zero training and a minimal primer, while substantially outperforming JSON on both comprehension accuracy and token efficiency at scale.

Summary: our read of the original

u/blackwell-systems designed a wire format called GCF and evaluated whether LLMs could read and write it without any prior training. The benchmark sent 10 models an identical raw payload of 500 symbols and 200 edges, then posed 13 extraction questions with no format instructions and no system prompt — just the raw data and a question. The evaluation covered 23 comprehension runs and 28 generation runs, totaling 1,300+ evaluations across Anthropic, OpenAI, and Google.
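The setup described above (raw payload plus a question, no system prompt, 13 extraction questions per run) can be sketched roughly as follows. This is only an illustration and not the author's harness: the function names `build_prompt` and `accuracy` are hypothetical, and exact-match grading is an assumption, since the summary does not say how answers were graded.

```python
def build_prompt(payload: str, question: str) -> str:
    """Concatenate the raw payload and the question.

    Mirrors the described setup: no system prompt and no format
    instructions, only the data followed by the question.
    The blank-line separator is an assumption.
    """
    return f"{payload}\n\n{question}"


def accuracy(answers: list[str], expected: list[str]) -> float:
    """Fraction of answers that exactly match the expected value.

    Exact string match (after trimming whitespace) is an assumed
    grading rule; the original benchmark may grade differently.
    """
    if not expected or len(answers) != len(expected):
        raise ValueError("answers and expected must be non-empty and equal length")
    correct = sum(a.strip() == e.strip() for a, e in zip(answers, expected))
    return correct / len(expected)
```

Under this sketch, one comprehension run would call `build_prompt` once per question and pass the 13 model replies to `accuracy`.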

On comprehension, GCF averaged 90.7% accuracy across all 10 models, compared to 68.5% for TOON and 53.6% for JSON. Among Claude models, Sonnet 4.6 scored 100% on every run, while Opus 4.6 and Haiku 4.5 each averaged 96.2%. JSON averaged under 62% across the Claude family at 500 records.

The failure modes also differed qualitatively. GCF errors were typically off by 1–2, caused by misread section header counts. On JSON, by contrast, Opus spent 143 lines manually enumerating symbols and still returned the wrong answer.

On generation, all three Claude models produced valid, decoder-parseable GCF output from only a 3-line primer: Opus 4.6 scored 5/5 with zero variance across 2 runs, and Sonnet 4.6 and Haiku 4.5 also scored 5/5.

GCF also used far fewer input tokens than JSON at the 500-symbol scale (11,090 vs. 53,341). The full methodology, raw logs, spec, and six implementations are published openly.
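To put the reported figures in relative terms, the short calculation below derives the token ratio and the accuracy gaps from the numbers quoted in this summary. Only the input values come from the source; the variable and function names are illustrative.

```python
# Figures as reported in the summary (500-symbol payload, 10-model averages).
GCF_TOKENS = 11_090
JSON_TOKENS = 53_341
ACCURACY = {"GCF": 90.7, "TOON": 68.5, "JSON": 53.6}  # percent


def token_ratio(baseline: int, candidate: int) -> float:
    """How many times more tokens the baseline uses than the candidate."""
    return baseline / candidate


def token_reduction_pct(baseline: int, candidate: int) -> float:
    """Percentage of baseline tokens saved by the candidate."""
    return (1 - candidate / baseline) * 100


ratio = token_ratio(JSON_TOKENS, GCF_TOKENS)              # about 4.81x
reduction = token_reduction_pct(JSON_TOKENS, GCF_TOKENS)  # about 79.2%

# Accuracy gaps in percentage points.
gap_vs_json = ACCURACY["GCF"] - ACCURACY["JSON"]  # about 37.1 points
gap_vs_toon = ACCURACY["GCF"] - ACCURACY["TOON"]  # about 22.2 points

print(f"JSON uses {ratio:.2f}x the input tokens of GCF ({reduction:.1f}% reduction)")
print(f"GCF leads JSON by {gap_vs_json:.1f} points and TOON by {gap_vs_toon:.1f} points")
```

These are derived values only; they describe the single 500-symbol payload reported here and say nothing about other payload sizes.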

Key facts

01 GCF achieved 90.7% average comprehension accuracy vs. 68.5% for TOON and 53.6% for JSON across 10 models
02 Claude Sonnet 4.6 scored 100% GCF comprehension on every run; Opus 4.6 and Haiku 4.5 each averaged 96.2%
03 JSON averaged under 62% comprehension accuracy across the Claude family at 500 records
04 All three Claude models generated valid, decoder-parseable GCF output using only a 3-line primer with zero prior training
05 GCF used 11,090 input tokens at 500 symbols vs. 53,341 for JSON
06 The benchmark comprised 1,300+ total evaluations: 23 comprehension runs and 28 generation runs across Anthropic, OpenAI, and Google
07 Full methodology, raw logs, spec, and 6 implementations are publicly available; whitepaper DOI: 10.5281/zenodo.20579817

Topics

#benchmarks #code-generation #model-evaluation #open-source #data-format

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
