Jun 9, 2026 · 1 min read · Research Papers

Frontier LLMs top out at 68.8% on Office proficiency benchmark

A new benchmark based on China's National Computer Rank Examination reveals that even the best agentic LLM system scores only 68.8% on standardized Word, Excel, and PowerPoint tasks, well below the 95.5% community-reference score.

ArXiv · Tengchao Lv, Dongdong Zhang, Jiayu Ding

Composite score: 6.5 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The benchmark exposes a large gap on standardized Office tasks between the best frontier LLM agent (68.8%) and the 95.5% community-reference score, showing that fine-grained document automation remains a significant unsolved challenge despite recent advances in code generation.

Summary: our read of the original

Tengchao Lv, Dongdong Zhang, and Jiayu Ding argue that professional Office software is an ideal testbed for document-automation capability because it demands long-horizon planning and reasoning, precise parameter configuration, and multi-application integration — properties that go well beyond simple code generation. To measure this, they construct an evaluation grounded in China's National Computer Rank Examination (NCRE), presenting 200 comprehensive practical-operation tasks spanning Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, with Score Rate (SR) defined as the mean percentage of rubric points earned across all tasks.
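The Score Rate definition above can be made concrete with a small sketch. This is an illustration of the stated formula only, not the authors' grading code: the `TaskResult` type, the task identifiers, and the point values are all made up for the example, and it assumes every task uses the 100-point rubric described in the summary.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One graded task (hypothetical structure, not from the paper)."""
    task_id: str
    points_earned: float
    points_possible: float = 100.0  # summary states a 100-point rubric per task


def score_rate(results: list[TaskResult]) -> float:
    """Mean percentage of rubric points earned across all tasks."""
    if not results:
        raise ValueError("need at least one task result")
    percentages = [100.0 * r.points_earned / r.points_possible for r in results]
    return sum(percentages) / len(percentages)


# Made-up scores for three illustrative tasks, one per application.
results = [
    TaskResult("word-01", 80),
    TaskResult("excel-01", 45),
    TaskResult("ppt-01", 70),
]
print(f"SR = {score_rate(results):.1f}%")  # SR = 65.0%
```

Because each task is first converted to a percentage and then averaged, every task contributes equally to SR regardless of how many individual criteria its rubric contains.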

Benchmarking 7 frontier LLMs, the paper finds stark limitations: single-turn models achieve a maximum SR of just 36.6%. An agentic system augmented with execution feedback, iterative repair, and broader Office automation access pushes performance to 68.8%, but this still falls meaningfully short of the 95.5% community-reference score used as a scoring sanity check. The authors conclude that despite recent progress in code generation, achieving reliable fine-grained Office document automation remains a significant open challenge for current LLM and agent systems.
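The summary does not say how the agentic system is built. As a rough illustration of what "execution feedback" and "iterative repair" usually mean in this kind of setup, the sketch below regenerates a script until an executor accepts it or a round limit is hit. Everything here is an assumption: `repair_loop`, `fake_generate`, `fake_execute`, and the toy `set_font` script are hypothetical stand-ins, not the paper's method or any real Office automation API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a generator maps (task, feedback) to a script,
# and an executor maps a script to (succeeded, feedback message).
Generator = Callable[[str, Optional[str]], str]
Executor = Callable[[str], Tuple[bool, Optional[str]]]


def repair_loop(generate: Generator, execute: Executor, task: str,
                max_rounds: int = 3) -> Tuple[str, bool, int]:
    """Generate, execute, and retry with feedback until success or the limit."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    script = ""
    for round_no in range(1, max_rounds + 1):
        script = generate(task, feedback)   # a single-turn setup stops after this
        ok, feedback = execute(script)      # execution feedback drives the repair
        if ok:
            return script, True, round_no
    return script, False, max_rounds


# Toy stand-ins so the sketch runs without any model or Office install.
def fake_generate(task: str, feedback: Optional[str]) -> str:
    return "set_font(size=12)" if feedback is None else "set_font(size=14)"


def fake_execute(script: str) -> Tuple[bool, Optional[str]]:
    if "size=14" in script:
        return True, None
    return False, "expected font size 14"


result = repair_loop(fake_generate, fake_execute, "Set the title font to 14 pt")
print(result)  # ('set_font(size=14)', True, 2)
```

With `max_rounds=1` the loop reduces to a single attempt with no repair, which is the closest analogue in this sketch to the single-turn setting.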

Key facts

01 Benchmark is derived from China's National Computer Rank Examination (NCRE), covering Word, Excel, and PowerPoint.
02 200 comprehensive practical-operation tasks are included, scored against 7,118 machine-gradable criteria.
03 Score Rate (SR) is defined as the mean percentage of rubric points earned across all tasks.
04 7 frontier LLMs were benchmarked.
05 Single-turn models achieved a maximum SR of 36.6%.
06 An agentic system with execution feedback, iterative repair, and broader Office automation access reached 68.8% SR.
07 The community-reference score used as a sanity check is 95.5%.

Topics

#benchmarks #agent-framework #code-generation #tool-use

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.
