Apr 21, 2026 · 1 min read · Research Papers

New benchmark exposes LLMs' critical gap in threat hunting

A new benchmark called the Cyber Defense Benchmark reveals that frontier LLMs fail dramatically at autonomous threat hunting in security operations, with the best model correctly flagging only 3.8% of malicious events on average.

arXiv · Alankrit Chona, Igor Kozlov, Ambuj Kumar

Composite score: 6.7 out of 10

Score weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Security teams and AI practitioners evaluating LLMs for autonomous SOC deployment should treat this benchmark as a warning: even the most capable frontier models today cannot reliably perform unsupervised threat hunting on real log data.

Summary: our read of the original

Alankrit Chona, Igor Kozlov, and Ambuj Kumar introduce the Cyber Defense Benchmark, a rigorous evaluation framework designed to measure how well LLM agents can perform threat hunting — a core SOC analyst task. Unlike existing security benchmarks that rely on guided questions, this benchmark presents agents with an in-memory SQLite database of 75,000–135,000 raw Windows event log records and requires them to iteratively submit SQL queries, discover malicious event timestamps, and explicitly flag them. Scoring is CTF-style, validated against Sigma-rule-derived ground truth. The benchmark covers 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics, and runs inside a Gymnasium reinforcement-learning environment using a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings.
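
The query-and-flag loop described above can be pictured with a minimal sketch. The table name `events`, its columns, the sample rows, and the `hunt` helper are all hypothetical, since the benchmark's actual schema and agent interface are not described here; only the use of an in-memory SQLite database queried with SQL comes from the paper summary.

```python
import sqlite3

# Hypothetical schema and rows: the benchmark's real table layout is not given here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, host TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2026-01-05T10:00:01Z", 4688, "ws01", "C:\\Windows\\System32\\notepad.exe"),
        ("2026-01-05T10:02:17Z", 4688, "ws01", "powershell.exe -enc SQBFAFgA"),
        ("2026-01-05T10:03:40Z", 4624, "ws01", None),
        ("2026-01-05T10:05:09Z", 4688, "ws02",
         "rundll32.exe comsvcs.dll, MiniDump 624 out.dmp full"),
    ],
)


def hunt(conn, patterns):
    """Return sorted timestamps of process-creation events (ID 4688)
    whose command line matches any of the given SQL LIKE patterns."""
    flagged = set()
    for pattern in patterns:
        rows = conn.execute(
            "SELECT ts FROM events WHERE event_id = 4688 AND command_line LIKE ?",
            (pattern,),
        )
        flagged.update(ts for (ts,) in rows)
    return sorted(flagged)


# The agent's "flags" are the timestamps it decides to submit.
flags = hunt(conn, ["%-enc %", "%comsvcs.dll%"])
print(flags)
```

In the benchmark itself the agent receives no such patterns up front; it has to decide what to look for across 75,000–135,000 records, which is where the evaluated models struggled.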

Five frontier models were evaluated across 26 campaigns covering 105 of the 106 procedures: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash. All models failed dramatically. The best-performing model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average, and no single run by any model found all flags in any campaign. The authors define a passing score as achieving ≥50% recall on every ATT&CK tactic — the minimum bar they consider necessary for unsupervised SOC deployment. No model passed: the leader cleared this bar on only 5 of 13 tactics, while the remaining four models cleared it on zero tactics.
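
The passing criterion can be illustrated with a short sketch, assuming each ground-truth malicious event maps to exactly one tactic and that recall is the fraction of those events whose timestamps were flagged. The function names, tactic labels, and sample data are hypothetical, and the paper's actual scoring may treat false positives or multi-tactic events differently.

```python
from collections import defaultdict


def recall_by_tactic(ground_truth, submitted):
    """ground_truth maps a malicious event timestamp to its tactic;
    submitted is the set of timestamps the agent flagged."""
    total = defaultdict(int)
    found = defaultdict(int)
    for ts, tactic in ground_truth.items():
        total[tactic] += 1
        if ts in submitted:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=0.5):
    """Pass only if recall meets the threshold on every tactic."""
    return all(r >= threshold for r in recalls.values())


# Hypothetical ground truth and agent submission.
ground_truth = {
    "t1": "execution",
    "t2": "execution",
    "t3": "credential-access",
    "t4": "credential-access",
    "t5": "discovery",
}
submitted = {"t1", "t3", "t4", "t9"}  # "t9" is not a malicious event

recalls = recall_by_tactic(ground_truth, submitted)
cleared = sorted(t for t, r in recalls.items() if r >= 0.5)
print(recalls, cleared, passes(recalls))
```

Because the criterion requires every tactic to clear the bar, a single missed tactic (here, `discovery`) fails the whole run even when the others score well.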

The paper's central finding is that strong performance on curated security Q&A benchmarks does not translate to open-ended, evidence-driven threat hunting. The gap between benchmark-friendly tasks and real SOC workflows remains wide, suggesting that current LLM agents are not yet viable for autonomous deployment in security operations centers without significant human oversight.

Key facts

01 The Cyber Defense Benchmark tests LLM agents on threat hunting using raw Windows event logs with no hints or guided questions.
02 It covers 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics.
03 Each episode gives the agent an in-memory SQLite database of 75,000–135,000 log records; agents must submit SQL queries to find and flag malicious event timestamps.
04 Five frontier models were evaluated: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.
05 The best model, Claude Opus 4.6, correctly flagged only 3.8% of malicious events on average; no run by any model found all flags.
06 A passing score is defined as ≥50% recall on every ATT&CK tactic; no model passed, and the leader cleared this bar on only 5 of 13 tactics.
07 The authors conclude current LLMs are poorly suited for open-ended threat hunting despite strong performance on curated security Q&A benchmarks.

Topics

#benchmarks #agent-framework #safety #tool-use

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 22, 2026 · 19:13 UTC.
