New benchmark exposes LLMs' critical gap in threat hunting
A new benchmark called the Cyber Defense Benchmark reveals that frontier LLMs fail dramatically at autonomous threat hunting in security operations, with the best model correctly flagging only 3.8% of malicious events on average.
Security teams and AI practitioners evaluating LLMs for autonomous SOC deployment should treat this benchmark as a warning: even the most capable frontier models today cannot reliably perform unsupervised threat hunting on real log data.
Alankrit Chona, Igor Kozlov, and Ambuj Kumar introduce the Cyber Defense Benchmark, an evaluation framework that measures how well LLM agents perform threat hunting, a core SOC analyst task. Unlike existing security benchmarks that rely on guided questions, this benchmark presents agents with an in-memory SQLite database of 75,000–135,000 raw Windows event log records and requires them to iteratively submit SQL queries, discover malicious event timestamps, and explicitly flag them. Scoring is CTF-style, validated against Sigma-rule-derived ground truth. The benchmark covers 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics. It runs inside a Gymnasium reinforcement-learning environment, using a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings.
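The query-and-flag loop described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's code: the `events` table, its columns, the sample rows, and the `ground_truth` set are all assumptions made for the example, since this summary does not describe the real schema or the environment's API.

```python
import sqlite3

# Hypothetical schema and rows for illustration only; the benchmark's real
# table layout and event fields are not described in this summary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, event_id INTEGER, host TEXT, command_line TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 4688, "ws01", "notepad.exe report.txt"),
        ("2024-01-01T10:02:10", 4688, "ws01", "powershell.exe -enc SQBFAFgA"),
        ("2024-01-01T10:03:45", 4624, "ws01", None),
        ("2024-01-01T10:05:30", 4688, "ws02", "rundll32.exe comsvcs.dll MiniDump"),
    ],
)

# Assumed ground truth: timestamps of the malicious events in this toy log.
ground_truth = {"2024-01-01T10:02:10", "2024-01-01T10:05:30"}

# One hunting step: the agent issues a SQL query and flags what it finds.
rows = conn.execute(
    "SELECT ts FROM events WHERE event_id = 4688 AND command_line LIKE '%-enc%'"
).fetchall()
flagged = {ts for (ts,) in rows}

# Recall: the share of malicious events that the agent flagged.
correct = flagged & ground_truth
recall = len(correct) / len(ground_truth)
print(f"flagged={sorted(flagged)} recall={recall:.2f}")
```

In this toy log the single query catches the encoded PowerShell command but misses the second malicious event, so recall is one of two. An agent in the real benchmark would need many such queries across tens of thousands of records, with no hint about which event types matter.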
Five frontier models were evaluated across 26 campaigns covering 105 of the 106 procedures: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash. All of them fell far short. The best-performing model, Claude Opus 4.6, submitted correct flags for only 3.8% of malicious events on average, and no single run by any model found all flags in any campaign. The authors define a passing score as achieving ≥50% recall on every ATT&CK tactic, the minimum bar they consider necessary for unsupervised SOC deployment. No model passed: the leader cleared this bar on only 5 of 13 tactics, while the remaining four models cleared it on none.
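The passing rule is stricter than an average, which a small sketch makes concrete. The function names, tactic labels, and event identifiers below are hypothetical and chosen for illustration; only the rule itself (at least 50% recall on every tactic) comes from the paper as summarized here.

```python
def recall_by_tactic(truth, flagged):
    """Return {tactic: recall} given ground-truth event IDs per tactic."""
    return {
        tactic: len(events & flagged) / len(events)
        for tactic, events in truth.items()
        if events  # skip tactics that have no ground-truth events
    }


def passes(recalls, threshold=0.5):
    """Pass only if every tactic meets the recall threshold."""
    return all(r >= threshold for r in recalls.values())


# Hypothetical ground truth: malicious event IDs grouped by ATT&CK tactic.
truth = {
    "execution": {"t1", "t2"},
    "persistence": {"t3", "t4", "t5", "t6"},
    "credential-access": {"t7", "t8"},
}
# Hypothetical set of events an agent flagged.
flagged = {"t1", "t2", "t3", "t7"}

recalls = recall_by_tactic(truth, flagged)
print(recalls, passes(recalls))
```

Here the agent flags four of the eight malicious events, so overall recall is 50%, yet it still fails because recall on the persistence tactic is only 25%. Requiring the threshold on every tactic prevents strength in one area from masking a blind spot in another.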
The paper's central finding is that strong performance on curated security Q&A benchmarks does not translate to open-ended, evidence-driven threat hunting. The gap between benchmark-friendly tasks and real SOC workflows remains wide, suggesting that current LLM agents are not yet viable for autonomous deployment in security operations centers without significant human oversight.
Key facts
- The Cyber Defense Benchmark tests LLM agents on threat hunting using raw Windows event logs with no hints or guided questions.
- It covers 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics.
- Each episode gives the agent an in-memory SQLite database of 75,000–135,000 log records; agents must submit SQL queries to find and flag malicious event timestamps.
- Five frontier models were evaluated: Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash.