UC Berkeley's Agents' Last Exam stumps top AI models, with GPT-5.5 topping out at 24%
A new UC Berkeley-led benchmark called Agents' Last Exam (ALE) tested leading AI models across more than 50 industries, with OpenAI's GPT-5.5 achieving the highest pass rate at just 24%.
Score breakdown
With every leading model scoring below 25%, ALE's results point to a substantial gap between current AI capabilities and reliable performance on real-world professional tasks.
- OpenAI's GPT-5.5 scored the highest of all models tested, with a 24% pass rate.
- Anthropic's Claude Fable 5 scored 22%, the second-highest result.
- Google Gemini, DeepSeek, and Grok all scored below 16%.
UC Berkeley's Center for Responsible, Decentralized Intelligence, co-directed by computer science professor Dawn Song and Haas School of Business professor Christine Parlour, has released a new AI benchmark called Agents' Last Exam (ALE). Developed with input from more than 300 industry experts and supported by 13 advisers from academia and industry, ALE tests AI agents on real-world tasks spanning more than 50 disciplines — ranging from audio processing to theoretical physics. Pass rates reflect runs in which an AI agent achieves a perfect score across all assigned tasks.
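To make the all-or-nothing scoring rule concrete, here is a minimal sketch of one way such a pass rate could be computed. It assumes each run is recorded as a list of per-task scores and that 1.0 means full marks. The function name, the data layout, and the sample numbers are hypothetical illustrations, not ALE's actual harness or results.

```python
from typing import Sequence


def strict_pass_rate(runs: Sequence[Sequence[float]], full_marks: float = 1.0) -> float:
    """Return the fraction of runs in which every task earned full marks.

    Assumption: a run is a sequence of per-task scores, and a run with
    no tasks does not count as a pass.
    """
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(
        1 for scores in runs
        if scores and all(s >= full_marks for s in scores)
    )
    return passed / len(runs)


# Hypothetical scores, for illustration only.
example_runs = [
    [1.0, 1.0, 1.0],  # every task perfect: counts as a pass
    [1.0, 0.9, 1.0],  # one task short of full marks: fails
    [1.0, 1.0],       # every task perfect: counts as a pass
    [0.0, 1.0, 1.0],  # one task failed outright: fails
]

print(strict_pass_rate(example_runs))  # 0.5
```

Under this reading, a single imperfect task sinks the whole run, which is one reason a strict pass rate can sit far below an average per-task score.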
Among the models evaluated, OpenAI's GPT-5.5 scored the highest at a 24% pass rate, with Anthropic's Claude Fable 5 close behind at 22%. Google Gemini, DeepSeek, and Grok all fell below 16%. ALE distinguishes itself from other benchmarks by covering a broader range of disciplines and by routinely updating its tasks to minimize contamination — the overlap between training and evaluation data that can cause artificially inflated scores. Yiyou Sun, a UC Berkeley postdoc who leads the ALE project from Song's group, described the tasks as "actual jobs that experts have worked on," framing the benchmark as a tool for tracking AI progress in areas that are "GDP relevant."
Key facts
- OpenAI's GPT-5.5 scored the highest of all models tested, with a 24% pass rate.
- Anthropic's Claude Fable 5 scored 22%, the second-highest result.
- Google Gemini, DeepSeek, and Grok all scored below 16%.
- ALE covers tasks across more than 50 industries, from audio processing to theoretical physics.
- The benchmark was developed with more than 300 industry experts and has 13 academic and industry advisers.
- Tasks are routinely updated to minimize data contamination, which can inflate benchmark scores.
- ALE is led by the Berkeley Center for Responsible, Decentralized Intelligence, co-directed by professors Dawn Song and Christine Parlour.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.