OpenAI's evals lead on why old benchmarks are breaking down
Tejal Patwardhan, who leads OpenAI's frontier evals team, discusses how rapidly improving models are saturating existing benchmarks and why building new, harder evaluations has become central to understanding AI progress.
As frontier models saturate existing benchmarks, the work of designing harder, more meaningful evaluations becomes the primary mechanism by which the field can track — and anticipate — the pace of AI capability growth.
Tejal Patwardhan leads OpenAI's frontier evals team. She joined the company in fall 2023, initially on the preparedness team, as early reasoning model results were beginning to surface. Speaking with host Andrew Mayne on the OpenAI podcast, she describes evals as a way to measure and understand model capabilities, and to anticipate progress before it becomes widely visible. She introduces the concept of "capability overhang": models become capable of tasks long before people adopt them for those tasks, because of cultural, legal, or regulatory barriers.
Patwardhan explains that a core challenge in her work is that old benchmarks are becoming too easy: models are saturating them, which makes them unreliable signals of true capability. She discusses what makes a good benchmark, why evals are getting harder to design, and how OpenAI is approaching the measurement of newer modalities like voice and vision, as well as testing models on real scientific problems. A recurring theme is her own recalibration: she recalls her team being nervous about whether a model would beat a human baseline on a difficult eval, and says the lesson was that they "should never underestimate the model."
Key facts
- Tejal Patwardhan leads OpenAI's frontier evals team and joined OpenAI in fall 2023.
- She joined as part of the preparedness team, focused on understanding how capable models were becoming.
- Early reasoning model results were emerging shortly after she joined.
- She describes 'capability overhang': models becoming capable of things before people adopt them, due to cultural, legal, or regulatory barriers.
- Old benchmarks are getting saturated as models improve, making them unreliable measures of progress.
- The team is developing evals for voice and vision models and for real scientific tasks.
- Patwardhan says her team learned they 'should never underestimate the model' after a model beat a human baseline they expected to be too hard.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.