Five eval mistakes data scientists keep seeing in AI engineering
Shreya Shankar and Hamel Husain, who have taught evals to over 4,500 people across dozens of companies, presented five common evaluation pitfalls at LangChain's Interrupt conference and explained how applying data science thinking fixes them.
Score breakdown
The talk identifies a concrete regression in evaluation rigor — from data-science-grounded practices to ad hoc LLM-graded metrics — and maps five specific failure modes that teams building on agents are repeating at scale.
At LangChain's Interrupt conference, Shreya Shankar and Hamel Husain presented a talk titled "The Return of the Data Scientist," arguing that the observability stack underpinning agent systems — logs, metrics, and traces — is fundamentally a data science problem. They contrast ML engineering practices from four years ago, which involved careful data examination, human-label alignment, and deliberate metric design tied to business goals, with what they describe as today's "vibes-based evaluation," where engineers rely on LLM self-grading, off-the-shelf metric packages, and 1-to-100 scoring scales without rigorous design.
The five mistakes they identify are: (1) using generic or off-the-shelf metrics like "helpfulness" and "hallucination" that are too ambiguous to be actionable; (2) blindly trusting LLM judges rather than treating them as imperfect classifiers subject to imbalanced classification problems; (3) bad experimental design, including poorly structured synthetic data generation and the use of broad numeric scales instead of binary classification; (4) having the wrong people label data, which leads to criteria drift; and (5) automating evals too heavily, which causes teams to miss the product failures that matter most. They also flag additional pitfalls including use of ROUGE/BLEU metrics, unhelpful judge prompts, raw JSON outputs, and uncalibrated scores. The talk closes with a call to always look at your data, framing the solution as a return to exploratory data analysis, careful metric design, and model validation.
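To make the second mistake concrete, here is a minimal sketch of scoring an LLM judge against human labels as if it were a binary classifier. The function name `evaluate_judge`, the `JudgeReport` fields, and the example numbers are illustrative assumptions, not material from the talk. The sketch shows why plain accuracy can mislead when failures are rare: a judge that never flags anything still looks accurate.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    accuracy: float
    true_positive_rate: float  # share of human-labeled failures the judge flags
    true_negative_rate: float  # share of human-labeled passes the judge passes


def evaluate_judge(human_labels, judge_labels):
    """Compare judge verdicts with human labels (1 = failure, 0 = pass).

    Hypothetical helper for illustration; not an API from any eval library.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    tp = fp = tn = fn = 0
    for human, judge in zip(human_labels, judge_labels):
        if human and judge:
            tp += 1
        elif human and not judge:
            fn += 1
        elif not human and judge:
            fp += 1
        else:
            tn += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one human-labeled failure and one pass")

    return JudgeReport(
        accuracy=(tp + tn) / len(human_labels),
        true_positive_rate=tp / (tp + fn),
        true_negative_rate=tn / (tn + fp),
    )


# Made-up data: 10 failures among 100 traces (an imbalanced set).
human = [1] * 10 + [0] * 90

# A judge that passes everything scores high accuracy yet catches no failures.
lazy_judge = [0] * 100
lazy_report = evaluate_judge(human, lazy_judge)

# A judge that catches 8 of 10 failures and wrongly flags 9 of 90 passes.
better_judge = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81
better_report = evaluate_judge(human, better_judge)

print(lazy_report)
print(better_report)
```

Reporting the two rates separately, rather than one accuracy figure, is one common way to expose a judge that simply agrees with the majority class.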
Key facts
- 01 Shreya Shankar and Hamel Husain have taught evals to over 4,500 people across dozens of companies.
- 02 They argue AI engineering has regressed to 'vibes-based evaluation' compared to rigorous ML engineering practices.
- 03 Mistake 1: generic metrics like 'helpfulness' and 'hallucination' are too ambiguous to use off the shelf.
- 04 Mistake 2: LLM judges should be treated as imperfect classifiers with train/dev/test splits, not trusted blindly.
- 05 Mistake 3: bad experimental design, including use of 1-to-100 scales instead of binary classification problems.
- 06 Mistake 4: having the wrong people label data causes criteria drift.
- 07 Mistake 5: fully automating evals causes teams to miss product failures that matter most.
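The train/dev/test idea in the second mistake can be sketched as a stratified split of human-labeled traces, so that rare failures appear in every split. This is an assumption-laden illustration: the function `stratified_split`, the 20/40/40 proportions, and the suggested role of each split (train for picking few-shot examples, dev for iterating on the judge prompt, test for the final reported agreement) are choices made here, not details given in the summary.

```python
import random


def stratified_split(examples, label_of, fractions=(0.2, 0.4), seed=0):
    """Split labeled examples into train/dev/test, preserving label balance.

    `fractions` gives the train and dev shares; the remainder becomes test.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group examples by label so each class is split separately.
    by_label = {}
    for example in examples:
        by_label.setdefault(label_of(example), []).append(example)

    train, dev, test = [], [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * fractions[0])
        n_dev = round(len(shuffled) * fractions[1])
        train.extend(shuffled[:n_train])
        dev.extend(shuffled[n_train:n_train + n_dev])
        test.extend(shuffled[n_train + n_dev:])
    return train, dev, test


# Made-up data: (trace_id, is_failure) pairs with 10 failures in 100 traces.
labeled = [(i, i < 10) for i in range(100)]
train, dev, test = stratified_split(labeled, label_of=lambda ex: ex[1])

for name, split in (("train", train), ("dev", dev), ("test", test)):
    failures = sum(1 for _, is_failure in split if is_failure)
    print(name, len(split), "examples,", failures, "failures")
```

Keeping the test split untouched until the judge prompt is frozen is what lets its agreement numbers stand in for performance on unseen traces.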
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 13, 2026 · 08:58 UTC.