Lyft's ML lead shares eval framework for 270K monthly AI interactions
Nick Ung, ML lead at Lyft, presented at Interrupt 26 how his team built an evaluation system for AI Assist — their customer care agent platform — covering offline simulation, LLM-as-judge rubrics, and the lessons learned scaling to 270,000 AI interactions per month.
Score breakdown
The talk documents a concrete, production-tested eval architecture that connects offline simulation to live agent behavior at scale, over a period in which Lyft's AI resolution rate climbed from 10% to 35%.
Nick Ung, who leads data science and machine learning for safety and customer care at Lyft, presented at Interrupt 26 — LangChain's agent conference — on how his team built an evaluation framework capable of scaling production AI agents. The talk covers the full arc of Lyft's AI Assist journey, which began in 2024 with simple deterministic logic and has since grown to more than seven agents in production serving 270,000 AI interactions per month across 79 million monthly trips. The AI resolution rate has risen from 10% to 35%, with a 65% deflection rate, and Ung emphasized that Lyft holds itself to a high bar — requiring agents to fully resolve customer issues end to end rather than simply blocking customers from reaching human support.
The eval system Ung describes draws on his traditional ML background and applies those principles to agent evaluation. Core elements include offline eval as a quality gate, a lightweight simulator using mocked MCP outputs, and LLM-as-judge scoring built around task-based rubrics and actionable failure modes rather than scalar metrics. The talk also covers the pitfalls of offline evals — including a "rude awakening" when offline results diverge from production behavior — and how LangSmith is used to manage the overall eval workflow. Agents highlighted include a multi-modal damage claim processing agent that returns a claim decision to drivers in about 15 minutes after photo upload, and a rider-side refund agent that processes more than 80 automation rules and refund logic to contextually explain decisions.
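The offline quality gate and mocked-tool simulator described above can be illustrated with a minimal Python sketch. This is not Lyft's implementation: the tool names (`get_ride`, `get_refund_policy`), the `toy_agent` logic, the scenarios, and the 0.9 threshold are all hypothetical, and the mocked responses are plain dictionaries rather than real MCP calls. It only shows the general shape of running an agent against canned tool outputs and blocking a release when the pass rate falls below a threshold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical canned tool outputs standing in for live MCP tool calls.
MOCK_TOOL_OUTPUTS: Dict[str, dict] = {
    "get_ride": {"ride_id": "r1", "fare_usd": 18.50, "cancelled_by": "driver"},
    "get_refund_policy": {"driver_cancel_refundable": True},
}


def mock_tool(name: str) -> dict:
    """Return a canned response instead of calling a real tool server."""
    return MOCK_TOOL_OUTPUTS[name]


@dataclass
class Scenario:
    name: str
    user_message: str
    expected_action: str


def toy_agent(user_message: str, call_tool: Callable[[str], dict]) -> str:
    """Stand-in agent (an assumption): picks an action from tool data only."""
    ride = call_tool("get_ride")
    policy = call_tool("get_refund_policy")
    if ride["cancelled_by"] == "driver" and policy["driver_cancel_refundable"]:
        return "issue_refund"
    return "escalate_to_human"


def run_offline_eval(agent, scenarios: List[Scenario]) -> float:
    """Run every scenario against mocked tools and return the pass rate."""
    passed = sum(
        agent(s.user_message, mock_tool) == s.expected_action for s in scenarios
    )
    return passed / len(scenarios)


def quality_gate(pass_rate: float, threshold: float = 0.9) -> bool:
    """Allow a release only if the offline pass rate meets the threshold."""
    return pass_rate >= threshold


SCENARIOS = [
    Scenario("driver_cancel", "My driver cancelled and I was charged", "issue_refund"),
    Scenario("charged_after_cancel", "Charged after a cancellation", "issue_refund"),
]

pass_rate = run_offline_eval(toy_agent, SCENARIOS)
print(f"pass rate: {pass_rate:.2f}, release allowed: {quality_gate(pass_rate)}")
```

Because the tool outputs are fixed, a run like this is repeatable, but it can only be as representative as its mocks, which is one plausible source of the gap between offline results and production behavior that the talk warns about.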
Key facts
- 01 Lyft's AI Assist handles 270,000 AI interactions per month across 79 million monthly trips.
- 02 The platform has more than seven AI agents in production, built on LangChain, LangGraph, LangSmith, and MCP.
- 03 AI resolution rate has grown from 10% to 35% since the program started in 2024; deflection rate is 65%.
- 04 The eval system uses offline simulation with mocked MCP outputs as a quality gate before production.
- 05 LLM-as-judge scoring is built around task-based rubrics and actionable failure modes rather than scalar metrics.
- 06 A multi-modal damage claim agent returns a decision to drivers in about 15 minutes after photo upload.
- 07 A rider-side refund agent processes more than 80 automation rules and refund logic to contextually explain decisions.
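To make the contrast between failure modes and scalar metrics concrete, here is a small Python sketch of how judge verdicts might be aggregated. Everything in it is an assumption for illustration: the failure-mode names, the `Verdict` shape, and the sample data are hypothetical, and the judge itself is not shown (the verdicts are written by hand where a real system would parse them from an LLM judge's structured output).

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical failure-mode taxonomy; a real rubric would define its own.
FAILURE_MODES = {
    "wrong_policy_applied",
    "missing_explanation",
    "unnecessary_escalation",
}


@dataclass
class Verdict:
    task: str
    passed: bool
    failure_mode: Optional[str] = None  # set only when passed is False


def validate(verdict: Verdict) -> None:
    """Reject verdicts that do not follow the rubric's structure."""
    if verdict.passed and verdict.failure_mode is not None:
        raise ValueError(f"{verdict.task}: passing verdict has a failure mode")
    if not verdict.passed and verdict.failure_mode not in FAILURE_MODES:
        raise ValueError(f"{verdict.task}: unknown failure mode")


def summarize(verdicts: List[Verdict]) -> dict:
    """Report failure counts per mode instead of one averaged score."""
    for v in verdicts:
        validate(v)
    failures = Counter(v.failure_mode for v in verdicts if not v.passed)
    return {
        "total": len(verdicts),
        "passed": sum(v.passed for v in verdicts),
        "failures": dict(failures),
    }


# Hand-written sample verdicts standing in for LLM-judge output.
verdicts = [
    Verdict("refund_explained", True),
    Verdict("refund_explained", False, "missing_explanation"),
    Verdict("claim_decision", False, "wrong_policy_applied"),
    Verdict("refund_explained", False, "missing_explanation"),
    Verdict("claim_decision", True),
]

report = summarize(verdicts)
print(report)
```

A single pass rate of 40% would say only that something is wrong; the per-mode counts point at what to fix first, which is the sense in which failure modes are actionable.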
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.