Jun 7, 2026 · 1 min read · Applications & Use Cases

Arize AI's Dat Ngo outlines LLM observability and eval strategies

Dat Ngo from Arize AI walks through what observability, evaluation, and experimentation actually look like when debugging nondeterministic AI agents, covering five types of eval signal and the open-source Arize Phoenix platform.

YouTube: AI Engineer

Composite score: 4.9 out of 10

Weights: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

The talk illustrates why standard code-level debugging is insufficient for agentic systems and presents a concrete framework (telemetry, multi-scope evals, and automated analysis) for making nondeterministic AI agents production-ready.

Summary: our read of the original

Dat Ngo, an AI architect at Arize AI, delivered a talk at AI Engineer on the practical realities of observability, evaluation, and experimentation for AI agents and harnesses. He frames the core challenge as nondeterminism: unlike traditional software, an agent does not follow a fixed execution path, so reading the code cannot tell you what the agent actually did; only telemetry can. Arize builds on OpenTelemetry (OTel) as its foundation: a single line of code that enables an auto-instrumenter generates traces and spans across frameworks and SDKs, producing a structured audit record of agent behavior.
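To make the idea of traces and spans as an audit record concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry API and not Arize's auto-instrumenter; the `Span` and `Tracer` classes, the span names, and the toy `run_agent` function are all hypothetical and only illustrate the parent/child structure that real tracing produces.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One recorded step of an agent run (hypothetical structure)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0
    status: str = "ok"


class Tracer:
    """Collects finished spans in memory; a real exporter would ship them elsewhere."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent, name, dict(attributes))
        current.start = time.perf_counter()
        self._stack.append(current)
        try:
            yield current
        except Exception:
            current.status = "error"
            raise
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)  # recorded in order of completion


def run_agent(tracer: Tracer, question: str) -> str:
    """Toy agent: one tool call, then one faked model call."""
    with tracer.span("agent.run", input=question):
        with tracer.span("tool.lookup", tool="weather") as tool_span:
            tool_span.attributes["output"] = "sunny"
        with tracer.span("llm.generate", model="placeholder-model") as llm_span:
            answer = "It is sunny."
            llm_span.attributes["output"] = answer
    return answer


tracer = Tracer()
result = run_agent(tracer, "What is the weather?")
span_names = [s.name for s in tracer.spans]
```

After the run, `tracer.spans` holds one record per step, each linked to its parent, so the path the agent took can be reconstructed even though nothing in the code fixes that path in advance.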

Ngo describes five flavors of eval signal — LLM as judge, human feedback, golden datasets, deterministic checks, and business metrics — and explains that evals can be scoped at the level of a single span, multispan, trajectory, or session. He also flags a key pitfall in nondeterministic systems: fixing one perceived bug can silently introduce two or three regressions elsewhere, making continuous eval coverage essential.
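The regression pitfall can be illustrated with two of the five signals, a golden dataset and a deterministic check. The sketch below is an assumption-laden toy, not a method from the talk: the dataset, the two agent versions, and the helper names `run_evals` and `find_regressions` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical golden dataset: inputs paired with expected answers.
GOLDEN: List[Dict[str, str]] = [
    {"id": "q1", "input": "2+2", "expected": "4"},
    {"id": "q2", "input": "capital of France", "expected": "Paris"},
    {"id": "q3", "input": "3*3", "expected": "9"},
]


def agent_v1(text: str) -> str:
    """Original version: gets the capitalization of q2 wrong."""
    return {"2+2": "4", "capital of France": "paris", "3*3": "9"}[text]


def agent_v2(text: str) -> str:
    """'Fixed' version: repairs q2 but now spells out numbers."""
    return {"2+2": "four", "capital of France": "Paris", "3*3": "nine"}[text]


def exact_match(output: str, expected: str) -> bool:
    """Deterministic check: passes only on an exact string match."""
    return output == expected


def run_evals(
    agent: Callable[[str], str],
    dataset: List[Dict[str, str]],
    check: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Run the agent over every golden example and record pass/fail per id."""
    return {row["id"]: check(agent(row["input"]), row["expected"]) for row in dataset}


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Ids that passed before the change and fail after it."""
    return sorted(k for k, passed in before.items() if passed and not after.get(k, False))


before = run_evals(agent_v1, GOLDEN, exact_match)
after = run_evals(agent_v2, GOLDEN, exact_match)
regressions = find_regressions(before, after)
fixed = sorted(k for k, passed in after.items() if passed and not before[k])
```

Here the change fixes one example (`q2`) and breaks two others (`q1` and `q3`). Looking only at the example that was fixed would hide the regressions, which is why the full eval set is rerun on every change.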

On the tooling side, Arize Phoenix is open source and runs as a single container with no Kubernetes required. The enterprise offering adds an AI layer called Alex, which scans traces, surfaces high latency and errors, and creates evals automatically. Ngo describes the long-term goal as automating engineers out of the observability loop entirely. The transcript is truncated before the full talk concludes.
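The kind of scan described here, surfacing high latency and errors from trace data, can be sketched as a simple threshold pass over span records. This says nothing about how Alex is actually implemented; the span dictionaries, the field names, and the 2000 ms threshold are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical span records, as they might be exported from a tracing backend.
SPANS: List[Dict[str, Any]] = [
    {"name": "llm.generate", "duration_ms": 420.0, "status": "ok"},
    {"name": "tool.search", "duration_ms": 5300.0, "status": "ok"},
    {"name": "tool.db_query", "duration_ms": 80.0, "status": "error"},
    {"name": "llm.generate", "duration_ms": 610.0, "status": "ok"},
]


def scan_spans(
    spans: List[Dict[str, Any]], latency_threshold_ms: float = 2000.0
) -> Dict[str, List[str]]:
    """Flag spans that exceed the latency threshold or ended in an error."""
    findings: Dict[str, List[str]] = {"slow": [], "errors": []}
    for span in spans:
        if span["duration_ms"] > latency_threshold_ms:
            findings["slow"].append(span["name"])
        if span["status"] == "error":
            findings["errors"].append(span["name"])
    return findings


findings = scan_spans(SPANS)
```

A fixed threshold is the simplest possible rule; an automated layer would presumably go further, for example by turning each finding into a new eval, but that step is not sketched here.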

Key facts

01 Dat Ngo is an AI architect at Arize AI who works with large enterprises on observability, evaluation, and experimentation.
02 He identifies five flavors of eval signal: LLM as judge, human feedback, golden datasets, deterministic checks, and business metrics.
03 Evals can be scoped at four levels: single span, multispan, trajectory, and session.
04 Arize builds on OpenTelemetry (OTel); a single auto-instrumenter line of code generates traces and spans across frameworks.
05 Arize Phoenix is open source and runs as a single container with no Kubernetes required.
06 The enterprise product includes an AI layer called Alex that scans traces, surfaces high latency and errors, and auto-creates evals.
07 Ngo warns that fixing one issue in a nondeterministic system can silently introduce multiple regressions.

Topics

#observability #agent-framework #evaluation #monitoring #telemetry

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
