Observability

Last verified: 2026-04-17 · next review in 118 days

You can't improve what you can't see. Agent observability means: every call captured, every trace reconstructable, every dollar attributed, every regression noticed. This page covers what to log, where to send it, and which tools to pick.

What to log

Per LLM call:

Timestamp + request ID (correlation across services)
Model + version (model routing + regression attribution)
Prompt fingerprint or full prompt (reproducibility)
Input tokens + output tokens (cost accounting)
Cache hits / misses (Anthropic returns cache_read_input_tokens, cache_creation_input_tokens)
Latency (total + time-to-first-token for streams)
Tool calls + results (the agent's action trail)
Status + error (retry / backoff signals)
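
As a sketch of the per-call fields above, the record could be typed roughly like this. The interface name, the field names, and the promptFingerprint helper are illustrative assumptions and not part of any SDK. The fingerprint hashes the model name plus the prompt with SHA-256, so identical prompts can be grouped without storing the full text.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; field names are illustrative, not a fixed schema.
interface LlmCallRecord {
  timestamp: string; // ISO 8601
  requestId: string; // correlation ID shared across services
  model: string;
  promptFingerprint: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  latencyMs: number;
  timeToFirstTokenMs: number | null; // null for non-streaming calls
  toolCalls: { name: string; ok: boolean }[];
  status: "ok" | "error";
  error: string | null;
}

// SHA-256 over model + prompt: the same inputs always yield the same hex digest.
function promptFingerprint(model: string, prompt: string): string {
  return createHash("sha256").update(model).update("\n").update(prompt).digest("hex");
}

// Example record with made-up values.
const exampleRecord: LlmCallRecord = {
  timestamp: new Date().toISOString(),
  requestId: "req-123",
  model: "example-model",
  promptFingerprint: promptFingerprint("example-model", "Summarize this ticket."),
  inputTokens: 1200,
  outputTokens: 300,
  cacheReadTokens: 0,
  cacheWriteTokens: 0,
  latencyMs: 850,
  timeToFirstTokenMs: null,
  toolCalls: [{ name: "search", ok: true }],
  status: "ok",
  error: null,
};
```

Storing the fingerprint alongside (or instead of) the full prompt is a trade-off: it keeps rows small and avoids persisting sensitive text, but you can only reproduce a call if the prompt itself is recoverable elsewhere.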

Per agent session:

Session ID that ties all LLM calls + tool uses together
User / tenant ID (for billing, for per-user rate limits)
Final outcome (success / partial / failed)
Total cost (sum of per-call costs)
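
A minimal sketch of the session rollup, assuming each per-call row already carries a session ID and a cost. The type names and summarizeSession are hypothetical; the final outcome is passed in by the agent rather than derived from the calls.

```typescript
type Outcome = "success" | "partial" | "failed";

// Hypothetical minimal view of a logged LLM call.
interface CallSummary {
  sessionId: string;
  cost: number; // USD
}

interface SessionSummary {
  sessionId: string;
  userId: string;
  callCount: number;
  totalCost: number; // sum of per-call costs
  outcome: Outcome;
}

// Collects the calls belonging to one session and sums their cost.
function summarizeSession(
  sessionId: string,
  userId: string,
  calls: CallSummary[],
  outcome: Outcome,
): SessionSummary {
  const own = calls.filter((c) => c.sessionId === sessionId);
  const totalCost = own.reduce((sum, c) => sum + c.cost, 0);
  return { sessionId, userId, callCount: own.length, totalCost, outcome };
}
```

Because the total is computed from raw per-call rows, it can be recomputed later if pricing is corrected.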

Observability stack options

Self-hosted / custom (what Agentic Universe does)

Log token counts + cost to your own database (Postgres in our case) on every LLM call. See src/services/pipeline/cost-tracker.ts for the pricing map and calculateCost() helper. Pros: zero vendor cost, full control. Cons: you build your own dashboards.
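
For illustration only, a pricing map and cost helper might look like the sketch below. This is not the contents of cost-tracker.ts: the estimateCost name, the Usage shape, the model name, and every price are placeholders, so substitute your provider's current rates.

```typescript
// USD per million tokens. Placeholder numbers for a made-up model, NOT real prices.
interface ModelPricing {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok: number;
  cacheWritePerMTok: number;
}

const PRICING: Record<string, ModelPricing> = {
  "example-model": {
    inputPerMTok: 2,
    outputPerMTok: 10,
    cacheReadPerMTok: 0.2,
    cacheWritePerMTok: 2.5,
  },
};

interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

// Multiplies each token count by its per-million rate and returns USD.
function estimateCost(model: string, usage: Usage): number {
  const p = PRICING[model];
  if (!p) {
    // Failing loudly beats silently logging a zero cost for an unknown model.
    throw new Error(`No pricing entry for model: ${model}`);
  }
  const micros =
    usage.inputTokens * p.inputPerMTok +
    usage.outputTokens * p.outputPerMTok +
    (usage.cacheReadTokens ?? 0) * p.cacheReadPerMTok +
    (usage.cacheWriteTokens ?? 0) * p.cacheWritePerMTok;
  return micros / 1_000_000;
}
```

Throwing on an unknown model is a design choice in this sketch: a newly routed model then shows up as an error instead of as free usage.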

Langfuse

Open-source LLM observability. Self-hosted or SaaS. Captures traces, prompts, token counts, costs, user feedback; supports eval runs and prompt versioning. Integrates via SDK in minutes.

Pick if: you want a full observability platform without SaaS lock-in and have capacity to self-host
Website: langfuse.com

Helicone

Proxy-based logging. You route requests through Helicone's proxy; every call is logged transparently with cost/cache headers. Minimal code change.

Pick if: you want zero-code-change logging and don't mind a proxy hop
Website: helicone.ai

PromptLayer

Focused on prompt versioning + evals + cost tracking. Strong if you iterate on prompts frequently.

Pick if: prompt iteration is the main activity and you need version diffing
Website: promptlayer.com

Logfire (Pydantic)

OpenTelemetry-based observability from the Pydantic team. Pairs naturally with Pydantic AI. Good fit for Python-heavy stacks.

Pick if: you're on Pydantic AI or want OTel-native tracing
Website: pydantic.dev/logfire

LangSmith

LangChain's first-party observability. Strong if you use LangChain / LangGraph.

Pick if: you use LangChain / LangGraph and want tight integration
Website: smith.langchain.com

Anthropic Console / OpenAI Dashboard

Provider dashboards show per-API-key usage and costs. Enough for small projects; not rich enough for prompt-level debugging.

Minimum viable observability

If you're starting from nothing, log one row per LLM call to a database you own:

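// After every LLM call. Assumes startMs = Date.now() was captured just before the request.
await db.insert(llmCalls).values({
  sessionId,
  model: response.model,
  inputTokens: response.usage.input_tokens,
  outputTokens: response.usage.output_tokens,
  cacheReadTokens: response.usage.cache_read_input_tokens ?? 0,
  cacheWriteTokens: response.usage.cache_creation_input_tokens ?? 0,
  latencyMs: Date.now() - startMs,
  toolUse: response.stop_reason === 'tool_use',
  // Success path only: on failure, insert a row with the error message so error rate is measurable.
  error: null,
  // Note: this cost covers regular input and output tokens only; cache reads and writes are not included.
  cost: calculateCost(response.model, response.usage.input_tokens, response.usage.output_tokens),
});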

That's it. Three queries give you: daily cost, slow calls, error rate. Add dashboards, alerts, and prompt capture as you grow.
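
The three queries can be sketched as plain functions over the logged rows. This is an in-memory illustration, not SQL for any particular database: the CallRow shape and its createdAt column are assumptions (the insert above does not show a timestamp, so a database default is presumed), and day bucketing assumes ISO 8601 timestamps in UTC.

```typescript
// Hypothetical row shape as read back from the llmCalls table.
interface CallRow {
  createdAt: string; // ISO 8601, UTC
  latencyMs: number;
  cost: number; // USD
  error: string | null;
}

// Query 1: total cost per calendar day (keyed by YYYY-MM-DD).
function dailyCost(rows: CallRow[]): Map<string, number> {
  const byDay = new Map<string, number>();
  for (const r of rows) {
    const day = r.createdAt.slice(0, 10);
    byDay.set(day, (byDay.get(day) ?? 0) + r.cost);
  }
  return byDay;
}

// Query 2: calls slower than a threshold.
function slowCalls(rows: CallRow[], thresholdMs: number): CallRow[] {
  return rows.filter((r) => r.latencyMs > thresholdMs);
}

// Query 3: fraction of calls that recorded an error (0 when there are no rows).
function errorRate(rows: CallRow[]): number {
  if (rows.length === 0) return 0;
  return rows.filter((r) => r.error !== null).length / rows.length;
}
```

In production the same logic would normally run as GROUP BY and WHERE clauses in the database rather than in application memory.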

Alerts worth setting

Daily cost exceeds budget: cut spending off before you wake up to a $500 surprise
Error rate > 5% over the last hour: either the upstream provider is having trouble or your code broke
p95 latency > 2× baseline: degradation, caused by either the provider or a prompt change
Cache hit rate drops: the content covered by your cache_control block changed, causing an unintended cost spike
Token usage per session > 2× baseline: context bloat or a runaway loop
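
The alert rules above could be evaluated with a small check like the following. It is a sketch under assumptions: the WindowStats and Baseline shapes are hypothetical, p95 uses the nearest-rank method, and because the list gives no number for a cache hit rate "drop", half of baseline is used as a placeholder threshold.

```typescript
// Hypothetical inputs: current measurements and the baseline to compare against.
interface WindowStats {
  dailyCostUsd: number;
  errorRateLastHour: number; // 0..1
  latenciesMs: number[];
  cacheHitRate: number; // 0..1
  tokensPerSession: number;
}

interface Baseline {
  dailyBudgetUsd: number;
  p95LatencyMs: number;
  cacheHitRate: number;
  tokensPerSession: number;
}

// Nearest-rank 95th percentile; returns 0 for an empty sample.
function p95(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1];
}

// Returns one message per rule that fired.
function checkAlerts(now: WindowStats, base: Baseline): string[] {
  const alerts: string[] = [];
  if (now.dailyCostUsd > base.dailyBudgetUsd) alerts.push("daily cost over budget");
  if (now.errorRateLastHour > 0.05) alerts.push("error rate above 5%");
  if (p95(now.latenciesMs) > 2 * base.p95LatencyMs) alerts.push("p95 latency above 2x baseline");
  // Assumption: "drop" means below half of baseline; tune this for your traffic.
  if (now.cacheHitRate < 0.5 * base.cacheHitRate) alerts.push("cache hit rate dropped");
  if (now.tokensPerSession > 2 * base.tokensPerSession) alerts.push("tokens per session above 2x baseline");
  return alerts;
}
```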

What observability doesn't replace

Evals: observability tells you what happened; evals tell you whether what happened was good. You need both.
Provider-side audit: Anthropic's Console shows your account-level usage, which should reconcile with your own logs. If they disagree, one of you is wrong.

Anti-patterns

Logging full prompts without PII scrubbing. Observability platforms become PII leakage vectors. Redact.
No session IDs. You can't reconstruct a trace without one.
Aggregating too early. Store raw per-call events; roll up in queries. Hourly buckets lose the detail you'll need later.
Observability that nobody looks at. Pick 3 dashboards, check them weekly. The rest is decoration.
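
To make the redaction point concrete, here is a deliberately simple scrubbing sketch to run before a prompt is logged. The redact function and both patterns are illustrative assumptions: they catch only email addresses and long digit runs (phone or card numbers), which is far from complete PII coverage.

```typescript
// Rough email matcher; good enough for scrubbing, not for validation.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Nine or more digits, optionally separated by single spaces or hyphens.
const LONG_DIGITS = /\b\d(?:[ -]?\d){8,}\b/g;

// Replaces matches with fixed placeholders so logs stay readable.
function redact(text: string): string {
  return text.replace(EMAIL, "[EMAIL]").replace(LONG_DIGITS, "[NUMBER]");
}

// Example with made-up data.
const scrubbed = redact("Contact jane@example.com or 555-123-4567");
```

Names, addresses, and free-text identifiers pass straight through patterns like these, so treat this as a floor, not a guarantee.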
