Latency Optimization

Last verified: 2026-04-17 · next review in 118 days

For realtime / interactive agent UX, latency is as important as cost. A 30-second response kills the feedback loop even if the answer is perfect. This page covers the levers that actually move the needle.

Where latency comes from

A typical agent turn:

Client → Network (10-50ms)
       → Provider queueing (<100ms usually)
       → Prefill (N input tokens × prefill time per token)
       → Generation (K output tokens × decode time per token)
       → Tool-use loop (repeats the above once per tool round-trip)
       → Network back

Two biggest levers: shorten the prefill (less input) and don't wait for full responses (stream). Everything else is optimization on the margins.
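
The breakdown above can be turned into a back-of-envelope estimate. This is a rough sketch only: the field names are made up for illustration, every number in the example is an assumption rather than a measurement, and it treats each call in the tool-use loop as costing about the same.

```typescript
// Rough per-turn latency model mirroring the stages listed above.
// All field names and sample values are illustrative assumptions.
interface TurnProfile {
  networkMs: number;         // one-way network time, client <-> provider
  queueMs: number;           // provider-side queueing
  inputTokens: number;       // tokens to prefill per model call
  outputTokens: number;      // tokens generated per model call
  prefillMsPerToken: number;
  decodeMsPerToken: number;
  modelCalls: number;        // 1 + number of tool-use round-trips
}

function estimateTurnMs(p: TurnProfile): number {
  const perCall =
    2 * p.networkMs + // request out, response back
    p.queueMs +
    p.inputTokens * p.prefillMsPerToken +
    p.outputTokens * p.decodeMsPerToken;
  return perCall * p.modelCalls;
}

// Hypothetical profile: large context, short answers, two tool round-trips.
const example: TurnProfile = {
  networkMs: 30,
  queueMs: 50,
  inputTokens: 50_000,
  outputTokens: 200,
  prefillMsPerToken: 0.2,
  decodeMsPerToken: 20,
  modelCalls: 3,
};

console.log('full context:', estimateTurnMs(example));
console.log('trimmed context:', estimateTurnMs({ ...example, inputTokens: 10_000 }));
```

With these made-up numbers prefill dominates each call, so trimming input moves the total more than any other field. Plug in your own measurements before drawing conclusions.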

1. Streaming (biggest win for UX)

Stream tokens as they generate instead of waiting for the full response. The first token arrives in ~200-500ms on a warm cache; users perceive the agent as fast even if total generation takes 20 seconds.

Anthropic SDK — anthropic.messages.stream() returns an async iterable
OpenAI SDK — stream: true in the params
Vercel AI SDK — streamText() returns a result object whose textStream is a ReadableStream (and async iterable); wire it straight to a React component

Streaming matters most for realtime UX. For batch pipelines (news aggregation, eval runs), streaming buys you nothing — use the non-streaming sync API or the batch API.
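
To check what streaming buys you, measure time-to-first-token separately from total time. The sketch below is SDK-agnostic: it consumes any AsyncIterable of text chunks. The fakeModelStream generator is a hypothetical stand-in for whatever stream your SDK returns, not a real API.

```typescript
// Consume a token stream, forwarding chunks as they arrive and recording
// time-to-first-token (TTFT) separately from total time.
interface StreamTiming {
  text: string;
  firstTokenMs: number | null; // null if the stream produced nothing
  totalMs: number;
}

async function consumeStream(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<StreamTiming> {
  const start = Date.now();
  let firstTokenMs: number | null = null;
  let text = '';
  for await (const chunk of stream) {
    if (firstTokenMs === null) firstTokenMs = Date.now() - start;
    text += chunk;
    onChunk(chunk); // render immediately instead of buffering
  }
  return { text, firstTokenMs, totalMs: Date.now() - start };
}

// Hypothetical stream: emits one chunk every `delayMs` milliseconds.
async function* fakeModelStream(
  chunks: string[],
  delayMs: number,
): AsyncGenerator<string> {
  for (const chunk of chunks) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    yield chunk;
  }
}
```

In a real app, onChunk would append to UI state, and firstTokenMs is the number users actually feel.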

2. Parallel tool use

Most providers support parallel function calling: the model emits multiple tool calls in a single response, you execute them concurrently, and you return all results in one message.

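// Model asks for 3 file reads in one turn
const toolCalls = response.content.filter((b) => b.type === 'tool_use');

// Run all 3 in parallel; each result is a tool_result block keyed by the call's id
const results = await Promise.all(
  toolCalls.map(async (tc) => ({
    type: 'tool_result',
    tool_use_id: tc.id,
    content: await runTool(tc.name, tc.input),
  })),
);

// Return all results to the model in the next message
// (assumes `messages` is your running conversation array)
messages.push({ role: 'user', content: results });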

Without parallel tool use, a task requiring 5 file reads takes 5 serial round-trips. With it, 1 round-trip.
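
One caveat with a bare Promise.all: a single failing tool rejects the whole batch. A minimal sketch of per-call error isolation follows. The ToolCall and ToolRunner types are hypothetical, and the tool_result shape with is_error follows Anthropic's Messages API as I understand it; check your provider's schema.

```typescript
// Execute tool calls concurrently. A failing tool yields an error result
// instead of rejecting the whole batch.
interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

interface ToolResult {
  type: 'tool_result';
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Hypothetical signature for your own tool dispatcher.
type ToolRunner = (name: string, input: unknown) => Promise<string>;

async function runToolsInParallel(
  calls: ToolCall[],
  runTool: ToolRunner,
): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (tc): Promise<ToolResult> => {
      try {
        const content = await runTool(tc.name, tc.input);
        return { type: 'tool_result', tool_use_id: tc.id, content };
      } catch (err) {
        // Report the failure to the model rather than aborting the turn.
        const message = err instanceof Error ? err.message : String(err);
        return { type: 'tool_result', tool_use_id: tc.id, content: message, is_error: true };
      }
    }),
  );
}
```

Results come back in the same order as the calls, so the model can still match each one by tool_use_id.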

3. Prompt caching

Cached tokens cost less and prefill faster. Anthropic bills prompt-cache reads at ~0.1× the base input price, and a hit skips the prefill compute for the cached prefix.

system: [
  {
    type: 'text',
    text: 'Long system prompt + taxonomy + few-shot examples...',
    cache_control: { type: 'ephemeral' },
  },
],

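On a 10k-token system prompt with 200 articles processed: the first call pays full prefill and the next 199 skip it, provided each arrives before the cache entry expires (5 minutes by default, refreshed on every hit). This can drop per-call latency by hundreds of milliseconds.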
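
The cost side of the same example can be worked through as arithmetic. This sketch assumes every call after the first hits the cache. It also assumes a 1.25× surcharge on the call that writes the cache, which matches Anthropic's published 5-minute cache pricing as I understand it; verify both multipliers against current pricing.

```typescript
// Billed input-token equivalents for a shared prefix across many calls.
// Multipliers are assumptions; confirm them against your provider's pricing.
function prefixTokenCost(prefixTokens: number, calls: number, cached: boolean): number {
  if (calls <= 0) return 0;
  if (!cached) return prefixTokens * calls;
  const writeMultiplier = 1.25; // first call writes the cache
  const readMultiplier = 0.1;   // later calls read from it
  return prefixTokens * writeMultiplier + prefixTokens * readMultiplier * (calls - 1);
}

const uncached = prefixTokenCost(10_000, 200, false);
const cachedCost = prefixTokenCost(10_000, 200, true);
console.log({ uncached, cachedCost, ratio: cachedCost / uncached });
```

Under these assumptions the 200-article run bills roughly 211,500 token-equivalents for the prefix instead of 2,000,000, a little over a tenth of the uncached cost.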

4. Shorten the prefill

The single biggest latency lever you control: every input token adds prefill time. Hard limits:

Truncate article content. Agentic Universe passes 8000 chars, not the full article. Above ~8k chars you're usually just feeding the model boilerplate, navigation, and footer.
Drop file sections. Use offset + limit on Read instead of full files.
Don't restate prior turns. The context already has them.
Prune CLAUDE.md. Every line rides along in every turn.

See Context Management.
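
A minimal sketch of the truncation rule. The 8000-character default mirrors the limit mentioned above; the function name and the word-boundary heuristic are assumptions for illustration, not how Agentic Universe implements it.

```typescript
// Cap article text before it goes into the prompt.
function truncateForPrompt(text: string, maxChars = 8000): string {
  if (text.length <= maxChars) return text;
  const cut = text.slice(0, maxChars);
  const lastSpace = cut.lastIndexOf(' ');
  // Prefer ending on a word boundary, unless that would discard more than 20%.
  return lastSpace > maxChars * 0.8 ? cut.slice(0, lastSpace) : cut;
}
```

This counts UTF-16 code units, so a cut with no nearby space could split an emoji or other surrogate pair. For prompt input that is usually harmless, but strip the trailing unit if it matters.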

5. Model selection

Smaller models are faster. On the Anthropic lineup:

Haiku 4.5 — fastest; classification, routing, high-volume tasks
Sonnet 4.6 — fast enough for interactive; default workhorse
Opus 4.7 — moderate; only for hardest tasks

Route cheap tasks to Haiku. Only send to Opus when the task actually benefits.
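
One way to make the routing explicit is a lookup table keyed by task type. The task taxonomy here is hypothetical, and the tier names are labels rather than API model IDs; map them to whatever model identifiers your account exposes.

```typescript
type Tier = 'haiku' | 'sonnet' | 'opus';
type TaskKind = 'classify' | 'route' | 'summarize' | 'code_edit' | 'deep_reasoning';

// Hypothetical task-to-tier table; adjust to your own workload.
const TIER_BY_TASK: Record<TaskKind, Tier> = {
  classify: 'haiku',
  route: 'haiku',
  summarize: 'sonnet',
  code_edit: 'sonnet',
  deep_reasoning: 'opus',
};

function pickTier(task: TaskKind): Tier {
  return TIER_BY_TASK[task];
}
```

Keeping the table in one place makes it easy to audit which tasks reach the slowest tier.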

6. Region selection

Pick the API endpoint closest to your compute. Most providers offer multiple regions:

Anthropic: US, EU (via Bedrock / Vertex)
OpenAI: global with regional compliance variants
Google Vertex: region-pinned endpoints

A CI runner in us-east-1 (for example, a self-hosted GitHub Actions runner) hitting an Anthropic endpoint in us-east-1 shaves ~30-50ms per round-trip vs cross-Atlantic.

7. Concurrency (not a latency fix, but feels like one)

If your workflow is many independent LLM calls, run them concurrently with Promise.allSettled:

const results = await Promise.allSettled(articles.map((a) => processArticle(a)));

Provider rate limits cap concurrency — check your account's TPM / RPM. Start with 10 concurrent and raise as you gain headroom.
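
Note that mapping every item straight into Promise.allSettled starts all calls at once. A small sketch of a concurrency cap follows; mapWithConcurrency is a hypothetical helper written for this page, not a library function.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep input order and use the same shape as Promise.allSettled.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  const results: PromiseSettledResult<R>[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // no await between check and increment, so no races
      try {
        results[i] = { status: 'fulfilled', value: await worker(items[i], i) };
      } catch (reason) {
        results[i] = { status: 'rejected', reason };
      }
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}
```

Usage would look like `await mapWithConcurrency(articles, 10, (a) => processArticle(a))`, raising the limit as your TPM / RPM headroom allows.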

8. Batch API (not latency, but sometimes the right answer)

For non-interactive workloads you don't need low latency; you need high throughput at low cost. The Batch API gives both: 50% cheaper than sync, with typical completion in 5-30 min for small batches (providers typically commit only to a 24-hour window). Interactive agents: sync. Pipelines: batch.

What doesn't help much

"Make the prompt nicer" — marginal
Switching from HTTP to WebSockets — marginal unless you're doing many short turns
Caching response-level outputs (app-level cache) — helps for repeat queries, not novel ones

Anti-patterns

Waiting for full response in a UI — always stream.
Serial tool calls when the model asked for parallel ones — always Promise.all the tool results.
Opus everywhere — you pay for capability you don't need and pay in latency you don't want.
Unbounded Read calls — huge files dwarf everything else. Use offset/limit.
