For realtime / interactive agent UX, latency is as important as cost. A 30-second response kills the feedback loop even if the answer is perfect. This page covers the levers that actually move the needle.
A typical agent turn:
Client → Network (10-50ms)
→ Provider queueing (<100ms usually)
→ Prefill (input tokens, ~N×time-per-token)
→ Generation (output tokens, K×time-per-token)
→ Tool-use loop (N× the above)
→ Network backTwo biggest levers: shorten the prefill (less input) and don't wait for full responses (stream). Everything else is optimization on the margins.
Stream tokens as they generate instead of waiting for the full response. The first token arrives in ~200-500ms on a warm cache; users perceive the agent as fast even if total generation takes 20 seconds.
anthropic.messages.stream() returns an async iterablestream: true in the paramsstreamText() returns a ReadableStream; wire straight to a React componentStreaming matters most for realtime UX. For batch pipelines (news aggregation, eval runs), streaming buys you nothing — use the non-streaming sync API or the batch API.
Most providers support parallel function calling — the model emits multiple tool calls in a single response, you execute them concurrently, return all results in one message.
// Model asks for 3 file reads in one turn
const toolCalls = response.content.filter((b) => b.type === 'tool_use');
// Run all 3 in parallel
const results = await Promise.all(
toolCalls.map(async (tc) => ({
tool_use_id: tc.id,
content: await runTool(tc.name, tc.input),
})),
);
// Return all results to the model in the next messageWithout parallel tool use, a task requiring 5 file reads takes 5 serial round-trips. With it, 1 round-trip.
Cached tokens cost less and prefill faster. Anthropic's prompt cache hits at ~0.1× input price and skips the prefill compute.
system: [
{
type: 'text',
text: 'Long system prompt + taxonomy + few-shot examples...',
cache_control: { type: 'ephemeral' },
},
],On a 10k-token system prompt with 200 articles processed: first call pays full prefill, the next 199 skip it. Can drop per-call latency by hundreds of milliseconds.
The single biggest latency lever you control. Every token in → time. Hard limits:
offset + limit on Read instead of full files.See Context Management.
Smaller models are faster. On the Anthropic lineup:
Route cheap tasks to Haiku. Only send to Opus when the task actually benefits.
Pick the API endpoint closest to your compute. Most providers offer multiple regions:
A GitHub Actions runner in us-east-1 hitting an Anthropic endpoint in us-east-1 shaves ~30-50ms per round-trip vs cross-Atlantic.
If your workflow is many independent LLM calls, run them concurrently with Promise.allSettled:
const results = await Promise.allSettled(articles.map((a) => processArticle(a)));Provider rate limits cap concurrency — check your account's TPM / RPM. Start with 10 concurrent and raise as you gain headroom.
For non-interactive workloads you don't need low latency — you need high throughput at low cost. Batch API gives both. 50% cheaper than sync, typical completion in 5-30 min for small batches. Interactive agents: sync. Pipelines: batch.
Promise.all the tool results.Read calls — huge files dwarf everything else. Use offset/limit.Search for a command to run...