More parallel subagents bloated context, not speed
Adding a 7th subagent pushed orchestrator latency from 22s to 31s because assembling 8 agents' full outputs into a single prompt created a 6,400-token context that took 4+ seconds of CPU time to serialize before any LLM call fired.
The post demonstrates that in multi-agent fanout pipelines, context assembly before the LLM call — not the LLM itself — can become the dominant latency and cost driver, and that passing only compact summary structs rather than full subagent outputs resolves both problems simultaneously.
강해수 describes a production fanout orchestration pattern in an ad-creative analysis SaaS where N subagents run in parallel, each returning a full analysis, and an orchestrator LLM merges the results into a single verdict. The subagents themselves were not the problem — they completed in 9–12 seconds regardless of how many ran concurrently. The bottleneck emerged in the aggregation step: with 8 subagents each returning ~800 tokens, the orchestrator assembled a 6,400-token (52,480-byte) context before it could issue its first LLM call. On Cloudflare Workers, serializing 8 JSON blobs into a single prompt string consumed 4+ seconds of pure CPU time, logged as `context_assembly_backpressure`. Production data collected over 3 weeks showed aggregation's share of total wall-clock time growing from 18% at 2 subagents (14.2s total) to 61% at 8 subagents (31.1s total), with the inflection point at 6+ subagents where aggregation consumed more than half the latency.
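The reported shares can be converted into absolute seconds to show where the time went. The sketch below uses only the two data points quoted above. The `Measurement` type and `splitLatency` helper are illustrative names, and the derived seconds are computed here rather than reported by the original post.

```typescript
// Two data points quoted in the summary: total wall-clock time and the
// fraction of it spent in aggregation.
interface Measurement {
  subagents: number;
  totalSeconds: number;
  aggregationShare: number; // 0..1
}

const reported: Measurement[] = [
  { subagents: 2, totalSeconds: 14.2, aggregationShare: 0.18 },
  { subagents: 8, totalSeconds: 31.1, aggregationShare: 0.61 },
];

// Split a total into the aggregation part and everything else.
function splitLatency(m: Measurement): { aggregation: number; rest: number } {
  const aggregation = m.totalSeconds * m.aggregationShare;
  return { aggregation, rest: m.totalSeconds - aggregation };
}

for (const m of reported) {
  const { aggregation, rest } = splitLatency(m);
  console.log(
    `${m.subagents} subagents: aggregation ~${aggregation.toFixed(1)}s, ` +
      `everything else ~${rest.toFixed(1)}s`
  );
}
```

Under these numbers, aggregation grows from roughly 2.6s to roughly 19.0s, while everything else stays near 12s. That is in line with the claim that the subagents themselves did not slow down as more were added.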
The fix did not reduce parallelism. Instead, each subagent now writes its full output to R2 on completion, and the orchestrator reads only a three-field summary struct per agent — verdict, confidence, and top_signal. Eight agents still produce eight files, but the aggregation context shrank from ~6,400 tokens to ~1,100, and the monthly cost for that pipeline step dropped from $207 to $38. The post also references an R2 chunking pattern, a D1 counter approach for tracking partial completions without polling, and a KV-based loop guard for failed aggregation retries, with full details on riversealab.com.
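A minimal sketch of that split, assuming a generic object store in place of a real R2 binding. Only the three summary field names (`verdict`, `confidence`, `top_signal`) come from the post. The `ObjectStore` interface, the `details` field, the key layout, and the function names are hypothetical and chosen for illustration.

```typescript
// Hypothetical stand-in for an object store such as an R2 bucket binding.
interface ObjectStore {
  put(key: string, value: string): Promise<unknown>;
}

// Assumed shape of a subagent's full output; `details` stands for the long analysis.
interface FullAnalysis {
  agentId: string;
  verdict: string;
  confidence: number;
  top_signal: string;
  details: string;
}

// The three-field struct the orchestrator reads, as named in the post.
interface AgentSummary {
  verdict: string;
  confidence: number;
  top_signal: string;
}

// Keep only the three summary fields and drop everything else.
function toSummary(full: FullAnalysis): AgentSummary {
  return {
    verdict: full.verdict,
    confidence: full.confidence,
    top_signal: full.top_signal,
  };
}

// On completion, persist the full output and hand back only the summary.
async function finishSubagent(
  store: ObjectStore,
  runId: string,
  full: FullAnalysis
): Promise<AgentSummary> {
  await store.put(`runs/${runId}/${full.agentId}.json`, JSON.stringify(full));
  return toSummary(full);
}

// The aggregation prompt is built from summaries only, one line per agent.
function buildAggregationPrompt(summaries: AgentSummary[]): string {
  const lines = summaries.map((s, i) => `agent ${i + 1}: ${JSON.stringify(s)}`);
  return ["Merge these subagent summaries into one verdict:", ...lines].join("\n");
}
```

The design point is that the prompt size now depends on the number of agents times a small fixed struct, not on how long each analysis is. The full outputs remain retrievable by key if the final verdict needs to be audited.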
Key facts
- 01 Adding a 7th subagent pushed total orchestrator latency from 22s to 31s
- 02 Individual subagents finished in 9–12 seconds regardless of how many ran in parallel
- 03 With 8 subagents each returning ~800 tokens, the aggregation context reached ~6,400 tokens (52,480 bytes)
- 04 Serializing 8 JSON blobs on Cloudflare Workers took 4+ seconds of CPU time before any LLM call fired
- 05 At 8 subagents, aggregation consumed 61% of total wall-clock time (31.1s total)
- 06 Switching to a three-field summary struct per agent dropped aggregation context from ~6,400 to ~1,100 tokens
- 07 Monthly cost for the pipeline step fell from $207 to $38 after the fix
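The figures in the list above can be cross-checked with a little arithmetic. The ratios below are derived here from the quoted numbers and are not reported by the original article.

```latex
\begin{align*}
8 \times 800 &= 6400 \ \text{tokens (context before the fix)} \\
1 - \frac{1100}{6400} &\approx 0.83 \ \text{(reduction in aggregation context)} \\
1 - \frac{38}{207} &\approx 0.82 \ \text{(reduction in monthly cost)}
\end{align*}
```

The two reductions are nearly the same size, which is consistent with the cost of this step scaling with context size, although the article does not break the cost down further.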
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 17, 2026 · 10:39 UTC.