Gemini 2.5 Flash returns truncated responses due to reasoning token budget
Gemini 2.5 Flash and Pro allocate most of the max_output_tokens budget to internal reasoning (tracked as thoughtsTokenCount), leaving little for actual output, causing responses to be cut short mid-sentence.
Developers using Gemini 2.5 Flash or Pro must explicitly control the thinking budget to avoid silent truncation; without this knowledge, production endpoints will return incomplete responses with no error message, breaking downstream applications.
Gemini 2.5 Flash and Pro implement internal reasoning similar to OpenAI's o-series models, and their thinking tokens count against the max_output_tokens budget. When you set max_output_tokens to 1000, the model can spend most of that budget on internal reasoning (reported as thoughtsTokenCount in the API response), leaving only a small portion for the visible text (candidatesTokenCount). By default the models use dynamic thinking, and on non-trivial tasks reasoning consumes 90–98% of the budget. The response then ends with finish_reason: "MAX_TOKENS" and very little visible output, and no error is raised.
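One way to catch this in production is to inspect the usage metadata on every response. The sketch below assumes the parsed JSON body of a REST `generateContent` call, with `candidates[0].finishReason`, `usageMetadata.thoughtsTokenCount`, and `usageMetadata.candidatesTokenCount`. The function name `diagnose_truncation` and the sample payload are hypothetical and only illustrate the check.

```python
from typing import Any


def diagnose_truncation(response: dict[str, Any]) -> dict[str, Any]:
    """Summarize how a Gemini response spent its output budget.

    Assumes `response` is the parsed JSON body of a generateContent call.
    Missing fields are treated as zero or "UNKNOWN".
    """
    usage = response.get("usageMetadata", {})
    thoughts = usage.get("thoughtsTokenCount", 0)
    visible = usage.get("candidatesTokenCount", 0)

    # Fall back to an empty candidate if the list is missing or empty.
    candidates = response.get("candidates") or [{}]
    finish_reason = candidates[0].get("finishReason", "UNKNOWN")

    spent = thoughts + visible
    thinking_share = thoughts / spent if spent else 0.0
    return {
        "finish_reason": finish_reason,
        "thoughts_tokens": thoughts,
        "visible_tokens": visible,
        "thinking_share": thinking_share,
        "truncated": finish_reason == "MAX_TOKENS",
    }


# Hypothetical payload: a 1000-token budget mostly spent on reasoning.
sample = {
    "candidates": [{"finishReason": "MAX_TOKENS"}],
    "usageMetadata": {"thoughtsTokenCount": 960, "candidatesTokenCount": 40},
}
report = diagnose_truncation(sample)
print(report)
```

Logging `thinking_share` alongside `truncated` makes it easier to tell a response that ran out of room because of reasoning from one that was simply long.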
The article provides three strategies with different tradeoffs. First, disable thinking entirely using `thinking_budget: 0` for Flash (Pro requires a minimum of 128). Second, for most production workloads, cap thinking with a fixed budget such as `thinking_budget: 1024` and set max_output_tokens higher (e.g., 8192). Third, for research and complex analysis, allow dynamic thinking with max_output_tokens set to 8K or higher, accepting higher latency and cost.
The fix also varies by SDK: use `thinking_config=types.ThinkingConfig(thinking_budget=0)` in the Google GenAI SDK, or `reasoning_effort="none"` when accessing Gemini through the OpenAI-compatible endpoint. The reasoning_effort levels map to thinking_budget values as follows: "none" → 0 (Flash) or 128 (Pro), "low" → 1024, "medium" → 8192, "high" → 24576.
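The budgets listed above can be collected into a small helper. This is a minimal sketch, not the article's code: `budget_for_effort` and `build_request_body` are hypothetical names, the request body assumes the REST field names `generationConfig.maxOutputTokens` and `generationConfig.thinkingConfig.thinkingBudget`, and detecting Pro by a substring match on the model name is a simplification.

```python
from typing import Any

# reasoning_effort -> thinking_budget, as listed in the text above.
EFFORT_TO_BUDGET = {"low": 1024, "medium": 8192, "high": 24576}
PRO_MIN_BUDGET = 128  # Pro cannot disable thinking entirely.


def budget_for_effort(effort: str, model: str) -> int:
    """Translate a reasoning_effort level into a thinking budget."""
    if effort == "none":
        # Assumption: Pro models are identified by "pro" in the model name.
        return PRO_MIN_BUDGET if "pro" in model.lower() else 0
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    return EFFORT_TO_BUDGET[effort]


def build_request_body(
    prompt: str, model: str, effort: str, max_output_tokens: int
) -> dict[str, Any]:
    """Build a generateContent request body with an explicit thinking budget."""
    budget = budget_for_effort(effort, model)
    # Thinking counts against max_output_tokens, so the output limit must
    # exceed the thinking budget or nothing is left for visible text.
    if max_output_tokens <= budget:
        raise ValueError("max_output_tokens leaves no room for visible output")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "maxOutputTokens": max_output_tokens,
            "thinkingConfig": {"thinkingBudget": budget},
        },
    }


# Capped thinking for a typical production call.
body = build_request_body("Summarize this log.", "gemini-2.5-flash", "low", 8192)
print(body["generationConfig"])
```

The guard in `build_request_body` rejects combinations such as "high" (24576) with an 8192-token output limit, which is the configuration most likely to produce an empty or truncated answer.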
Key facts
- Gemini 2.5 Flash and Pro count internal reasoning tokens (thoughtsTokenCount) against the max_output_tokens budget
- By default, these models use dynamic thinking and allocate 90–98% of the token budget to reasoning, leaving minimal tokens for visible output
- Disable thinking with thinking_budget: 0 for Flash (Pro requires a minimum of 128), or use reasoning_effort: "none" via the OpenAI-compatible endpoint
- The bug manifests differently across SDKs: LangChain silently truncates output, and the fix depends on whether you use the Google GenAI SDK or the OpenAI-compatible endpoint
- Three strategies exist with tradeoffs: disable thinking (fast, low cost, lower quality), cap thinking (medium latency and cost, good quality), or allow dynamic thinking (slowest, highest cost, best quality)