Apr 19, 2026 · 1 min read · Tutorials & How-To

Gemini 2.5 Flash returns truncated responses due to reasoning token budget

Gemini 2.5 Flash and Pro allocate most of the max_output_tokens budget to internal reasoning (tracked as thoughtsTokenCount), leaving little for actual output, causing responses to be cut short mid-sentence.

Dev.to #ai · Devansh


Composite score: 5.2 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

Developers using Gemini 2.5 Flash or Pro must explicitly control the thinking budget to avoid silent truncation; without this knowledge, production endpoints will return incomplete responses with no error message, breaking downstream applications.


Summary: our read of the original

Gemini 2.5 Flash and Pro reason internally before answering, much like OpenAI's o-series models, but the article highlights one critical difference: thinking tokens count against the max_output_tokens budget. If you set max_output_tokens to 1000, the model spends most of that budget on internal reasoning (reported as thoughtsTokenCount in the API response) and leaves only a small remainder for the visible text (candidatesTokenCount). On non-trivial tasks the model defaults to dynamic thinking and consumes 90–98% of the budget on reasoning, so the response stops with finish_reason: "MAX_TOKENS" after very little visible output.
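One way to confirm this failure mode is to compare the token counts the response reports. The sketch below is a minimal illustration, not the article's code: it assumes the response has already been parsed into a dict shaped like the REST JSON (`usageMetadata`, `candidates[0].finishReason`), the helper name `diagnose_truncation` is hypothetical, and the sample numbers are made up to match the 90–98% range described above.

```python
def diagnose_truncation(response: dict, max_output_tokens: int) -> dict:
    """Summarize how a response's output budget was split between thinking and text."""
    usage = response.get("usageMetadata", {})
    thoughts = usage.get("thoughtsTokenCount", 0)   # internal reasoning tokens
    visible = usage.get("candidatesTokenCount", 0)  # tokens in the visible answer
    candidates = response.get("candidates") or [{}]
    finish_reason = candidates[0].get("finishReason", "")

    thinking_share = thoughts / max_output_tokens if max_output_tokens else 0.0
    return {
        "thoughts_tokens": thoughts,
        "visible_tokens": visible,
        "thinking_share": round(thinking_share, 3),
        # Heuristic (an assumption): the response hit the limit and reasoning
        # used more tokens than the visible text did.
        "truncated_by_thinking": finish_reason == "MAX_TOKENS" and thoughts > visible,
    }


# Illustrative, made-up response for a request with max_output_tokens=1000.
sample = {
    "candidates": [{"finishReason": "MAX_TOKENS"}],
    "usageMetadata": {"thoughtsTokenCount": 960, "candidatesTokenCount": 40},
}
report = diagnose_truncation(sample, max_output_tokens=1000)
print(report)
```

Logging a report like this next to each response makes the truncation visible even though the API returns no error.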


The article offers three strategies with different tradeoffs. First, disable thinking entirely with `thinking_budget: 0` on Flash (Pro requires a minimum of 128). Second, cap thinking at a fixed budget such as `thinking_budget: 1024` and raise max_output_tokens (e.g., to 8192), which suits most production workloads. Third, allow dynamic thinking with max_output_tokens set to 8K or higher for research and complex analysis, accepting higher latency and cost. The fix also depends on the SDK: use `thinking_config=types.ThinkingConfig(thinking_budget=0)` in the Google GenAI SDK, or `reasoning_effort="none"` when calling Gemini through the OpenAI-compatible endpoint. The reasoning_effort levels map to thinking_budget values as follows: "none" → 0 (Flash) or 128 (Pro), "low" → 1024, "medium" → 8192, "high" → 24576.
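The reasoning_effort mapping can be written as a small lookup, which is handy when the same code has to target both the OpenAI-compatible endpoint and the native SDK. This is a sketch built only from the values quoted above; the function name `thinking_budget_for` and the choice to detect Pro by checking for "pro" in the model name are assumptions made for illustration.

```python
# Budgets quoted in the summary for each reasoning_effort level.
EFFORT_TO_BUDGET = {"low": 1024, "medium": 8192, "high": 24576}
PRO_MIN_BUDGET = 128  # Pro cannot go below this, per the summary


def thinking_budget_for(effort: str, model: str) -> int:
    """Return the thinking_budget that corresponds to a reasoning_effort level."""
    if effort == "none":
        # Assumption: Pro models are identified by "pro" in the model name.
        return PRO_MIN_BUDGET if "pro" in model.lower() else 0
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    return EFFORT_TO_BUDGET[effort]


print(thinking_budget_for("none", "gemini-2.5-flash"))
print(thinking_budget_for("none", "gemini-2.5-pro"))
print(thinking_budget_for("medium", "gemini-2.5-flash"))
```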

Key facts

01 Gemini 2.5 Flash and Pro count internal reasoning tokens (thoughtsTokenCount) against the max_output_tokens budget, unlike OpenAI's o-series models
02 By default, these models use dynamic thinking and allocate 90–98% of the token budget to reasoning, leaving minimal tokens for visible output
03 Disable thinking with thinking_budget: 0 for Flash (Pro requires minimum 128) or use reasoning_effort: "none" via the OpenAI-compatible endpoint
04 The bug manifests differently across SDKs: LangChain silently truncates output, and the fix depends on whether you use the Google GenAI SDK or OpenAI-compatible endpoint
05 Three strategies exist with tradeoffs: disable thinking (fast, low cost, lower quality), cap thinking (medium latency/cost, good quality), or allow dynamic thinking (slowest, highest cost, best quality)
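The three strategies can be sketched as request settings. The example below is hypothetical rather than taken from the article: it builds a plain dict using the camelCase field names of the REST API's generationConfig (`maxOutputTokens`, `thinkingConfig.thinkingBudget`) as an assumption, leaves thinking unset for the dynamic case because dynamic thinking is described as the default, and treats the budget as a strict cap when estimating the tokens left for visible output.

```python
from typing import Optional


def build_generation_config(strategy: str,
                            max_output_tokens: int = 8192,
                            thinking_budget: int = 1024) -> dict:
    """Build generation settings for one of the three strategies."""
    config = {"maxOutputTokens": max_output_tokens}
    if strategy == "disabled":
        # Flash only; the summary says Pro needs at least 128.
        config["thinkingConfig"] = {"thinkingBudget": 0}
    elif strategy == "capped":
        if thinking_budget >= max_output_tokens:
            raise ValueError("thinking_budget must leave room for visible output")
        config["thinkingConfig"] = {"thinkingBudget": thinking_budget}
    elif strategy == "dynamic":
        pass  # rely on the default dynamic thinking; only the ceiling is raised
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return config


def visible_headroom(config: dict) -> Optional[int]:
    """Tokens left for visible text if the budget is a strict cap; None when dynamic."""
    budget = config.get("thinkingConfig", {}).get("thinkingBudget")
    if budget is None:
        return None
    return config["maxOutputTokens"] - budget


capped = build_generation_config("capped")
print(capped, visible_headroom(capped))
```

With the values from the summary (a 1024-token cap inside an 8192-token limit), roughly 7168 tokens remain for the answer under that strict-cap assumption, whereas the dynamic case gives no such guarantee.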

Topics

#gemini #reasoning-models #token-management #api-debugging #model-behavior

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Apr 20, 2026 · 00:31 UTC.
