Jun 3, 2026 · 1 min read · Research Papers

Agentic Monte Carlo brings RL-style optimization to black-box LLMs

Researchers propose Agentic Monte Carlo (AMC), a method that uses Sequential Monte Carlo sampling and a learned value function to apply reinforcement-learning-style optimization to black-box LLM agents without modifying their weights.

arXiv · Dae Yon Hwang, Raunaq Suri, Valentin Villecroze

Composite score: 6.5 out of 10

Weights applied: Novelty 25% · Impact 43% · Credibility 12% · Depth 20%

Why it matters

AMC demonstrates that principled RL-style optimization of black-box LLM agents is feasible at test time, opening a path to improving proprietary API-only agents without requiring access to model weights.


Summary: our read of the original

Dae Yon Hwang, Raunaq Suri, and Valentin Villecroze identify a fundamental gap in LLM agent optimization: while open-weight models can be improved through reinforcement learning, black-box agents backed by proprietary LLMs are accessible only via API, making parameter-level optimization impossible. To bridge this gap, the paper proposes Agentic Monte Carlo (AMC), which reframes the problem using a known equivalence between RL and Bayesian inference. Rather than training the agent, AMC treats the optimal policy as a posterior distribution over trajectories, with the fixed black-box LLM agent serving as the prior. Sequential Monte Carlo (SMC) is then used to sample from this posterior, with a learned value function providing the signal needed to steer trajectory generation at test time.
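The posterior view described above can be written out using the standard RL-as-inference form. This is a sketch under assumed notation, and the paper's own formulation may differ: $p_0$ is the trajectory distribution of the fixed black-box agent (the prior), $R(\tau)$ is the return of trajectory $\tau$, and $\beta > 0$ is a temperature. None of these symbols appear in the summary; they are introduced here for illustration.

```latex
p^{*}(\tau)
  \;=\; \frac{p_0(\tau)\,\exp\!\big(R(\tau)/\beta\big)}
             {\sum_{\tau'} p_0(\tau')\,\exp\!\big(R(\tau')/\beta\big)}
  \;\propto\; p_0(\tau)\,\exp\!\big(R(\tau)/\beta\big),
\qquad
p^{*} \;=\; \arg\max_{p}\;\Big\{ \mathbb{E}_{\tau \sim p}\big[R(\tau)\big]
  \;-\; \beta\,\mathrm{KL}\big(p \,\|\, p_0\big) \Big\}
```

Under this reading, sampling from $p^{*}$ reweights trajectories the agent would already produce rather than changing the agent itself, which is why no access to model weights is needed.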

The method is validated on three diverse environments drawn from the AgentGym benchmark. AMC demonstrates significant improvements over prompting baselines and, notably, outperforms Group Relative Policy Optimization (GRPO) as the test-time compute budget for AMC is scaled up. The authors argue that AMC establishes the feasibility of principled RL-style optimization for black-box LLM agents. Code is publicly available on GitHub.

Key facts

01 AMC is proposed by Dae Yon Hwang, Raunaq Suri, and Valentin Villecroze.
02 Black-box LLM agents are accessible only via API, precluding parameter-level RL optimization.
03 AMC exploits a known equivalence between RL and Bayesian inference to frame the optimal policy as a posterior over trajectories.
04 The fixed black-box LLM agent serves as the prior; Sequential Monte Carlo samples from the posterior.
05 A learned value function steers trajectory generation without modifying the underlying model.
06 AMC is validated on three environments from the AgentGym benchmark.
07 AMC outperforms prompting baselines and surpasses GRPO as test-time compute scales.
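The steering idea in the key facts above (a fixed agent as the prior, with a value function guiding resampling) can be illustrated with a toy sketch. This is not the authors' implementation. `prior_step` is a random walk standing in for one black-box agent step, `value_fn` is a hand-written heuristic standing in for a learned value function, and `beta` is an assumed temperature; all names are hypothetical. The sampler is a generic SMC loop with multinomial resampling at every step.

```python
import math
import random
from typing import Callable, List

State = int


def prior_step(state: State, rng: random.Random) -> State:
    """Hypothetical stand-in for one black-box agent step: an unbiased random walk."""
    return state + rng.choice([-1, 1])


def value_fn(state: State) -> float:
    """Hypothetical stand-in for a learned value function: higher states score higher."""
    return float(state)


def smc_steer(
    n_particles: int,
    horizon: int,
    beta: float,
    rng: random.Random,
    propose: Callable[[State, random.Random], State] = prior_step,
    value: Callable[[State], float] = value_fn,
) -> List[State]:
    """Steer a fixed proposal with value-based weights; the proposal is never modified."""
    particles: List[State] = [0] * n_particles
    for _ in range(horizon):
        # 1. Propose: every particle takes one step under the unchanged prior.
        proposed = [propose(s, rng) for s in particles]
        # 2. Weight: log-weight is the value improvement scaled by the temperature.
        log_w = [(value(new) - value(old)) / beta
                 for new, old in zip(proposed, particles)]
        max_lw = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - max_lw) for lw in log_w]
        # 3. Resample: multinomial resampling in proportion to the weights.
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return particles


if __name__ == "__main__":
    rng = random.Random(0)
    steered = smc_steer(n_particles=200, horizon=20, beta=0.5, rng=rng)
    # An infinite temperature makes all weights equal, recovering the prior.
    unsteered = smc_steer(n_particles=200, horizon=20, beta=float("inf"), rng=rng)
    print("steered mean:", sum(steered) / len(steered))
    print("unsteered mean:", sum(unsteered) / len(unsteered))
```

In this toy setup, raising `n_particles` is the analogue of spending more test-time compute: more candidate continuations are drawn from the same fixed proposal before resampling keeps the promising ones.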

Topics

#agent-framework #reinforcement-learning #black-box-agents #monte-carlo #optimization

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 9, 2026 · 17:05 UTC.
