StepPO proposes step-level MDP for agentic RL training
Researchers Daoyu Wang, Qingchuan Li, and Mingyue Cheng argue that LLM agent training should shift from token-level to step-level Markov Decision Processes, introducing StepPO as a framework for step-aligned policy optimization in agentic reinforcement learning.
Teams training LLM agents with RL should evaluate whether token-level optimization is the right granularity: StepPO's step-level MDP framing and step-level credit assignment offer a concrete alternative designed for multi-turn tool-use and decision-making tasks.
Daoyu Wang, Qingchuan Li, and Mingyue Cheng present StepPO, a position paper arguing that the dominant token-centric modeling paradigm in LLM reinforcement learning is fundamentally misaligned with how agents actually behave in multi-turn interactive environments. While methods like RLHF and RLVR optimize at the token level for single-turn alignment or reasoning tasks, agentic settings — such as those powering systems like OpenClaw and Claude Code — require optimizing core capabilities like decision making and tool use across extended, variable-length interactions with delayed and sparse reward signals.
The paper's central contribution is a proposed shift from the conventional token-level Markov Decision Process (MDP) to a step-level MDP formulation, where the "step" is treated as the proper action representation for LLM agents. Paired with this formulation is step-level credit assignment, which aligns policy optimization and reward propagation with the granularity of agent decisions rather than individual tokens. The authors also discuss the key systems design considerations needed to realize step-level agentic RL in practice, and provide preliminary experimental results as initial evidence for the framework's effectiveness. The paper positions StepPO as a conceptual lens intended to help the agentic RL community better understand and improve LLM agent behavior.
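To make the difference in granularity concrete, the following is a minimal sketch of step-level credit assignment. It is not the paper's implementation. The `Step` structure, the discount factor `gamma`, and both function names are hypothetical, chosen only for illustration. The sketch assumes each agent step (for example, one tool call) produces a block of tokens and one environment reward, discounts rewards once per step rather than once per token, and then gives every token in a step the same step-level return.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent decision, e.g. a tool call (hypothetical structure)."""
    num_tokens: int  # tokens the model generated for this step
    reward: float    # reward observed after the step (often 0 until the end)


def step_returns(steps: List[Step], gamma: float = 0.9) -> List[float]:
    """Discounted return per step; gamma is applied once per step, not per token."""
    returns = [0.0] * len(steps)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


def broadcast_to_tokens(steps: List[Step], values: List[float]) -> List[float]:
    """Assign every token in a step the same step-level value."""
    per_token: List[float] = []
    for step, value in zip(steps, values):
        per_token.extend([value] * step.num_tokens)
    return per_token


# A three-step episode with a sparse, delayed reward on the final step.
episode = [Step(num_tokens=3, reward=0.0),
           Step(num_tokens=2, reward=0.0),
           Step(num_tokens=4, reward=1.0)]
returns = step_returns(episode, gamma=0.5)
token_values = broadcast_to_tokens(episode, returns)
```

Under these assumptions, the credit a step receives depends on how many decisions separate it from the reward, not on how many tokens those decisions happened to contain, which is the property the step-level framing is meant to capture.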
Key facts
- 01 StepPO is authored by Daoyu Wang, Qingchuan Li, and Mingyue Cheng.
- 02 The paper argues that the token-level MDP inherited from RLHF and RLVR is inadequate for agentic, multi-turn settings.
- 03 StepPO proposes advancing to a step-level MDP formulation, treating the step, not the token, as the fundamental action unit for LLM agents.
- 04 Step-level credit assignment is introduced to align policy optimization and reward propagation with agent decision granularity.
- 05 Key challenges addressed include delayed and sparse rewards, and long and variable context in multi-turn interactions.
- 06 Agent systems (referred to as 'Harnesses') such as OpenClaw and Claude Code are cited as motivating applications.
- 07 Preliminary experiments are included as initial evidence for the step-level perspective's effectiveness.