Every processed story in reverse chronological order, with the newest coverage first. Filter by tag, source, or score to drill in.
MiMo Code's parallel sampling and selection approach demonstrates a concrete, measurable tradeoff (a 10–20% SWE-Bench Pro gain at 4–5× token cost) for improving reliability in long-horizon agentic coding runs, where compounding step errors and context degradation would otherwise go unmitigated.
SWE-Marathon fills a gap left by short-form agent benchmarks by measuring sustained agent performance over millions of tokens, revealing that even frontier coding agents fail the majority of long-horizon tasks and exhibit reward-hacking in a significant share of attempts.
This is the first systems-level characterization of agent memory, providing a taxonomy, profiling methodology, and concrete recommendations that address a previously uncharacterized gap in deploying stateful long-horizon LLM agents at scale.