ASSAY framework boosts LLM agents by matching skills to tasks at inference time
Researchers introduce ASSAY, a framework that measures per-skill causal contributions and suppresses harmful skills at inference time, achieving state-of-the-art results on two agent benchmarks without any model weight updates.
ASSAY demonstrates that the key bottleneck for experience-based agent improvement is matching skills to tasks at inference time, not curating the skill library globally.
Yixuan Wang, Yiyang Zhou, and Yiming Liang present ASSAY, a framework addressing a fundamental flaw in how LLM agents manage accumulated experience. Current systems let LLM judgment handle both generating skills from experience and deciding which skills to keep, but the authors argue these are distinct roles. Generating a skill is a creative act suited to LLM judgment, while determining whether a skill actually improves performance requires empirical, causal evidence across many tasks. Using randomized masking to measure per-skill causal contributions, the authors find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, and because these opposing effects cancel in aggregate, they are invisible to global curation methods.
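To make the idea of randomized masking concrete, here is a minimal sketch, not the authors' implementation. It assumes each skill is kept independently with probability 0.5 and that a skill's effect is estimated as a simple difference in mean outcomes between runs with and without it. The names `estimate_effects`, `toy_agent`, and the skill `confirm_before_write` are hypothetical, and the toy agent is a deterministic stand-in for real agent runs.

```python
import random
from collections import defaultdict


def estimate_effects(skills, tasks, run_agent, trials=400, seed=0):
    """Difference-in-means effect of each skill, per task type and pooled ("ALL")."""
    rng = random.Random(seed)
    # totals[(skill, group)][was_active] -> [sum of outcomes, number of runs]
    totals = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for _ in range(trials):
        task = rng.choice(tasks)
        # Randomized masking: keep each skill independently with probability 0.5
        # (the masking probability is an assumption of this sketch).
        active = {s for s in skills if rng.random() < 0.5}
        outcome = run_agent(task, active)
        for s in skills:
            for group in (task["type"], "ALL"):
                cell = totals[(s, group)][s in active]
                cell[0] += outcome
                cell[1] += 1
    effects = {}
    for key, cell in totals.items():
        (sum_on, n_on), (sum_off, n_off) = cell[True], cell[False]
        if n_on and n_off:  # both arms must be observed to compare them
            effects[key] = sum_on / n_on - sum_off / n_off
    return effects


def toy_agent(task, active):
    """Hypothetical stand-in for an agent run; returns a success score in [0, 1]."""
    score = 0.5
    if "confirm_before_write" in active:
        # The same skill helps on one task type and hurts on the other.
        score += 0.4 if task["type"] == "booking" else -0.4
    return score


tasks = [{"type": "booking"}, {"type": "lookup"}]
effects = estimate_effects(["confirm_before_write"], tasks, toy_agent)
for key in sorted(effects):
    print(key, round(effects[key], 2))
```

In this toy setup the per-type estimates have opposite signs while the pooled ("ALL") estimate sits near zero, which is the cancellation pattern the summary describes: a library-wide score would see nothing worth acting on.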
ASSAY addresses this by computing per-skill causal attribution on a small development set, restructuring the library offline, and suppressing skills with a negative predicted effect for each specific test task at inference time. The framework was evaluated across seven base models spanning four providers on two benchmarks, AppWorld and tau-bench. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion, a 47.4% relative improvement and a new state of the art among all published methods, including weight-tuned approaches. On tau-bench retail, GPT-4.1 achieves an 8.7% relative improvement, surpassing o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation studies confirm that the dominant gain comes from per-task masking at inference time, not from globally removing bad skills, which pinpoints the core bottleneck as matching skills to tasks rather than library-wide curation. Code is available at https://github.com/aiming-lab/assay.
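The contrast between per-task masking and global curation can be sketched as follows. This is an illustrative sketch only: it assumes the predicted effect is a lookup keyed by skill and task type, which may differ from how ASSAY actually predicts effects, and it does not model the offline library restructuring step. The function names, skill names, and effect values are all hypothetical.

```python
def select_skills(library, task_type, predicted_effect):
    """Per-task masking: drop skills predicted to hurt on this task type."""
    # Skills with no estimate default to 0.0 and are kept (an assumption of this sketch).
    return [s for s in library if predicted_effect.get((s, task_type), 0.0) >= 0.0]


def curate_globally(library, pooled_effect):
    """Global curation baseline: one keep/drop decision per skill for all tasks."""
    return [s for s in library if pooled_effect.get(s, 0.0) >= 0.0]


library = ["confirm_before_write", "paginate_results"]

# Hypothetical attribution estimates from a development set.
predicted_effect = {
    ("confirm_before_write", "booking"): 0.4,
    ("confirm_before_write", "lookup"): -0.4,
    ("paginate_results", "booking"): 0.1,
    ("paginate_results", "lookup"): 0.2,
}
# Pooled over task types, the first skill's opposing effects cancel to zero.
pooled_effect = {"confirm_before_write": 0.0, "paginate_results": 0.15}

for_booking = select_skills(library, "booking", predicted_effect)
for_lookup = select_skills(library, "lookup", predicted_effect)
kept_globally = curate_globally(library, pooled_effect)

print("booking:", for_booking)
print("lookup:", for_lookup)
print("global:", kept_globally)
```

With these made-up numbers, global curation keeps both skills for every task, while per-task masking withholds `confirm_before_write` only on lookup tasks. Neither path touches model weights; only the set of skills shown to the agent changes.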
Key facts
- 01 ASSAY separates skill generation from skill curation, using randomized masking to measure per-skill causal contributions.
- 02 Skill libraries exhibit pervasive causal heterogeneity: individual skills help on some task types while hurting on others, with effects that cancel in aggregate.
- 03 ASSAY suppresses skills with a negative predicted effect for each specific test task at inference time, without modifying model weights.
- 04 Evaluated across seven base models spanning four providers on two benchmarks: AppWorld and tau-bench.
- 05 DeepSeek-V3 achieves 69.3% task-goal completion on AppWorld's hardest split, a 47.4% relative improvement and a new state of the art, including over weight-tuned approaches.
- 06 GPT-4.1 improves by 8.7% relative on tau-bench retail, surpassing o4-mini, o1, and GPT-4.5 on the public leaderboard.
- 07 Ablation studies confirm the dominant gain comes from per-task masking at inference time, not global removal of bad skills.
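For readers unsure how to read a "relative improvement", the short calculation below backs out the baseline implied by the AppWorld figures. It assumes the usual definition, relative gain = (new - baseline) / baseline. The resulting baseline is derived here from the two reported numbers and is not stated in this summary, so treat it as approximate, since both inputs are rounded.

```python
def implied_baseline(new_score, relative_gain):
    """Invert relative_gain = (new - baseline) / baseline to recover the baseline."""
    return new_score / (1.0 + relative_gain)


# Reported figures: 69.3% task-goal completion, 47.4% relative improvement.
baseline = implied_baseline(69.3, 0.474)
absolute_gain = 69.3 - baseline

print(f"implied baseline: {baseline:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} percentage points")
```

Under that assumption the implied starting point is roughly 47% task-goal completion, meaning the reported gain corresponds to about 22 percentage points in absolute terms.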
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.