Jun 13, 2026 · 1 min read · Research Papers

ASSAY framework boosts LLM agents by matching skills to tasks at inference time

Researchers introduce ASSAY, a framework that measures per-skill causal contributions and suppresses harmful skills at inference time, achieving state-of-the-art results on two agent benchmarks without any model weight updates.

ArXiv · Yixuan Wang, Yiyang Zhou, Yiming Liang

Composite score: 6.5 out of 10

Score weights: Novelty 25%, Impact 43%, Credibility 12%, Depth 20%

Why it matters

ASSAY shows that the key bottleneck for experience-based agent improvement is matching skills to tasks at inference time, not curating the skill library globally. It reaches state-of-the-art results on two benchmarks without any weight updates.

Summary: our read of the original

Yixuan Wang, Yiyang Zhou, and Yiming Liang present ASSAY, a framework addressing a fundamental flaw in how LLM agents manage accumulated experience. Current systems let LLM judgment handle both generating skills from experience and deciding which skills to keep — but the authors argue these are distinct roles. Generating a skill is a creative act suited to LLM judgment, while determining whether a skill actually improves performance requires empirical, causal evidence across many tasks. Using randomized masking to measure per-skill causal contributions, the authors find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods.
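The randomized-masking idea in the paragraph above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `estimate_effects`, the stand-in `toy_agent`, the skill names, and the task types are all hypothetical, and the estimator (difference in mean score with a skill present versus masked) is an assumed simplification of whatever attribution ASSAY actually computes.

```python
import random
from collections import defaultdict
from statistics import mean


def estimate_effects(skills, tasks, run_agent, group_by, trials=400, keep_prob=0.5, seed=0):
    """Difference in mean score with a skill present vs. masked, per task group."""
    rng = random.Random(seed)
    # outcomes[(skill, group)][is_present] -> scores observed under that condition
    outcomes = defaultdict(lambda: {True: [], False: []})
    for _ in range(trials):
        task = rng.choice(tasks)
        # Randomized masking: each skill is independently kept or hidden.
        active = {s for s in skills if rng.random() < keep_prob}
        score = run_agent(task, active)
        for s in skills:
            outcomes[(s, group_by(task))][s in active].append(score)
    return {
        key: mean(obs[True]) - mean(obs[False])
        for key, obs in outcomes.items()
        if obs[True] and obs[False]
    }


def toy_agent(task, active_skills):
    """Hypothetical stand-in for a real agent rollout; returns a score in [0, 1]."""
    score = 0.5
    if "retry_on_error" in active_skills:
        # Assumed heterogeneity: the skill helps on one task type and hurts on another.
        score += 0.3 if task["type"] == "api" else -0.3
    return score


SKILLS = ["retry_on_error", "log_steps"]
TASKS = [{"type": "api"}, {"type": "checkout"}]

# Effects estimated separately for each task type.
per_type = estimate_effects(SKILLS, TASKS, toy_agent, group_by=lambda t: t["type"])
# Effects estimated over all tasks pooled together (the "global" view).
pooled = estimate_effects(SKILLS, TASKS, toy_agent, group_by=lambda t: "all")
```

In this toy setup the per-type estimates for `retry_on_error` come out near +0.3 and -0.3, while the pooled estimate sits close to zero, which is the cancellation that would hide the skill's effect from a library-wide check.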

ASSAY addresses this by computing per-skill causal attribution on a small development set, restructuring the library offline, and suppressing skills with a negative predicted effect for each specific test task at inference time. The framework was evaluated across seven base models spanning four providers on two benchmarks, AppWorld and tau-bench. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion, a 47.4% relative improvement and a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, surpassing o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation studies confirm that the dominant gain comes from per-task masking at inference time, not from globally removing bad skills, pinpointing the core bottleneck as matching skills to tasks rather than library-wide curation. Code is available at https://github.com/aiming-lab/assay.
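A rough sketch of the difference between per-task suppression and global curation follows. The effect table, skill names, default effect of zero for unseen pairs, and both helper functions are illustrative assumptions, not details taken from the paper or its repository.

```python
# Hypothetical predicted effects, keyed by (skill, task type).
EFFECTS = {
    ("retry_on_error", "api"): 0.3,
    ("retry_on_error", "checkout"): -0.3,
    ("confirm_before_pay", "checkout"): 0.2,
}
LIBRARY = ["retry_on_error", "confirm_before_pay", "log_steps"]
TASK_TYPES = ["api", "checkout"]


def predict_effect(skill, task):
    """Look up a predicted effect; unseen pairs are assumed neutral (0.0)."""
    return EFFECTS.get((skill, task["type"]), 0.0)


def select_skills(library, task, predict):
    """Per-task masking: drop skills predicted to hurt this specific task."""
    return [s for s in library if predict(s, task) >= 0.0]


def curate_globally(library, task_types, predict):
    """Baseline for contrast: drop a skill only if its average effect is negative."""
    kept = []
    for s in library:
        avg = sum(predict(s, {"type": t}) for t in task_types) / len(task_types)
        if avg >= 0.0:
            kept.append(s)
    return kept


for_checkout = select_skills(LIBRARY, {"type": "checkout"}, predict_effect)
for_api = select_skills(LIBRARY, {"type": "api"}, predict_effect)
global_library = curate_globally(LIBRARY, TASK_TYPES, predict_effect)
```

Here global curation keeps `retry_on_error` because its average effect is zero, so the checkout task would still see a skill that hurts it. Per-task selection hides it for checkout only and leaves it available for the API task. No model weights are involved in either path; the only thing that changes is which skills are shown to the agent.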

Key facts

01 ASSAY separates skill generation from skill curation, using randomized masking to measure per-skill causal contributions.
02 Skill libraries exhibit pervasive causal heterogeneity: individual skills help on some task types while hurting on others, with effects that cancel in aggregate.
03 ASSAY suppresses skills with a negative predicted effect for each specific test task at inference time, without modifying model weights.
04 Evaluated across seven base models spanning four providers on two benchmarks: AppWorld and tau-bench.
05 DeepSeek-V3 achieves 69.3% task-goal completion on AppWorld's hardest split, a 47.4% relative improvement and a new state of the art, including over weight-tuned approaches.
06 GPT-4.1 improves by 8.7% relative on tau-bench retail, surpassing o4-mini, o1, and GPT-4.5 on the public leaderboard.
07 Ablation studies confirm the dominant gain comes from per-task masking at inference time, not global removal of bad skills.
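To put the relative figures in absolute terms, the short calculation below backs out the baseline implied by the AppWorld numbers. It assumes "relative improvement" means (new - baseline) / baseline; the summary does not state the baseline directly, so the result is an inference rather than a reported value, and the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_score, relative_gain):
    """Invert new = baseline * (1 + relative_gain) to recover the baseline."""
    return new_score / (1.0 + relative_gain)


# AppWorld hardest split: 69.3% completion after a 47.4% relative improvement.
appworld_base = implied_baseline(69.3, 0.474)
# Absolute gain in percentage points under the same assumption.
appworld_points = 69.3 - appworld_base
```

Under that assumption the implied baseline is roughly 47% task-goal completion, a gain of about 22 percentage points. The same calculation cannot be done for tau-bench retail, because the summary gives only the relative figure.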

Topics

#agent-framework #benchmarks #reasoning #multi-agent #tool-use

Methodology

Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 16, 2026 · 23:11 UTC.
