SkillAudit evolves agent skills without ground-truth feedback
SkillAudit is a framework that improves LLM agent skills through paired trajectory auditing — running tasks with and without a candidate skill to isolate behavioral differences — achieving 73.9% average task reward across 89 tasks without accessing hidden tests or external scoring.
SkillAudit removes the dependency on privileged external feedback signals that existing skill-evolution methods require, enabling agent skill improvement in real-world deployments where only a task description and workspace data are available.
SkillAudit, authored by Haowen Gao, Haoran Chen, and Can Wang, addresses a practical gap in LLM agent deployment: skills — structured procedural packages that guide frozen LLM agents through specialized workflows — degrade after deployment as edge cases, API changes, and deployment constraints emerge. Existing skill-evolution methods rely on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards, which are often unavailable in real-world settings where a practitioner has only a task description and workspace data.
The framework's central mechanism is paired trajectory auditing: at each iteration, the same task is executed both with and without the candidate skill, isolating the skill's behavioral contribution without external labels. To convert those behavioral differences into actionable edit guidance, SkillAudit employs Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals tied to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, enforces task constraints and rolls back harmful updates. Edits are routed through two pipelines: Refine, which removes noisy or irrelevant guidance from broadly useful skills, and Repair, which replaces passages that conflict with the task.
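The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the source publishes no API, so every name here (`Diagnostic`, `audit_pair`, `route_edit`, `evolve_step`, and the toy agent, evaluator, and verifier) is hypothetical. A skill is modeled as a list of passages, a trajectory as a list of action strings, and the evaluator and verifier are hand-written stand-ins for PACE and the compiled structural verifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Skill = List[str]        # assumption: a skill is a list of text passages
Trajectory = List[str]   # assumption: a trajectory is a list of action strings

@dataclass
class Diagnostic:
    passage_index: int   # passage of the skill document the signal points at
    kind: str            # "noisy" routes to Refine, "conflict" routes to Repair

def audit_pair(run_agent: Callable[[str, Optional[Skill]], Trajectory],
               task: str, skill: Skill) -> Tuple[Trajectory, Trajectory]:
    """Run the same task with and without the skill; return actions unique to each run."""
    with_skill = run_agent(task, skill)
    without_skill = run_agent(task, None)
    only_with = [a for a in with_skill if a not in without_skill]
    only_without = [a for a in without_skill if a not in with_skill]
    return only_with, only_without

def route_edit(skill: Skill, diag: Diagnostic, replacement: str) -> Skill:
    """Refine deletes a noisy passage; Repair replaces a conflicting one."""
    edited = list(skill)
    if diag.kind == "noisy":
        del edited[diag.passage_index]
    elif diag.kind == "conflict":
        edited[diag.passage_index] = replacement
    return edited

def evolve_step(skill, task, run_agent, evaluate, verifier, replacement):
    """One iteration: audit, diagnose, edit, then keep the edit only if the verifier passes."""
    divergence = audit_pair(run_agent, task, skill)
    candidate = skill
    # Apply edits from the last passage backwards so deletions do not shift indices.
    for diag in sorted(evaluate(skill, divergence), key=lambda d: -d.passage_index):
        candidate = route_edit(candidate, diag, replacement)
    # Roll back to the previous skill when the fixed verifier rejects the new trajectory.
    return candidate if verifier(run_agent(task, candidate)) else skill

# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_agent(task, skill):
    steps = [f"follow:{p}" for p in (skill or [])]
    return ["read_input", *steps, "write_output"]

def toy_evaluate(skill, divergence):
    only_with, _ = divergence
    diags = []
    for i, passage in enumerate(skill):
        if f"follow:{passage}" not in only_with:
            continue
        if "legacy" in passage:
            diags.append(Diagnostic(i, "conflict"))
        elif "verbose" in passage:
            diags.append(Diagnostic(i, "noisy"))
    return diags

def toy_verifier(trajectory):
    return trajectory[-1] == "write_output" and not any("legacy" in a for a in trajectory)

if __name__ == "__main__":
    skill = ["use legacy API", "validate schema", "add verbose logging"]
    print(evolve_step(skill, "task", toy_agent, toy_evaluate, toy_verifier, "use v2 API"))
    # prints ['use v2 API', 'validate schema']
```

No reward or hidden test appears anywhere in the loop: the only inputs are the two trajectories and a verifier that is fixed before evolution starts, which is the property the summary attributes to SkillAudit. In the framework itself the evaluators and verifier are presumably far richer than these string checks.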
Evaluated across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward — outperforming both an agent operating without skills (40.9%) and a static expert skill baseline (56.7%). Critically, these gains are obtained without accessing hidden tests, reference solutions, or external scoring functions at any point during the evolution process.
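To put the three reported averages side by side, the short calculation below derives the absolute and relative margins. The three reward values come from the summary above; the derived margins are plain arithmetic on them and are not figures reported by the source, and the dictionary keys and `margins` helper are illustrative names.

```python
# Average task rewards (percent) as reported in the summary.
rewards = {"no_skill": 40.9, "static_expert_skill": 56.7, "skillaudit": 73.9}

def margins(rewards, method="skillaudit"):
    """Return (absolute gain in points, relative gain in percent) over each baseline."""
    top = rewards[method]
    result = {}
    for name, base in rewards.items():
        if name == method:
            continue
        absolute = round(top - base, 1)
        relative = round(100 * (top - base) / base, 1)
        result[name] = (absolute, relative)
    return result

if __name__ == "__main__":
    for baseline, (points, percent) in margins(rewards).items():
        print(f"vs {baseline}: +{points} points ({percent}% relative)")
    # prints:
    # vs no_skill: +33.0 points (80.7% relative)
    # vs static_expert_skill: +17.2 points (30.3% relative)
```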
Key facts
- 01 SkillAudit is authored by Haowen Gao, Haoran Chen, and Can Wang and was posted to arXiv on 2026-06-12.
- 02 The framework evolves agent skills without ground-truth feedback such as held-out validation scores, hidden test outcomes, or environment rewards.
- 03 The core technique is paired trajectory auditing: the same task is run with and without the candidate skill to isolate behavioral differences.
- 04 Process-Aligned Contrastive Evaluation (PACE) maps trajectory divergences to diagnostic signals linked to specific passages in the skill document.
- 05 A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates.
- 06 Two editing pipelines are used: Refine (removes noisy or irrelevant guidance) and Repair (replaces passages that conflict with the task).
- 07 SkillAudit achieves 73.9% average task reward across 89 containerized tasks in 8 professional domains, versus 40.9% for an agent with no skill and 56.7% for a static expert skill.