A benchmark that measures how much Agent Skills (packages of procedural task knowledge) actually improve agent performance.
Human-written Skills are effective, with an average gain of 16.2 percentage points (+16.2%p). Model-generated Skills are nearly ineffective: models cannot generate procedural knowledge of the quality they need in order to benefit from consuming it.
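
The gain above is reported in percentage points (%p), which is an absolute difference between two rates, not a relative change. A minimal sketch of that arithmetic, assuming performance is a task pass rate; the function names and the outcome lists are hypothetical and are not data from the paper:

```python
def pass_rate(outcomes):
    """Share of solved tasks, expressed as a percentage (0 to 100)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def percentage_point_gain(without_skills, with_skills):
    """Absolute difference between two pass rates, in percentage points."""
    return pass_rate(with_skills) - pass_rate(without_skills)


# Made-up outcomes for illustration only (1 = task solved, 0 = failed).
baseline = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]           # 3 of 10 solved
with_human_skills = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # 5 of 10 solved

gain = percentage_point_gain(baseline, with_human_skills)
print(f"{gain:+.1f}%p")
```

In this toy case the pass rate moves from 30% to 50%, a gain of 20 percentage points, even though the relative improvement over the baseline is about 67%. The two readings should not be mixed when comparing Skill sources.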
https://arxiv.org/pdf/2602.12670

Seonglae Cho