Skillsbench

Creator
Seonglae Cho
Created
2026 Feb 20 11:32
Edited
2026 Feb 20 11:37
A benchmark created to measure how much Agent Skills (procedural task knowledge packages) actually improve performance.
Human-created Skills are effective, raising performance by 16.2 percentage points on average. Model-generated Skills are nearly ineffective: models cannot generate Procedural Knowledge at a level they can effectively consume.
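The "%p" figure above is a percentage-point difference, not a relative percent change. The following is a minimal sketch of how such an average uplift could be computed, assuming per-task pass rates with and without Skills. The function name, task names, and all numbers are hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def skill_uplift_pp(pass_rate_without: float, pass_rate_with: float) -> float:
    """Uplift in percentage points between two pass rates given in [0, 1]."""
    return (pass_rate_with - pass_rate_without) * 100


# Hypothetical (without Skills, with Skills) pass rates; illustrative only.
example_results = {
    "task_a": (0.50, 0.70),
    "task_b": (0.25, 0.25),
    "task_c": (0.40, 0.50),
}

# Per-task uplift, then the unweighted mean across tasks.
uplifts = {
    task: skill_uplift_pp(without, with_skill)
    for task, (without, with_skill) in example_results.items()
}
average_uplift = mean(uplifts.values())

print(f"average uplift: {average_uplift:+.1f}%p")
```

Going from a 50% to a 70% pass rate counts as +20 percentage points here, even though it is a 40% relative improvement, so the two measures should not be mixed when comparing results.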
arxiv.org
