Overview
The SoK integrates many systems and benchmarks to give coherent practical guidance, but empirical evidence is anchored to a few benchmarks (notably SkillsBench) and a marketplace case study; more cross-platform replication and production studies are needed.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 3/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Reusable skills turn procedural know-how into callable modules that raise agent reliability and can reduce model and runtime costs, but marketplaces and self-generation introduce supply-chain and quality risks.
Who Should Care
Summary TLDR
This Systematization of Knowledge defines "agentic skills": reusable, callable procedural modules (applicability, policy, termination, interface). It maps a skill lifecycle (discovery → practice → distillation → storage → retrieval/composition → execution → update), describes seven system-level design patterns (metadata, code-as-skill, workflow enforcement, self-evolving libraries, hybrid NL+code, meta-skills, marketplaces), shows deterministic evaluation approaches, and analyzes security/governance risks anchored by the ClawHavoc marketplace attack and SkillsBench results. Curated skills substantially help on many domains; unverified/self-generated skills often hurt. The paper highlights an
Problem Statement
LLM agents re-derive the same procedural routines for every task because procedural knowledge is lost when a session ends. What is needed is a reusable, executable unit of procedural memory (a "skill") and a practical map of how to discover, build, store, compose, evaluate, and govern skills for reliable long-horizon agent behavior.
Main Contribution
Unified formal definition of an agentic skill as a tuple (applicability C, policy π, termination T, interface R).
A seven-pattern system-level taxonomy for how skills are packaged and executed in practice.
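The tuple definition can be made concrete with a small sketch. This is an illustrative reading of (C, π, T, R), not the paper's reference implementation: the `Skill` class, the `State` dictionary, the `run_skill` loop, and the `max_steps` guard are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical agent state: a plain dictionary of named values.
State = Dict[str, Any]


@dataclass(frozen=True)
class Skill:
    """One possible encoding of a skill as the tuple (C, pi, T, R)."""
    name: str
    applicable: Callable[[State], bool]   # C: applicability condition
    policy: Callable[[State], State]      # pi: one step of the procedure
    terminated: Callable[[State], bool]   # T: termination condition
    interface: Dict[str, str]             # R: declared inputs and their types


def run_skill(skill: Skill, state: State, max_steps: int = 100) -> State:
    """Apply the policy until T holds; refuse to start if C does not hold."""
    if not skill.applicable(state):
        raise ValueError(f"skill {skill.name!r} is not applicable")
    for _ in range(max_steps):
        if skill.terminated(state):
            return state
        state = skill.policy(state)
    raise RuntimeError(f"skill {skill.name!r} did not terminate")


# Toy skill: increment a counter until it reaches a target.
count_up = Skill(
    name="count_up",
    applicable=lambda s: "count" in s and "target" in s,
    policy=lambda s: {**s, "count": s["count"] + 1},
    terminated=lambda s: s["count"] >= s["target"],
    interface={"count": "int", "target": "int"},
)

result = run_skill(count_up, {"count": 0, "target": 3})
```

Keeping applicability and termination as explicit predicates, separate from the policy, is what lets a caller check whether a skill should run and when it is done without executing it.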
Key Findings
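Curated skill libraries materially improve agent success on evaluated benchmarks (+16.2 pp average pass rate on SkillsBench versus no skills).
Self-generated skills often degrade performance versus having no skills (-1.3 pp average pass rate on SkillsBench).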
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Average pass-rate uplift (curated skills) | +16.2 pp | no-skills | +16.2 pp | SkillsBench (86 tasks, 7,308 trajectories) | SkillsBench overall comparison (§8.4) | §8.4 |
| Average pass-rate delta (self-generated skills) | -1.3 pp | no-skills | -1.3 pp | SkillsBench | SkillsBench found self-generated skills degrade average performance (§5.5, §8.4) | §5.5 §8.4 |
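The deltas in the table are percentage points (pp), which differ from relative percent change. The sketch below shows the arithmetic for measuring uplift on your own runs. The task counts are made up for illustration and are not SkillsBench data; `pass_rate` and `delta_pp` are hypothetical helper names.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of attempted tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def delta_pp(with_skills: float, baseline: float) -> float:
    """Absolute difference between two pass rates, in percentage points."""
    return with_skills - baseline


# Illustrative counts only (not taken from the paper).
baseline_rate = pass_rate(passed=40, total=100)   # no-skills baseline
curated_rate = pass_rate(passed=55, total=100)    # with curated skills

uplift_pp = delta_pp(curated_rate, baseline_rate)   # absolute change in pp
relative_pct = 100.0 * uplift_pp / baseline_rate    # relative change in percent
```

With these made-up counts the uplift is 15 pp but the relative improvement is 37.5%, so it matters which of the two a report quotes.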
What To Try In 7 Days
Audit your agent's skill surface: list skills, trust tiers, and which run autonomously.
Add deterministic verifiers or unit tests for high-value skills before enabling autonomous execution.
Start with a small curated skill set (2–3 focused modules per workflow) and measure pass-rate uplift.
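The first two steps above can be sketched as a small registry that records each skill's trust tier and only permits autonomous execution when deterministic verifiers exist and pass. This is a minimal sketch under assumptions: the tier names, `RegisteredSkill`, `may_run_autonomously`, and `audit` are hypothetical, and a real policy would likely need more than one check per skill.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SkillFn = Callable[[str], str]
Verifier = Callable[[SkillFn], bool]


@dataclass
class RegisteredSkill:
    name: str
    run: SkillFn
    trust_tier: str  # hypothetical tiers: "curated", "self_generated", "marketplace"
    verifiers: List[Verifier] = field(default_factory=list)


def may_run_autonomously(skill: RegisteredSkill) -> bool:
    """Allow autonomy only for curated skills with at least one verifier, all passing."""
    if skill.trust_tier != "curated":
        return False
    if not skill.verifiers:
        return False
    return all(check(skill.run) for check in skill.verifiers)


def audit(skills: List[RegisteredSkill]) -> List[Tuple[str, str, bool]]:
    """List each skill with its trust tier and whether it may run unattended."""
    return [(s.name, s.trust_tier, may_run_autonomously(s)) for s in skills]


# Toy skill and a deterministic verifier for it.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())


def slugify_check(run: SkillFn) -> bool:
    return run("Hello  World") == "hello-world"


skills = [
    RegisteredSkill("slugify", slugify, "curated", [slugify_check]),
    RegisteredSkill("untested", slugify, "curated"),
    RegisteredSkill("generated", slugify, "self_generated", [slugify_check]),
]
report = audit(skills)
```

In this sketch a skill with no verifiers is treated the same as a failing one, which errs on the side of requiring human approval.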
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
The empirical claims lean heavily on one anchor benchmark (SkillsBench) and a single marketplace case study (ClawHavoc), limiting generality.
Taxonomies derive from 24 systems studied in depth and may shift as new architectures and marketplaces emerge.
When Not To Use
Do not rely on self-generated skills without automated verification and held-out testing.
Avoid loading full instructions from unvetted marketplace skills into high-privilege agent contexts.
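One way to act on the second point is to gate what enters the agent's context by provenance. The sketch below withholds the full instruction body of unvetted marketplace skills when the agent runs with high privileges. `SkillPackage`, its fields, and `context_entry` are hypothetical names, and the rule itself is an assumption rather than a policy prescribed by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillPackage:
    name: str
    description: str    # short metadata shown to the agent
    instructions: str   # full instruction body
    source: str         # hypothetical values: "internal" or "marketplace"
    vetted: bool        # True once the package has passed review


def context_entry(pkg: SkillPackage, high_privilege: bool) -> str:
    """Return the text to place in the agent context for this skill."""
    if high_privilege and pkg.source == "marketplace" and not pkg.vetted:
        # Withhold the body; expose only the name and short description.
        return f"[{pkg.name}] (unvetted; instructions withheld) {pkg.description}"
    return f"[{pkg.name}] {pkg.description}\n{pkg.instructions}"


unvetted = SkillPackage(
    name="pdf-export",
    description="Exports a report to PDF.",
    instructions="Step 1: collect the report sections. Step 2: render to PDF.",
    source="marketplace",
    vetted=False,
)

privileged_view = context_entry(unvetted, high_privilege=True)
sandboxed_view = context_entry(unvetted, high_privilege=False)
```

This reduces exposure but does not eliminate it, since the description is also supplied by the package author and can itself be manipulated.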
Failure Modes
Skill poisoning through metadata manipulation, which causes the agent to select a skill inappropriately (C-poisoning).
Malicious payloads in code or NL parts of skills that exfiltrate credentials or escalate privileges.

