Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Reusable skills turn procedural know-how into callable modules that raise agent reliability and can reduce model and runtime costs, but marketplaces and self-generated skills introduce supply-chain and quality risks.
Summary TLDR
This Systematization of Knowledge defines "agentic skills": reusable, callable procedural modules (applicability, policy, termination, interface). It maps a skill lifecycle (discovery → practice → distillation → storage → retrieval/composition → execution → update), describes seven system-level design patterns (metadata, code-as-skill, workflow enforcement, self-evolving libraries, hybrid NL+code, meta-skills, marketplaces), shows deterministic evaluation approaches, and analyzes security/governance risks anchored by the ClawHavoc marketplace attack and SkillsBench results. Curated skills substantially help on many domains; unverified/self-generated skills often hurt. The paper highlights an
Problem Statement
LLM agents re-derive the same procedural routines on every task because procedural knowledge is lost when a session ends. What is needed is a reusable, executable unit of procedural memory, the "skill", together with a practical map of how to discover, build, store, compose, evaluate, and govern skills for reliable long-horizon agent behavior.
Main Contribution
Unified formal definition of an agentic skill as a tuple (applicability C, policy π, termination T, interface R).
A seven-pattern system-level taxonomy for how skills are packaged and executed in practice.
An orthogonal representation × scope taxonomy describing what skills are (NL, code, policy, hybrid) and where they act (web, OS, SWE, robotics, etc.).
A lifecycle model covering discovery, practice, distillation, storage, retrieval/composition, execution, and update.
A security and governance analysis with a case study (ClawHavoc) showing large-scale supply-chain attacks.
An evaluation framework and anchor benchmark evidence (SkillsBench) showing curated skills often help while self-generated skills can harm.
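To make the four-part definition concrete, the tuple can be sketched as a small Python data structure. This is a minimal illustration, not the paper's notation or any framework's API: the class name `Skill`, the field names `applicable`, `policy`, `done`, and `interface`, the `run_skill` helper, and the toy counting skill are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # hypothetical agent state: a plain dictionary


@dataclass
class Skill:
    """Illustrative container for the tuple (C, pi, T, R)."""
    name: str
    applicable: Callable[[State], bool]  # C: when may the skill be invoked?
    policy: Callable[[State], State]     # pi: one step of the procedure
    done: Callable[[State], bool]        # T: when should execution stop?
    interface: Dict[str, str]            # R: declared inputs and outputs


def run_skill(skill: Skill, state: State, max_steps: int = 10) -> State:
    """Check applicability, then step the policy until termination or a step cap."""
    if not skill.applicable(state):
        raise ValueError(f"skill {skill.name!r} is not applicable to this state")
    for _ in range(max_steps):
        if skill.done(state):
            break
        state = skill.policy(state)
    return state


# Toy skill (hypothetical): increment a counter until it reaches 3.
count_to_three = Skill(
    name="count_to_three",
    applicable=lambda s: "n" in s,
    policy=lambda s: {**s, "n": s["n"] + 1},
    done=lambda s: s["n"] >= 3,
    interface={"input": "n: int", "output": "n: int"},
)

result = run_skill(count_to_three, {"n": 0})
```

Keeping applicability and termination as explicit predicates, separate from the policy, is what lets a router or verifier reason about a skill without executing it.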
Key Findings
Curated skill libraries materially improve agent success on evaluated benchmarks.
Self-generated skills often degrade performance versus having no skills.
Skill benefit varies sharply by domain.
Marketplaces can be weaponized at large scale.
Supply-chain quality issues were widespread in a fast-growing registry.
A focused, small set of skill modules performs best.
Smaller models can close gaps with skills.
Results
Average pass-rate uplift (curated skills)
Average pass-rate delta (self-generated skills)
Malicious skill count in ClawHub (ClawHavoc)
Fraction of published skills with ≥1 security flaw
Best domain uplift (curated skills)
Who Should Care
What To Try In 7 Days
Audit your agent's skill surface: list skills, trust tiers, and which run autonomously.
Add deterministic verifiers or unit tests for high-value skills before enabling autonomous execution.
Start with a small curated skill set (2–3 focused modules per workflow) and measure pass-rate uplift.
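The second and third steps above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the paper's evaluation harness: the function names `pass_rate`, `uplift_pp`, and `passes_verifier`, the toy `slugify` skill, and the sample outcomes are all hypothetical.

```python
from typing import Any, Callable, List, Tuple


def pass_rate(results: List[bool]) -> float:
    """Pass rate in percent over a list of per-task outcomes."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)


def uplift_pp(baseline: List[bool], with_skills: List[bool]) -> float:
    """Pass-rate difference in percentage points (with skills minus baseline)."""
    return pass_rate(with_skills) - pass_rate(baseline)


def passes_verifier(skill_fn: Callable[[Any], Any],
                    cases: List[Tuple[Any, Any]]) -> bool:
    """Deterministic check: the skill must reproduce every expected output."""
    return all(skill_fn(x) == expected for x, expected in cases)


# Hypothetical skill under test: turn a title into a URL slug.
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")


autonomous_ok = passes_verifier(
    slugify, [("Hello World", "hello-world"), ("  A b ", "a-b")]
)

# Made-up outcomes for four tasks, without and with the curated skill set.
delta = uplift_pp([True, False, False, False], [True, True, True, False])
```

Running the same task set with and without the skill library, and gating autonomous execution on `passes_verifier`, keeps the measurement and the safety check independent of any model's self-assessment.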
Agent Features
Memory
- indexed skill libraries
- episodic context linking (multi-level memory)
Planning
- LLM-mediated routing
- embedding-based retrieval
- tree-search recovery (LATS)
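As a rough illustration of embedding-based retrieval over a skill library, the sketch below ranks skills by cosine similarity between a query and each skill description. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the skill names and descriptions are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, skills: Dict[str, str], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k skills whose descriptions score highest against the query."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc))) for name, desc in skills.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical library: skill name -> short description used for matching.
library = {
    "fill-web-form": "fill and submit a web form in the browser",
    "fix-failing-test": "locate and fix a failing unit test in a repository",
}

top = retrieve("fix the failing test", library, k=1)
```

Because selection depends entirely on the description text, this is also the surface that metadata poisoning targets: a skill with a misleading description can outrank the correct one.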
Tool Use
- tool orchestration (multi-tool macros)
- code-as-skill execution
- plugin/marketplace integration
Frameworks
- Voyager
- Claude Code
- Semantic Kernel
- LangChain
- OpenClaw/ClawHub
Is Agentic
true
Architectures
- LLM planner + skill library
- hierarchical skill composition
- sandboxed code runtimes
Collaboration
- multi-agent role-based pipelines
- shared skill repositories
Optimization Features
Token Efficiency
- metadata-driven loading (load full spec only on demand)
System Optimization
- indexing and embedding retrieval for quick skill lookup
Training Optimization
- distillation of traces into smaller policies (AgentTuning, FireAct)
Inference Optimization
- progressive disclosure to save context tokens
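Metadata-driven loading and progressive disclosure can be sketched as a library that exposes only short descriptions up front and returns a skill's full body when the agent asks for it. The `SkillLibrary` class, its method names, and the sample skills are assumptions for illustration, not the interface of any framework listed above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SkillMeta:
    name: str
    description: str  # the only text placed in context by default


class SkillLibrary:
    """Hypothetical store that separates cheap metadata from full skill bodies."""

    def __init__(self) -> None:
        self._meta: Dict[str, SkillMeta] = {}
        self._bodies: Dict[str, str] = {}
        self.loaded: List[str] = []  # audit trail of on-demand loads

    def register(self, name: str, description: str, body: str) -> None:
        self._meta[name] = SkillMeta(name, description)
        self._bodies[name] = body

    def catalog(self) -> str:
        """Compact listing that is always included in the prompt."""
        return "\n".join(f"{m.name}: {m.description}" for m in self._meta.values())

    def load(self, name: str) -> str:
        """Return the full instructions only when the skill is selected."""
        self.loaded.append(name)
        return self._bodies[name]


lib = SkillLibrary()
lib.register("csv-clean", "normalize messy CSV files", "step-by-step text " * 200)
lib.register("pdf-extract", "pull tables out of PDFs", "step-by-step text " * 200)

prompt_catalog = lib.catalog()       # short: two lines of metadata
full_spec = lib.load("csv-clean")    # long: fetched only for the chosen skill
```

The context cost of the catalog grows with the number of skills, while the cost of full specifications grows only with the number of skills actually used in a session.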
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- The empirical claims lean heavily on one anchor benchmark (SkillsBench) and a single marketplace case study (ClawHavoc), limiting generality.
- Taxonomies derive from an in-depth analysis of 24 systems and may shift as new architectures and marketplaces emerge.
- Many production systems and private datasets are unavailable, so coverage of industry practice may be incomplete.
When Not To Use
- Do not rely on self-generated skills without automated verification and held-out testing.
- Avoid loading full instructions from unvetted marketplace skills into high-privilege agent contexts.
- Avoid comprehensive, large-scope 'reference' skills that overload context and reduce performance.
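One way to enforce the marketplace guidance above is a trust-tier gate that decides whether a skill's full instructions may enter a given agent context. The tier names, privilege levels, and the `REQUIRED_TRUST` mapping below are hypothetical choices for this sketch, not a scheme taken from the paper.

```python
from enum import IntEnum


class Trust(IntEnum):
    UNVETTED = 0   # e.g. freshly published marketplace skill
    COMMUNITY = 1  # some review or reputation signal
    CURATED = 2    # reviewed and tested by the operator


class Privilege(IntEnum):
    SANDBOX = 0    # no credentials, isolated runtime
    USER = 1       # acts with user-level access
    ADMIN = 2      # high-privilege context


# Assumed policy: the more privileged the context, the more trust is required.
REQUIRED_TRUST = {
    Privilege.SANDBOX: Trust.UNVETTED,
    Privilege.USER: Trust.COMMUNITY,
    Privilege.ADMIN: Trust.CURATED,
}


def may_load_full_instructions(skill_trust: Trust, context: Privilege) -> bool:
    """Allow loading only if the skill meets the context's minimum trust tier."""
    return skill_trust >= REQUIRED_TRUST[context]


unvetted_in_admin = may_load_full_instructions(Trust.UNVETTED, Privilege.ADMIN)
unvetted_in_sandbox = may_load_full_instructions(Trust.UNVETTED, Privilege.SANDBOX)
```

Under this policy an unvetted skill can still be tried in a sandbox, which preserves experimentation while keeping it out of contexts that hold credentials.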
Failure Modes
- Skill poisoning through manipulated metadata, leading to inappropriate skill selection (C-poisoning).
- Malicious payloads in code or NL parts of skills that exfiltrate credentials or escalate privileges.
- Skill drift where unchanged skills break as APIs/UIs change, causing silent failures.
- Recursive amplification where meta-skills generate flawed skills that become templates for further flawed generation.
Core Entities
Models
- GPT-4
- Claude Opus/Haiku/Code (examples)
- Llama (AgentTuning context)
- Codex
- GPT-5.2 (referenced)
Metrics
- pass rate (percentage points)
- domain delta (pp)
- task success
- number of adversarial/malicious skills
- reputation score (0-100)
Datasets
- SkillsBench (86 tasks, 7,308 trajectories)
- WebArena
- Mind2Web
- OSWorld
- SWE-bench
- AgentBench
Benchmarks
- SkillsBench
- WebArena
- Mind2Web
- OSWorld
- SWE-bench
- AgentBench
- GAIA
- AndroidWorld

