Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

February 24, 2026 · 9 min

Overview

Decision Snapshot: Needs Validation

The SoK integrates many systems and benchmarks to give coherent practical guidance, but empirical evidence is anchored to a few benchmarks (notably SkillsBench) and a marketplace case study; more cross-platform replication and production studies are needed.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

Why It Matters For Business

Reusable skills turn procedural know-how into callable modules that raise agent reliability and can reduce model and runtime costs, but marketplaces and self-generation introduce supply-chain and quality risks.

Summary TLDR

This Systematization of Knowledge defines "agentic skills": reusable, callable procedural modules (applicability, policy, termination, interface). It maps a skill lifecycle (discovery → practice → distillation → storage → retrieval/composition → execution → update), describes seven system-level design patterns (metadata, code-as-skill, workflow enforcement, self-evolving libraries, hybrid NL+code, meta-skills, marketplaces), shows deterministic evaluation approaches, and analyzes security/governance risks anchored by the ClawHavoc marketplace attack and SkillsBench results. Curated skills substantially help on many domains; unverified/self-generated skills often hurt. The paper highlights an

Problem Statement

LLM agents repeatedly re-derive procedural routines per task because procedural knowledge disappears after a session. We need a reusable, executable unit of procedural memory—'skills'—and a practical map of how to discover, build, store, compose, evaluate, and govern them for reliable long-horizon agent behavior.

Main Contribution

Unified formal definition of an agentic skill as a tuple (applicability C, policy π, termination T, interface R).

A seven-pattern system-level taxonomy for how skills are packaged and executed in practice.
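
The four-part tuple above can be made concrete with a minimal Python sketch. The class name `Skill`, the field names, the dict-based state, and the `run_skill` loop are illustrative assumptions; only the four components (applicability, policy, termination, interface) come from the definition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # assumption: agent state modeled as a plain dict


@dataclass(frozen=True)
class Skill:
    """Hypothetical container for the four components (C, pi, T, R)."""
    name: str
    applicability: Callable[[State], bool]  # C: may the skill run in this state?
    policy: Callable[[State], State]        # pi: one step of the procedure
    termination: Callable[[State], bool]    # T: is the skill finished?
    interface: Dict[str, str]               # R: declared inputs and outputs


def run_skill(skill: Skill, state: State, max_steps: int = 10) -> State:
    """Apply the policy until the termination check passes or the step budget runs out."""
    if not skill.applicability(state):
        raise ValueError(f"skill {skill.name!r} is not applicable to this state")
    for _ in range(max_steps):
        if skill.termination(state):
            break
        state = skill.policy(state)
    return state


# Toy skill: increment a counter until it reaches a target.
count_up = Skill(
    name="count_up",
    applicability=lambda s: "count" in s and "target" in s,
    policy=lambda s: {**s, "count": s["count"] + 1},
    termination=lambda s: s["count"] >= s["target"],
    interface={"inputs": "count, target", "outputs": "count"},
)

result = run_skill(count_up, {"count": 0, "target": 3})
```

Keeping the applicability and termination checks as separate callables lets a runtime reject a skill before execution and bound how long it runs, which is one plausible reading of why the definition separates them from the policy.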

Key Findings

Curated skill libraries materially improve agent success on evaluated benchmarks.

Numbers: +16.2 pp average pass-rate gain

Practical Use: Invest in curated, verified skills for production agents to boost task pass rates; add deterministic verifiers where possible.

Evidence Ref: SkillsBench (§8.4)

Self-generated skills often degrade performance versus having no skills.

Numbers: -1.3 pp average delta for self-generated skills

Practical Use: Do not auto-admit agent-created skills into libraries without held-out verification and gating; require tests before reuse.

Evidence Ref: SkillsBench (§5.5, §8.4)
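
The gating advice for self-generated skills could look roughly like the sketch below. It assumes a candidate skill is a plain callable and that held-out cases are (input, expected output) pairs. The function names `held_out_pass_rate` and `admit_skill` and the default threshold are hypothetical, not taken from the paper or from SkillsBench.

```python
from typing import Callable, Iterable, Tuple

# Assumption: each held-out case is an (input, expected_output) pair that the
# skill never saw while it was being generated.
HeldOutCase = Tuple[object, object]


def held_out_pass_rate(skill: Callable[[object], object],
                       cases: Iterable[HeldOutCase]) -> float:
    """Fraction of held-out cases answered correctly; exceptions count as failures."""
    cases = list(cases)
    if not cases:
        return 0.0  # no evidence means no admission
    passed = 0
    for arg, expected in cases:
        try:
            if skill(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing skill fails the case
    return passed / len(cases)


def admit_skill(skill, cases, min_pass_rate: float = 1.0) -> bool:
    """Gate: admit only when the held-out pass rate meets the (hypothetical) threshold."""
    return held_out_pass_rate(skill, cases) >= min_pass_rate


cases = [(2, 4), (3, 9), (-1, 1)]
good_candidate = lambda x: x * x
bad_candidate = lambda x: x * 2  # correct on 2, wrong on the other cases

admitted_good = admit_skill(good_candidate, cases)
admitted_bad = admit_skill(bad_candidate, cases)
```

The second candidate passes the one case it was plausibly built around and fails the rest, which is the situation held-out verification is meant to catch.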

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average pass-rate uplift (curated skills) | +16.2 pp | no-skills | +16.2 pp | SkillsBench (86 tasks, 7,308 trajectories) | SkillsBench overall comparison (§8.4) | §8.4
Average pass-rate delta (self-generated skills) | -1.3 pp | no-skills | -1.3 pp | SkillsBench | SkillsBench found self-generated skills degrade average performance (§5.5, §8.4) | §5.5 §8.4
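
The deltas above are reported in percentage points (pp), which are absolute differences between pass rates rather than relative percent changes. The short sketch below illustrates the distinction. The counts are invented for illustration and are not SkillsBench data, and the helper names are hypothetical.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage of attempted tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def delta_pp(with_skills: float, baseline: float) -> float:
    """Absolute difference in percentage points, not a relative percent change."""
    return with_skills - baseline


# Hypothetical counts for illustration only; not SkillsBench data.
baseline = pass_rate(passed=40, total=100)
with_skills = pass_rate(passed=55, total=100)

uplift = delta_pp(with_skills, baseline)   # absolute gain in pp
relative = 100.0 * uplift / baseline       # relative gain in percent
```

With these made-up counts the absolute gain is 15 pp while the relative gain is 37.5 percent, so the two figures should not be compared directly.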

What To Try In 7 Days

Audit your agent's skill surface: list skills, trust tiers, and which run autonomously.

Add deterministic verifiers or unit tests for high-value skills before enabling autonomous execution.

Start with a small curated skill set (2–3 focused modules per workflow) and measure pass-rate uplift.
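
The audit step above can start as a simple inventory check. In this sketch the record fields, the tier names ("curated", "self-generated", "marketplace"), and the flagging rule are all assumptions chosen for illustration; adapt them to however your agent framework actually describes skills.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillRecord:
    name: str
    trust_tier: str      # assumed tiers: "curated", "self-generated", "marketplace"
    autonomous: bool     # runs without human approval?
    has_verifier: bool   # deterministic verifier or unit tests attached?


def audit(records: List[SkillRecord]) -> List[str]:
    """Names of skills that run autonomously without a verifier or outside the curated tier."""
    flagged = []
    for r in records:
        if r.autonomous and (not r.has_verifier or r.trust_tier != "curated"):
            flagged.append(r.name)
    return sorted(flagged)


# Hypothetical inventory for illustration.
inventory = [
    SkillRecord("format_invoice", "curated", autonomous=True, has_verifier=True),
    SkillRecord("scrape_prices", "marketplace", autonomous=True, has_verifier=True),
    SkillRecord("retry_deploy", "self-generated", autonomous=False, has_verifier=False),
    SkillRecord("rotate_keys", "curated", autonomous=True, has_verifier=False),
]

needs_review = audit(inventory)
```

Here `scrape_prices` is flagged for its tier and `rotate_keys` for its missing verifier, while `retry_deploy` is left alone because it does not run autonomously.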

Agent Features

Memory
indexed skill libraries; episodic context linking (multi-level memory)
Planning
LLM-mediated routing; embedding-based retrieval; tree-search recovery (LATS)
Tool Use
tool orchestration (multi-tool macros); code-as-skill execution; plugin/marketplace integration
Frameworks
Voyager; Claude Code; Semantic Kernel; LangChain; OpenClaw/ClawHub
Is Agentic

Yes

Architectures
LLM planner + skill library; hierarchical skill composition; sandboxed code runtimes
Collaboration
multi-agent role-based pipelines; shared skill repositories
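
As a rough illustration of the embedding-based retrieval listed under Planning, the sketch below ranks skills by cosine similarity to a query vector. The three-dimensional vectors are hand-written stand-ins for the output of a real embedding model, and the skill names and the `retrieve` helper are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float],
             index: Dict[str, List[float]],
             k: int = 1) -> List[Tuple[str, float]]:
    """Return the k skills whose vectors are most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy index: these vectors are assumptions, not real embeddings.
skill_index = {
    "open_pull_request": [1.0, 0.0, 0.0],
    "book_flight": [0.0, 1.0, 0.0],
    "run_unit_tests": [0.8, 0.0, 0.6],
}

top = retrieve([0.9, 0.0, 0.1], skill_index, k=2)
top_names = [name for name, _ in top]
```

A production system would typically replace the linear scan with a vector index, but the ranking logic stays the same.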

Optimization Features

Token Efficiency
metadata-driven loading (load full spec only on demand)
System Optimization
indexing and embedding retrieval for quick skill lookup
Training Optimization
distillation of traces into smaller policies (AgentTuning, FireAct)
Inference Optimization
progressive disclosure to save context tokens
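
Metadata-driven loading and progressive disclosure can be sketched as a library that keeps only one-line descriptions in the prompt and fetches a full spec when a skill is actually selected. The class `LazySkillLibrary`, its methods, and the in-memory `FULL_SPECS` store are hypothetical and stand in for whatever storage a real system uses.

```python
from typing import Callable, Dict


class LazySkillLibrary:
    """Hypothetical library: metadata stays in memory, full specs load on first use."""

    def __init__(self, metadata: Dict[str, str], loader: Callable[[str], str]):
        self._metadata = metadata          # name -> one-line description
        self._loader = loader              # fetches the full spec for one skill
        self._cache: Dict[str, str] = {}
        self.loads = 0                     # number of times the loader was called

    def catalog(self) -> str:
        """Compact listing meant for the prompt in place of every full spec."""
        return "\n".join(f"{name}: {desc}"
                         for name, desc in sorted(self._metadata.items()))

    def full_spec(self, name: str) -> str:
        """Load the full spec only when the skill is selected, then cache it."""
        if name not in self._metadata:
            raise KeyError(name)
        if name not in self._cache:
            self._cache[name] = self._loader(name)
            self.loads += 1
        return self._cache[name]


# Stand-in storage; a real system might read files or query a registry.
FULL_SPECS = {
    "pdf_extract": "Step 1: open the file. Step 2: extract text. Step 3: validate output.",
    "csv_clean": "Step 1: detect delimiter. Step 2: drop empty rows. Step 3: normalize headers.",
}

library = LazySkillLibrary(
    metadata={"pdf_extract": "Extract text from PDFs", "csv_clean": "Clean CSV files"},
    loader=FULL_SPECS.__getitem__,
)

prompt_catalog = library.catalog()
spec = library.full_spec("pdf_extract")
spec_again = library.full_spec("pdf_extract")  # served from the cache
```

The token saving comes from the gap between the catalog and the full specs: the agent sees every skill's description but pays for the long instructions of only the skills it uses.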

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The empirical claims lean heavily on one anchor benchmark (SkillsBench) and a single marketplace case study (ClawHavoc), limiting generality.

Taxonomies derive from 24 in-depth systems and may shift as new architectures and marketplaces emerge.

When Not To Use

Do not rely on self-generated skills without automated verification and held-out testing.

Avoid loading full instructions from unvetted marketplace skills into high-privilege agent contexts.
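
The second caution above can be enforced with a small policy check before a skill's full instructions reach the agent context. The trust and privilege levels and the mapping between them are assumptions made for this sketch, not a scheme described in the paper.

```python
from enum import IntEnum


class Trust(IntEnum):
    UNVETTED = 0   # e.g. a fresh marketplace download
    REVIEWED = 1
    CURATED = 2


class Privilege(IntEnum):
    SANDBOX = 0
    USER = 1
    ADMIN = 2


# Assumed policy: the minimum trust a skill needs for each context privilege.
MIN_TRUST = {
    Privilege.SANDBOX: Trust.UNVETTED,
    Privilege.USER: Trust.REVIEWED,
    Privilege.ADMIN: Trust.CURATED,
}


def may_load_full_instructions(skill_trust: Trust, context: Privilege) -> bool:
    """Allow loading only if the skill's trust meets the context's minimum."""
    return skill_trust >= MIN_TRUST[context]


unvetted_in_sandbox = may_load_full_instructions(Trust.UNVETTED, Privilege.SANDBOX)
unvetted_in_admin = may_load_full_instructions(Trust.UNVETTED, Privilege.ADMIN)
curated_in_admin = may_load_full_instructions(Trust.CURATED, Privilege.ADMIN)
```

Under this policy an unvetted skill can still be exercised in a sandbox, which leaves room for testing it before any promotion to a higher tier.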

Failure Modes

Skill poisoning through metadata manipulations leading to inappropriate selection (C-poisoning).

Malicious payloads in code or NL parts of skills that exfiltrate credentials or escalate privileges.

Core Entities

Models

GPT-4; Claude Opus/Haiku/Code (examples); Llama (AgentTuning context); Codex; GPT-5.2 (referenced)

Metrics

pass rate (percentage points); domain delta (pp); task success; number of adversarial/malicious skills; reputation score (0-100)

Datasets

SkillsBench (86 tasks, 7,308 trajectories); WebArena; Mind2Web; OSWorld; SWE-bench; AgentBench

Benchmarks

SkillsBench; WebArena; Mind2Web; OSWorld; SWE-bench; AgentBench; GAIA; AndroidWorld