Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

Overview

Decision Snapshot: Needs Validation

The SoK integrates many systems and benchmarks to give coherent practical guidance, but empirical evidence is anchored to a few benchmarks (notably SkillsBench) and a marketplace case study; more cross-platform replication and production studies are needed.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 3/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

Links

Abstract / PDF

Why It Matters For Business

Reusable skills turn procedural know-how into callable modules that raise agent reliability and can reduce model and runtime costs, but marketplaces and self-generation introduce supply-chain and quality risks.

Who Should Care

Product Manager, CTO, Engineering Lead, ML Engineer, Founder

Summary TLDR

This Systematization of Knowledge defines "agentic skills": reusable, callable procedural modules (applicability, policy, termination, interface). It maps a skill lifecycle (discovery → practice → distillation → storage → retrieval/composition → execution → update), describes seven system-level design patterns (metadata, code-as-skill, workflow enforcement, self-evolving libraries, hybrid NL+code, meta-skills, marketplaces), shows deterministic evaluation approaches, and analyzes security/governance risks anchored by the ClawHavoc marketplace attack and SkillsBench results. Curated skills substantially help on many domains; unverified/self-generated skills often hurt. The paper highlights an

Problem Statement

LLM agents repeatedly re-derive procedural routines per task because procedural knowledge disappears after a session. We need a reusable, executable unit of procedural memory—'skills'—and a practical map of how to discover, build, store, compose, evaluate, and govern them for reliable long-horizon agent behavior.

Main Contribution

Unified formal definition of an agentic skill as a tuple (applicability C, policy π, termination T, interface R).

A seven-pattern system-level taxonomy for how skills are packaged and executed in practice.
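The tuple definition above can be made concrete with a small sketch. This is only an illustration of the four components, not the paper's implementation: the `Skill` class, the `run_skill` loop, the dictionary-based state, and the toy `count_up` skill are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # hypothetical state representation (assumption)


@dataclass(frozen=True)
class Skill:
    """Illustrative container for the (C, pi, T, R) tuple."""
    name: str
    applicability: Callable[[State], bool]  # C: may the skill start in this state?
    policy: Callable[[State], State]        # pi: one step of the procedure
    termination: Callable[[State], bool]    # T: has the skill finished?
    interface: Dict[str, str]               # R: declared inputs and outputs


def run_skill(skill: Skill, state: State, max_steps: int = 100) -> State:
    """Apply the policy until T holds or the step budget is exhausted."""
    if not skill.applicability(state):
        raise ValueError(f"skill {skill.name!r} is not applicable")
    for _ in range(max_steps):
        if skill.termination(state):
            return state
        state = skill.policy(state)
    if skill.termination(state):
        return state
    raise RuntimeError(f"skill {skill.name!r} did not terminate")


# Toy skill (made up for illustration): count up to a target.
count_up = Skill(
    name="count_up",
    applicability=lambda s: "n" in s and "target" in s,
    policy=lambda s: {**s, "n": s["n"] + 1},
    termination=lambda s: s["n"] >= s["target"],
    interface={"inputs": "n, target", "outputs": "n"},
)

result = run_skill(count_up, {"n": 0, "target": 3})
```

Keeping applicability and termination as explicit predicates, separate from the policy, is what lets a caller check whether a skill may start and when it is done without inspecting the policy itself.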

Key Findings

Curated skill libraries materially improve agent success on evaluated benchmarks.

Numbers: +16.2 pp average pass-rate gain

Practical Use: Invest in curated, verified skills for production agents to boost task pass rates; add deterministic verifiers where possible.

Evidence Ref: SkillsBench (§8.4)

Self-generated skills often degrade performance versus having no skills.

Numbers: -1.3 pp average delta for self-generated skills

Practical Use: Do not auto-admit agent-created skills into libraries without held-out verification and gating; require tests before reuse.

Evidence Ref: SkillsBench (§5.5, §8.4)
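The gating advice above could look roughly like the following. This is a minimal sketch under stated assumptions: the `admit_skill` function, the `(input, expected)` case format, and the default threshold are hypothetical and do not come from the paper or SkillsBench.

```python
from typing import Callable, List, Tuple

# A held-out case pairs an input with its expected output (hypothetical format).
Case = Tuple[int, int]


def admit_skill(candidate: Callable[[int], int],
                held_out: List[Case],
                min_pass_rate: float = 1.0) -> bool:
    """Admit a candidate skill only if it passes enough held-out cases."""
    if not held_out:
        return False  # no evidence means no admission
    passed = 0
    for arg, expected in held_out:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(held_out) >= min_pass_rate


# Made-up cases for a "square the input" skill.
held_out_cases = [(2, 4), (3, 9), (-1, 1)]


def good_candidate(x: int) -> int:
    return x * x


def bad_candidate(x: int) -> int:
    return x + x  # matches the expected output only for x == 2


admitted_good = admit_skill(good_candidate, held_out_cases)
admitted_bad = admit_skill(bad_candidate, held_out_cases)
```

The bad candidate agrees with the correct one on a single case, which is the situation held-out testing is meant to catch: a skill that looked right on the example it was distilled from.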

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average pass-rate uplift (curated skills)	+16.2 pp	no-skills	+16.2 pp	SkillsBench (86 tasks, 7,308 trajectories)	SkillsBench overall comparison (§8.4)	§8.4
Average pass-rate delta (self-generated skills)	-1.3 pp	no-skills	-1.3 pp	SkillsBench	SkillsBench found self-generated skills degrade average performance (§5.5, §8.4)	§5.5 §8.4

What To Try In 7 Days

Audit your agent's skill surface: list skills, trust tiers, and which run autonomously.

Add deterministic verifiers or unit tests for high-value skills before enabling autonomous execution.

Start with a small curated skill set (2–3 focused modules per workflow) and measure pass-rate uplift.
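For the measurement step above, pass-rate uplift is the difference between two pass rates, expressed in percentage points (pp). The sketch below shows the arithmetic only; the function names and the run outcomes are made up for illustration and are not SkillsBench data.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Pass rate in percent for a list of per-task pass/fail outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def uplift_pp(with_skills: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Difference between two pass rates, in percentage points."""
    return pass_rate(with_skills) - pass_rate(baseline)


# Made-up outcomes on the same five tasks (illustrative only).
baseline_runs = [True, False, False, False, True]  # 2 of 5 pass
skill_runs = [True, True, False, True, False]      # 3 of 5 pass

delta = uplift_pp(skill_runs, baseline_runs)  # 60% - 40% = 20 pp
```

Reporting the delta in percentage points rather than as a relative change keeps it comparable with the figures quoted in the Results table.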

Agent Features

Memory

indexed skill libraries

episodic context linking (multi-level memory)

Planning

LLM-mediated routing

embedding-based retrieval

tree-search recovery (LATS)

Tool Use

tool orchestration (multi-tool macros)

code-as-skill execution

plugin/marketplace integration

Frameworks

Voyager

Claude Code

Semantic Kernel

LangChain

OpenClaw/ClawHub

Is Agentic

Yes

Architectures

LLM planner + skill library

hierarchical skill composition

sandboxed code runtimes

Collaboration

multi-agent role-based pipelines

shared skill repositories

Optimization Features

Token Efficiency

metadata-driven loading (load full spec only on demand)
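One way to picture metadata-driven loading: only a one-line summary per skill goes into the prompt, and the full spec is fetched when a skill is actually selected. The sketch below assumes a hypothetical `SkillIndex` with lazy, cached loaders; the class names, skill names, and spec strings are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

load_log: List[str] = []  # records which full specs were actually fetched


@dataclass
class SkillEntry:
    name: str
    summary: str                  # short metadata, cheap to keep in context
    load_spec: Callable[[], str]  # hypothetical loader for the full spec
    _spec: Optional[str] = field(default=None, repr=False)

    def full_spec(self) -> str:
        """Fetch the full spec on first use and cache it afterwards."""
        if self._spec is None:
            load_log.append(self.name)
            self._spec = self.load_spec()
        return self._spec


class SkillIndex:
    def __init__(self, entries: Iterable[SkillEntry]) -> None:
        self._entries: Dict[str, SkillEntry] = {e.name: e for e in entries}

    def catalog(self) -> str:
        """One metadata line per skill; no full specs are loaded."""
        return "\n".join(f"{e.name}: {e.summary}" for e in self._entries.values())

    def expand(self, name: str) -> str:
        """Return the full spec for one selected skill."""
        return self._entries[name].full_spec()


index = SkillIndex([
    SkillEntry("csv_report", "Build a summary report from a CSV file",
               lambda: "FULL SPEC: step-by-step CSV report instructions"),
    SkillEntry("git_bisect", "Locate a regression with git bisect",
               lambda: "FULL SPEC: step-by-step bisect instructions"),
])

prompt_catalog = index.catalog()         # metadata only
spec = index.expand("csv_report")        # full spec loaded on demand
spec_again = index.expand("csv_report")  # served from the cache
```

The token saving comes from the catalog staying small however long the individual specs are; only the skills that get selected pay their full context cost.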

System Optimization

indexing and embedding retrieval for quick skill lookup
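Embedding retrieval for skill lookup usually reduces to ranking stored skill vectors by similarity to a query vector. The sketch below uses cosine similarity over hand-written three-dimensional vectors; in practice the vectors would come from an embedding model, and the skill names and numbers here are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 1) -> List[str]:
    """Names of the k skills whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


# Toy vectors standing in for real embeddings (made up for illustration).
skill_vectors = {
    "parse_invoice": [0.9, 0.1, 0.0],
    "send_email": [0.0, 0.2, 0.9],
    "summarize_pdf": [0.6, 0.7, 0.1],
}
query_vector = [0.8, 0.2, 0.0]

best = top_k(query_vector, skill_vectors, k=2)
```

A linear scan like this is fine for a few hundred skills; larger libraries would typically put an approximate nearest-neighbour index in front of the same similarity measure.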

Training Optimization

distillation of traces into smaller policies (AgentTuning, FireAct)

Inference Optimization

progressive disclosure to save context tokens

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

The empirical claims lean heavily on one anchor benchmark (SkillsBench) and a single marketplace case study (ClawHavoc), limiting generality.

Taxonomies derive from 24 in-depth systems and may shift as new architectures and marketplaces emerge.

When Not To Use

Do not rely on self-generated skills without automated verification and held-out testing.

Avoid loading full instructions from unvetted marketplace skills into high-privilege agent contexts.
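The second caution above can be enforced mechanically with a trust-tier check that runs before a skill's full instructions are loaded. The sketch below is an assumption-laden illustration: the tier names, privilege levels, and the mapping between them are hypothetical and would need to reflect a real deployment's policy.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    """Hypothetical trust tiers for skills, lowest to highest."""
    UNVETTED = 0
    COMMUNITY_REVIEWED = 1
    VERIFIED = 2


class Privilege(IntEnum):
    """Hypothetical privilege levels for agent contexts."""
    SANDBOX = 0
    USER = 1
    ADMIN = 2


# Minimum tier a skill needs before its full instructions may be loaded
# into a context with the given privilege (illustrative policy only).
REQUIRED_TIER = {
    Privilege.SANDBOX: TrustTier.UNVETTED,
    Privilege.USER: TrustTier.COMMUNITY_REVIEWED,
    Privilege.ADMIN: TrustTier.VERIFIED,
}


def may_load_full_instructions(tier: TrustTier, privilege: Privilege) -> bool:
    """True if a skill at this tier may be fully loaded at this privilege."""
    return tier >= REQUIRED_TIER[privilege]


unvetted_in_admin = may_load_full_instructions(TrustTier.UNVETTED, Privilege.ADMIN)
unvetted_in_sandbox = may_load_full_instructions(TrustTier.UNVETTED, Privilege.SANDBOX)
verified_in_admin = may_load_full_instructions(TrustTier.VERIFIED, Privilege.ADMIN)
```

Putting the policy in a single table makes it auditable: a reviewer can see at a glance which tiers reach which contexts.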

Failure Modes

Skill poisoning through metadata manipulation that causes a skill to be selected inappropriately (C-poisoning).

Malicious payloads in code or NL parts of skills that exfiltrate credentials or escalate privileges.

Core Entities

Models

GPT-4

Claude Opus/Haiku/Code (examples)

Llama (AgentTuning context)

Codex

GPT-5.2 (referenced)

Metrics

pass rate (percentage points)

domain delta (pp)

task success

number of adversarial/malicious skills

reputation score (0-100)

Datasets

SkillsBench (86 tasks, 7,308 trajectories)

WebArena

Mind2Web

OSWorld

SWE-bench

AgentBench

Benchmarks

SkillsBench

WebArena

Mind2Web

OSWorld

SWE-bench

AgentBench

GAIA

AndroidWorld

