Systematizes reusable 'agentic skills' for LLM agents, their lifecycle, design patterns, risks, and evaluation

February 24, 20269 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

Links

Abstract / PDF

Why It Matters For Business

Reusable skills turn procedural know-how into callable modules that raise agent reliability and can reduce model/runtime costs; but marketplaces and self-generation introduce supply-chain and quality risks.

Summary TLDR

This Systematization of Knowledge defines "agentic skills": reusable, callable procedural modules (applicability, policy, termination, interface). It maps a skill lifecycle (discovery → practice → distillation → storage → retrieval/composition → execution → update), describes seven system-level design patterns (metadata, code-as-skill, workflow enforcement, self-evolving libraries, hybrid NL+code, meta-skills, marketplaces), shows deterministic evaluation approaches, and analyzes security/governance risks anchored by the ClawHavoc marketplace attack and SkillsBench results. Curated skills substantially help on many domains; unverified/self-generated skills often hurt. The paper highlights an

Problem Statement

LLM agents repeatedly re-derive procedural routines per task because procedural knowledge disappears after a session. We need a reusable, executable unit of procedural memory—'skills'—and a practical map of how to discover, build, store, compose, evaluate, and govern them for reliable long-horizon agent behavior.

Main Contribution

Unified formal definition of an agentic skill as a tuple (applicability C, policy π, termination T, interface R).

A seven-pattern system-level taxonomy for how skills are packaged and executed in practice.

An orthogonal representation × scope taxonomy describing what skills are (NL, code, policy, hybrid) and where they act (web, OS, SWE, robotics, etc.).

A lifecycle model covering discovery, practice, distillation, storage, retrieval/composition, execution, and update.

A security and governance analysis with a case study (ClawHavoc) showing large-scale supply-chain attacks.

An evaluation framework and anchor benchmark evidence (SkillsBench) showing curated skills often help while self-generated skills can harm.

Key Findings

Curated skill libraries materially improve agent success on evaluated benchmarks.

Numbers+16.2 pp average pass-rate gain

Self-generated skills often degrade performance versus having no skills.

Numbers-1.3 pp average delta for self-generated skills

Skill benefit varies sharply by domain.

NumbersHealthcare +51.9 pp; Manufacturing +41.9 pp; SWE +4.5 pp

Marketplaces can be weaponized at large scale.

Numbers≈1,184 malicious skills found in ClawHub (ClawHavoc)

Supply-chain quality issues were widespread in a fast-growing registry.

Numbers36.8% of published skills had ≥1 security flaw (Snyk)

A focused, small set of skill modules performs best.

Numbers2–3 module skills → +18.6 pp; 4+ skills → +5.9 pp

Smaller models can close gaps with skills.

NumbersClaude Haiku 4.5 w/ skills 27.7% vs Claude Opus 4.5 w/o skills 22.0%

Results

Average pass-rate uplift (curated skills)

Value+16.2 pp

Baselineno-skills

Average pass-rate delta (self-generated skills)

Value-1.3 pp

Baselineno-skills

Malicious skill count in ClawHub (ClawHavoc)

Value≈1,184 malicious skills

Fraction of published skills with ≥1 security flaw

Value36.8%

Best domain uplift (curated skills)

Value+51.9 pp

Baselineno-skills

Who Should Care

What To Try In 7 Days

Audit your agent's skill surface: list skills, trust tiers, and which run autonomously.

Add deterministic verifiers or unit tests for high-value skills before enabling autonomous execution.

Start with a small curated skill set (2–3 focused modules per workflow) and measure pass-rate uplift.

Agent Features

Memory

  • indexed skill libraries
  • episodic context linking (multi-level memory)

Planning

  • LLM-mediated routing
  • embedding-based retrieval
  • tree-search recovery (LATS)

Tool Use

  • tool orchestration (multi-tool macros)
  • code-as-skill execution
  • plugin/marketplace integration

Frameworks

  • Voyager
  • Claude Code
  • Semantic Kernel
  • LangChain
  • OpenClaw/ClawHub

Is Agentic

true

Architectures

  • LLM planner + skill library
  • hierarchical skill composition
  • sandboxed code runtimes

Collaboration

  • multi-agent role-based pipelines
  • shared skill repositories

Optimization Features

Token Efficiency

  • metadata-driven loading (load full spec only on demand)

System Optimization

  • indexing and embedding retrieval for quick skill lookup

Training Optimization

  • distillation of traces into smaller policies (AgentTuning, FireAct)

Inference Optimization

  • progressive disclosure to save context tokens

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • The empirical claims lean heavily on one anchor benchmark (SkillsBench) and a single marketplace case study (ClawHavoc), limiting generality.
  • Taxonomies derive from 24 in-depth systems and may shift as new architectures and marketplaces emerge.
  • Many production systems and private datasets are unavailable, so coverage of industry practice may be incomplete.

When Not To Use

  • Do not rely on self-generated skills without automated verification and held-out testing.
  • Avoid loading full instructions from unvetted marketplace skills into high-privilege agent contexts.
  • Avoid comprehensive, large-scope 'reference' skills that overload context and reduce performance.

Failure Modes

  • Skill poisoning through metadata manipulations leading to inappropriate selection (C-poisoning).
  • Malicious payloads in code or NL parts of skills that exfiltrate credentials or escalate privileges.
  • Skill drift where unchanged skills break as APIs/UIs change, causing silent failures.
  • Recursive amplification where meta-skills generate flawed skills that become templates for further flawed generation.

Core Entities

Models

  • GPT-4
  • Claude Opus/Haiku/Code (examples)
  • Llama (AgentTuning context)
  • Codex
  • GPT-5.2 (referenced)

Metrics

  • pass rate (percent points)
  • domain delta (pp)
  • task success
  • number of adversarial/malicious skills
  • reputation score (0-100)

Datasets

  • SkillsBench (86 tasks, 7,308 trajectories)
  • WebArena
  • Mind2Web
  • OSWorld
  • SWE-bench
  • AgentBench

Benchmarks

  • SkillsBench
  • WebArena
  • Mind2Web
  • OSWorld
  • SWE-bench
  • AgentBench
  • GAIA
  • AndroidWorld