AutoRefine: automatically extract reusable skills and subagents from past runs to continually improve LLM agents

Overview

Decision Snapshot: Needs Validation

The approach is engineering-ready for research and pilot production: it shows clear, repeatable gains on public benchmarks but requires tuning (K, k, θ) and careful auditing of extracted patterns before wide deployment.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 65%

Authors

Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Shuo Tang

Links

Abstract / PDF / Data

Why It Matters For Business

AutoRefine automates turning past successful runs into reusable procedures and small specialized agents, which lowers action counts and improves success on complex planning—reducing latency and operation cost for task-heavy, tool-using products.

Who Should Care

ML Engineer, Product Manager, Engineering Lead

Summary TLDR

AutoRefine is a pipeline that turns successful agent execution traces into two reusable artifacts: skill patterns (text or code guidelines) and subagent patterns (small specialized agents with memory). It extracts patterns in batches, retrieves them with multi-query embeddings, and maintains the repository by scoring entries, pruning the bottom 20%, and merging similar entries. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, it reaches success rates of 98.4%, 70.4%, and 27.1%, respectively, and cuts steps by 20–73% relative to reflection-based baselines. The approach suits complex procedural tasks where encapsulating stateful subtasks makes reuse easier.
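The merge step in the maintenance loop can be pictured with a small sketch. This is not the authors' implementation: the `Pattern` fields, the greedy strategy, and the 0.9 cosine threshold are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Pattern:
    # Hypothetical record: a name, an embedding of its context, and a usage count.
    name: str
    embedding: list[float]
    uses: int = 0


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar(patterns: list[Pattern], threshold: float = 0.9) -> list[Pattern]:
    """Greedy merge: fold each pattern into the first kept pattern that is
    at least `threshold` similar, summing usage counts; otherwise keep it."""
    kept: list[Pattern] = []
    for p in patterns:
        for q in kept:
            if cosine(p.embedding, q.embedding) >= threshold:
                q.uses += p.uses
                break
        else:
            # No similar pattern kept so far: keep a copy of this one.
            kept.append(Pattern(p.name, list(p.embedding), p.uses))
    return kept
```

A greedy pass is order-dependent, so a real system might prefer clustering or have an LLM rewrite the merged entry; the sketch only shows how near-duplicates can be collapsed to keep the repository compact.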

Problem Statement

LLM agents fail to accumulate procedural knowledge: past work stores flat text or raw trajectories that cannot capture multi-step logic or state. As experience accumulates, repositories become large and noisy because there is no active maintenance, which hurts retrieval and performance.

Main Contribution

Dual-pattern extraction: automatically create skill patterns (guidelines or code) and subagent patterns (small specialized agents with memory) from execution traces.

Batch, contrastive extraction: extract patterns from groups of successful vs failed trajectories to find generalizable strategies.
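The two contributions above can be sketched as data structures plus a batching helper. Everything here is an illustrative assumption (class names, fields, prompt wording, and the batch size of 10); the summary does not specify the paper's actual schema or prompts.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    actions: list[str]
    success: bool


@dataclass
class SkillPattern:
    name: str
    context: str    # when the guideline applies
    guideline: str  # text or code snippet


@dataclass
class SubagentPattern:
    name: str
    context: str
    instructions: str
    memory: dict = field(default_factory=dict)  # local, stateful memory


def contrastive_batches(trajectories: list[Trajectory], batch_size: int = 10):
    """Yield (successes, failures) for each consecutive batch of trajectories."""
    for start in range(0, len(trajectories), batch_size):
        batch = trajectories[start:start + batch_size]
        wins = [t for t in batch if t.success]
        losses = [t for t in batch if not t.success]
        yield wins, losses


def build_extraction_prompt(wins: list[Trajectory], losses: list[Trajectory]) -> str:
    """Assemble a prompt that contrasts successful and failed runs.
    The wording is a placeholder, not the paper's prompt."""
    def fmt(ts: list[Trajectory]) -> str:
        return "\n".join(f"- {t.task}: {' -> '.join(t.actions)}" for t in ts) or "- (none)"
    return (
        "SUCCESSFUL trajectories:\n" + fmt(wins) + "\n\n"
        "FAILED trajectories:\n" + fmt(losses) + "\n\n"
        "Extract strategies that appear in successes but not in failures. "
        "Label each as a skill pattern or a subagent pattern."
    )
```

The prompt returned by `build_extraction_prompt` would be sent to whichever LLM performs extraction; parsing its reply into `SkillPattern` or `SubagentPattern` objects is left out of the sketch.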

Key Findings

AutoRefine improves success rates across diverse benchmarks.

Numbers: ALFWorld 98.4% ±1.5, ScienceWorld 70.4% ±1.9, TravelPlanner 27.1% ±2.4

Practical Use: Use AutoRefine to boost agent success on tasks that benefit from reusable procedures; expect the largest gains in domains with procedural structure.

Evidence Ref: Table 1; §4.2

AutoRefine reduces action steps substantially versus reflection-based baselines.

Numbers: Steps reduced by 20.5% (ALFWorld), 59.0% (ScienceWorld), 72.8% (TravelPlanner)

Practical Use: If your system pays a per-action or latency cost, adding pattern extraction can cut operational steps and cost, especially on hard planning tasks.

Evidence Ref: Table 1; §4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ALFWorld success rate (SR)	98.4% ±1.5	ReAct+Reflexion 95.5% ±1.9	+2.9 percentage points	ALFWorld (test)	Table 1, §4.2	Table 1
ScienceWorld pass rate	70.4% ±1.9	ReAct+Reflexion 69.2% ±2.1	+1.2 percentage points	ScienceWorld (selected tasks)	Table 1, §4.2	Table 1
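The deltas in the table are absolute differences between two percentages, which is easy to confuse with a relative improvement. A short sketch using only the two rows above makes the distinction explicit; the function name `deltas` is illustrative.

```python
def deltas(autorefine: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent),
    both rounded to one decimal place."""
    absolute_pp = round(autorefine - baseline, 1)
    relative_pct = round(100 * (autorefine - baseline) / baseline, 1)
    return absolute_pp, relative_pct


# (AutoRefine, ReAct+Reflexion) success rates from the table above.
results = {
    "ALFWorld": (98.4, 95.5),
    "ScienceWorld": (70.4, 69.2),
}

for name, (ours, base) in results.items():
    pp, rel = deltas(ours, base)
    print(f"{name}: +{pp} pp absolute, +{rel}% relative")
```

Note that both gains are of similar size to the reported ± spreads, so they are best read together with the step-count reductions rather than on their own.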

What To Try In 7 Days

Run a small AutoRefine pipeline on your recent agent logs: extract patterns every K=10 tasks and measure step count and success change.

Implement maintenance: score patterns by success-rate, usage, and precision and prune bottom 20% to control repository size.

Seed one human-designed subagent for a hard subtask and compare cold-start gains (as in AttractionPlanner experiment).
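The maintenance step in the list above (score, then prune the bottom 20%) could be prototyped as follows. This is a minimal sketch: the equal-style weighted sum, the weights, and the `PatternStats` fields are assumptions, since this summary does not give the paper's scoring formula.

```python
import math
from dataclasses import dataclass


@dataclass
class PatternStats:
    # Hypothetical bookkeeping record, one per stored pattern.
    name: str
    success_rate: float  # in [0, 1]
    usage_count: int     # how often the pattern was applied
    precision: float     # in [0, 1], as measured by your own logging


def score(p: PatternStats, max_usage: int,
          w_success: float = 0.5, w_usage: float = 0.2, w_precision: float = 0.3) -> float:
    """Weighted sum of the three signals; usage is normalized by the most-used pattern."""
    usage = p.usage_count / max_usage if max_usage else 0.0
    return w_success * p.success_rate + w_usage * usage + w_precision * p.precision


def prune_bottom(patterns: list[PatternStats], fraction: float = 0.2) -> list[PatternStats]:
    """Drop the lowest-scoring `fraction` of patterns (rounded down); return the rest,
    ordered from highest to lowest score."""
    if not patterns:
        return []
    max_usage = max(p.usage_count for p in patterns)
    ranked = sorted(patterns, key=lambda p: score(p, max_usage), reverse=True)
    n_drop = math.floor(len(ranked) * fraction)
    return ranked[:len(ranked) - n_drop]
```

Calling `prune_bottom` after each extraction round (for example every K=10 tasks) keeps the repository from growing without bound; logging which patterns were dropped makes the auditing step easier.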

Agent Features

Memory

subagent local memory (stateful), metadata-driven repository usage stats

Planning

pattern-augmented planning, multi-query retrieval planning

Tool Use

Frameworks

AutoRefine pattern repository, contrastive batch extraction agents

Is Agentic

Yes

Architectures

hierarchical delegation, pattern repository + retrieval

Collaboration

master coordinator delegates to subagents
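A toy version of this delegation scheme, with each subagent keeping its own local memory, might look like the sketch below. The class names, the keyword-based routing rule, and the `attraction_planner` example are hypothetical; a real coordinator would use an LLM and the pattern repository to decide when to delegate.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Subagent:
    name: str
    handles: Callable[[str], bool]                    # routing predicate (assumed)
    memory: list[str] = field(default_factory=list)   # local, per-subagent state

    def run(self, subtask: str) -> str:
        # Record the subtask so later calls can build on earlier ones.
        self.memory.append(subtask)
        return f"{self.name} handled '{subtask}' (memory size {len(self.memory)})"


class Coordinator:
    def __init__(self, subagents: list[Subagent]):
        self.subagents = list(subagents)

    def dispatch(self, subtask: str) -> str:
        """Delegate to the first subagent that claims the subtask, else handle it directly."""
        for agent in self.subagents:
            if agent.handles(subtask):
                return agent.run(subtask)
        return f"coordinator handled '{subtask}' directly"


planner = Subagent("attraction_planner", lambda s: "attraction" in s)
coordinator = Coordinator([planner])
print(coordinator.dispatch("pick attraction for day 1"))
print(coordinator.dispatch("book flight"))
```

Keeping memory inside the subagent rather than in the coordinator's context is what lets a stateful subtask be reused without replaying its full history.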

Optimization Features

Token Efficiency

reduces redundant reasoning steps by delegating to subagents

Infra Optimization

Elasticsearch + embedding-based retrieval to scale pattern lookup
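To illustrate the retrieval idea without an Elasticsearch cluster, the sketch below runs multi-query lookup over an in-memory dictionary. The aggregation rule (take each pattern's best cosine similarity across all queries) and the toy three-dimensional vectors are assumptions; in practice the embeddings would come from an embedding model and the search from a vector index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def multi_query_retrieve(query_embeddings: list[list[float]],
                         pattern_embeddings: dict[str, list[float]],
                         top_k: int = 3) -> list[tuple[str, float]]:
    """Score each pattern by its best similarity over all queries and return
    the top_k (name, score) pairs. Expects at least one query embedding."""
    scored = []
    for name, emb in pattern_embeddings.items():
        best = max(cosine(q, emb) for q in query_embeddings)
        scored.append((name, best))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy data: pattern names and vectors are made up for the example.
patterns = {
    "heat_object": [1.0, 0.0, 0.0],
    "clean_object": [0.0, 1.0, 0.0],
    "book_hotel": [0.0, 0.0, 1.0],
}
queries = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]  # two rephrasings of one task
print(multi_query_retrieve(queries, patterns, top_k=2))
```

Issuing several queries per task lets a pattern surface if it matches any one phrasing, at the cost of more embedding calls per lookup.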

System Optimization

percentile-based pruning (bottom 20%) to limit repository growth

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

ALFWorld (Shridhar et al.), ScienceWorld (Wang et al.), TravelPlanner (Xie et al.)

Risks & Boundaries

Limitations

Pattern type classification can mislabel strategies, creating unnecessary subagents or missing stateful logic (A.2).

Pattern contexts can be too broad or too narrow, causing low retrieval precision or missed opportunities (A.2).

When Not To Use

Small, static tasks where single-shot LLM reasoning suffices and pattern overhead is unnecessary.

Privacy-sensitive domains where execution traces cannot be safely stored or patterns might leak sensitive data.

Failure Modes

Misclassified patterns lead to many subagent calls and invocation overhead.

No maintenance: repository bloat causes low utilization and degraded retrieval quality.

Core Entities

Models

Claude-sonnet-4, GPT-4-turbo, Qwen3-Embedding-4B

Metrics

Success Rate (SR / Pass@1 / Final Pass Rate), Steps (mean), Repository size (patterns), Pattern utilization rate (u_j / r_j)
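The utilization metric can be computed from two counters per pattern. The sketch assumes that u_j is the number of times pattern j was actually used and r_j the number of times it was retrieved; this summary does not define the symbols, so treat that reading, the function names, and the 0.1 threshold as assumptions.

```python
def utilization_rate(used: int, retrieved: int) -> float:
    """u_j / r_j, defined as 0.0 for a pattern that was never retrieved."""
    return used / retrieved if retrieved else 0.0


def low_utilization(counts: dict[str, tuple[int, int]], threshold: float = 0.1) -> list[str]:
    """Return, in alphabetical order, the patterns whose utilization is below `threshold`.
    `counts` maps a pattern name to (used, retrieved)."""
    return sorted(
        name for name, (used, retrieved) in counts.items()
        if utilization_rate(used, retrieved) < threshold
    )
```

Patterns that are retrieved often but rarely used are natural candidates for a narrower context description or for pruning.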

Datasets

ALFWorld, ScienceWorld, TravelPlanner

Benchmarks

ALFWorld, ScienceWorld, TravelPlanner

Context Entities

Models

GPT-4-turbo (used for ALFWorld comparisons)
