Make LLM-based coding agents learn from and prune past shortcuts to improve code quality and stability

Overview

Decision Snapshot: Needs Validation

The method is practical and shows dataset-level gains with ChatGPT-3.5, but is evaluated on a single dataset and model, so expect further engineering for production use.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 45%

Novelty: 60%

Authors

Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Data

Why It Matters For Business

Iteratively refining and pruning agent experiences cuts noisy guidance, raises code quality by ~10% (relative) on the tested benchmark, and shrinks the stored experience set to ~11.5% of its original size, which lowers storage and retrieval costs.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO, Founder, Data Scientist

Summary TLDR

This paper introduces Iterative Experience Refinement (IER), a framework that lets multi-agent, LLM-based software developers iteratively collect, reuse, and prune "shortcut" experiences (solution→instruction and instruction→solution pairs). Two propagation patterns are studied: successive (inherit last batch) and cumulative (inherit all history). A heuristic elimination step keeps high-information and frequently used experiences, shrinking the pool to 11.54% while improving or maintaining software quality on the SRDD benchmark using ChatGPT-3.5.

Problem Statement

Current experience-enabled LLM agents use a fixed, heuristically collected set of past experiences. That static pool cannot be refined over time, which limits adaptability and lets low-quality or rarely used experiences accumulate and dilute useful guidance.

Main Contribution

Propose Iterative Experience Refinement (IER) to acquire, propagate, and refine agent experiences across task batches.

Define two propagation patterns: successive (inherit from previous batch) and cumulative (inherit from all past batches).
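
The two propagation patterns can be illustrated with a minimal sketch. The `Shortcut` type and the `propagate` function are hypothetical names chosen for this example and are not taken from the paper; the sketch only shows which past batches each pattern inherits from, not how experiences are extracted or refined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shortcut:
    """One stored experience (hypothetical structure): a source text and its shortcut target."""
    source: str
    target: str


def propagate(history: list[list[Shortcut]], pattern: str) -> list[Shortcut]:
    """Return the experiences that the next task batch inherits.

    history holds one list of shortcuts per completed batch, oldest first.
    """
    if not history:
        return []
    if pattern == "successive":
        # Inherit only from the most recent batch.
        return list(history[-1])
    if pattern == "cumulative":
        # Inherit from every past batch.
        return [s for batch in history for s in batch]
    raise ValueError(f"unknown pattern: {pattern!r}")


# Toy history of three batches (illustrative data only).
history = [
    [Shortcut("code v1", "add input validation")],
    [Shortcut("code v2", "handle empty file"), Shortcut("fix the crash", "code v3")],
    [Shortcut("code v4", "add unit tests")],
]
successive_pool = propagate(history, "successive")
cumulative_pool = propagate(history, "cumulative")
```

In this toy run the successive pool holds only the last batch's single shortcut, while the cumulative pool holds all four, which is why the cumulative pattern depends more heavily on an elimination step to stay small.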

Key Findings

IER improves end-to-end software quality compared to prior experience-based methods on SRDD.

Numbers: Quality: IER-Successive 0.6372 vs ECL 0.5775 (+10.3% rel)

Practical Use: Add iterative experience refinement to multi-agent coding pipelines to gain about a 10% relative quality uplift on the evaluated benchmark.

Evidence Ref: Table 1

Heuristic elimination concentrates useful experiences and drastically reduces pool size.

Numbers: Experience pool reduced from 8,053 to 930 (11.54% retained)

Practical Use: Apply information-gain and retrieval-frequency filters to keep ~10–12% of experiences and cut storage and retrieval costs while keeping or improving performance.

Evidence Ref: Section 4.3
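
A filter of this kind might look like the sketch below. This summary does not give the paper's scoring formula or thresholds, so everything here is an assumption: `info_gain` is a hypothetical precomputed score, `hits` is a hypothetical retrieval counter, and the two cut-offs are arbitrary illustrative values. Requiring both conditions at once is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    text: str
    info_gain: float  # hypothetical score: how much the shortcut improves on its source
    hits: int = 0     # hypothetical counter: how often retrieval returned this experience


def eliminate(pool: list[Experience], gain_threshold: float, min_hits: int) -> list[Experience]:
    """Keep experiences that are both informative enough and retrieved often enough.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return [e for e in pool if e.info_gain >= gain_threshold and e.hits >= min_hits]


# Toy pool (illustrative data only).
pool = [
    Experience("add input validation", info_gain=0.92, hits=5),
    Experience("rename a variable", info_gain=0.10, hits=7),
    Experience("handle empty file", info_gain=0.95, hits=0),
    Experience("add unit tests", info_gain=0.90, hits=2),
]
kept = eliminate(pool, gain_threshold=0.9, min_hits=2)
retained = len(kept) / len(pool)  # fraction of the pool that survives
```

On the toy pool the low-gain entry and the never-retrieved entry are dropped, leaving two of four experiences. Logging the retained fraction per batch gives a direct way to compare a local setup against the pool reduction reported above.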

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Quality (completeness×executability×consistency)	IER-Successive 0.6372	ECL 0.5775	+0.0597 (+10.3% rel)	SRDD (avg over dataset)	Table 1: Quality scores	Table 1
Executability	IER-Successive 0.9146	ECL 0.8643	+0.0503	SRDD (avg)	Table 1: Executability scores	Table 1
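
The arithmetic behind the table can be checked with a short sketch. The product form of Quality comes from the metric label above; the function names are hypothetical. Because the reported scores are dataset averages, multiplying averaged components will not necessarily reproduce the averaged Quality value.

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Quality as the product of its three components, each assumed to lie in [0, 1]."""
    return completeness * executability * consistency


def absolute_delta(new: float, baseline: float) -> float:
    """Difference between a score and its baseline."""
    return new - baseline


def relative_gain(new: float, baseline: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (new - baseline) / baseline


# Values taken from the Results table (IER-Successive vs ECL).
quality_delta = round(absolute_delta(0.6372, 0.5775), 4)      # 0.0597
quality_rel_pct = round(relative_gain(0.6372, 0.5775) * 100, 1)  # 10.3
exec_delta = round(absolute_delta(0.9146, 0.8643), 4)          # 0.0503
```

Since Quality is a product, a single weak component pulls the whole score down, so a gain in Executability alone only partly explains the Quality gain.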

What To Try In 7 Days

Run a small task-batch pipeline and log solution→instruction shortcuts during runs.

Implement vector-based retrieval (embeddings + cosine similarity) to reuse shortcuts as few-shot examples.

Experiment with two patterns: successive (only last batch) and cumulative (all history) and compare quality vs stability over 3–6 batches.
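
For the retrieval step, a minimal sketch using only the standard library is shown below. The vectors are toy stand-ins for real embeddings (the paper uses text-embedding-ada-002; no embedding API is called here), and `cosine` and `retrieve` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], pool: list[tuple[list[float], str]], k: int = 2) -> list[str]:
    """Return the texts of the k pool entries most similar to the query vector."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# Toy pool of (embedding, shortcut text) pairs; the vectors are made up for illustration.
pool = [
    ([1.0, 0.0], "validate user input before parsing"),
    ([0.0, 1.0], "draw the game board with a GUI toolkit"),
    ([0.9, 0.1], "reject malformed input early"),
]
examples = retrieve([1.0, 0.0], pool, k=2)
```

The retrieved texts can then be placed in the prompt as few-shot examples. A linear scan like this is adequate for a pruned pool of a few hundred entries; a larger pool would likely call for a vector index instead.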

Agent Features

Memory

experience pool of shortcuts (solution→instruction and instruction→solution)

iterative update (successive or cumulative)

Planning

iterative refinement across batches

Tool Use

vector-based retrieval

external compiler for validation

Frameworks

ChatDev, ECL, ExpeL

Is Agentic

Yes

Architectures

multi-agent (instructive + responsive)

batch-wise experience propagation

Collaboration

role-based agent communication (instructor and responder)

Optimization Features

System Optimization

experience elimination reduces pool size and retrieval load

Inference Optimization

vector retrieval to reduce search latency

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

SRDD (referenced from Qian et al. 2023a)

Risks & Boundaries

Limitations

Evaluation uses only ChatGPT-3.5; results may differ with other LLMs.

Benchmark set is SRDD only; domain diversity is limited.

When Not To Use

Tasks needing novel, non-repeated solutions where past shortcuts can mislead.

Safety-critical or auditable code where automated reuse of prior shortcuts is risky.

Failure Modes

Experience pool growth dilutes high-quality experiences (cumulative pattern).

Poor refinements in one batch can propagate and degrade future results (successive pattern).

Core Entities

Models

ChatGPT-3.5, text-embedding-ada-002

Metrics

Completeness, Executability, Consistency, Quality, Duration

Datasets

SRDD

