Make LLM-based coding agents learn from and prune past shortcuts to improve code quality and stability

May 7, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The method is practical and improves quality on the SRDD benchmark with ChatGPT-3.5, but it is evaluated on a single dataset and a single model, so expect further engineering and validation before production use.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 45%

Production readiness: 45%

Novelty: 60%

Authors

Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun

Why It Matters For Business

Iteratively refining and pruning agent experiences cuts noisy guidance, raises code quality by about 10% (relative) on the tested benchmark, and shrinks the stored experience set to about 11.5% of its original size, which lowers storage and retrieval costs.

Summary TLDR

This paper introduces Iterative Experience Refinement (IER), a framework that lets LLM-based multi-agent software-development systems iteratively collect, reuse, and prune "shortcut" experiences (solution→instruction and instruction→solution pairs). Two propagation patterns are studied: successive (inherit from the last batch only) and cumulative (inherit from all history). A heuristic elimination step keeps high-information and frequently used experiences, shrinking the pool to 11.54% of its original size while improving or maintaining software quality on the SRDD benchmark with ChatGPT-3.5.

Problem Statement

Current experience-enabled LLM agents use a fixed, heuristically collected set of past experiences. That static pool cannot be refined over time, which limits adaptability and lets low-quality or rarely used experiences accumulate and dilute useful guidance.

Main Contribution

Propose Iterative Experience Refinement (IER) to acquire, propagate, and refine agent experiences across task batches.

Define two propagation patterns: successive (inherit from previous batch) and cumulative (inherit from all past batches).
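The two propagation patterns can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `propagate` function, the `Shortcut` tuple type, and the `fake_acquire` stand-in for a real agent run are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a shortcut is a (source, target) text pair.
Shortcut = Tuple[str, str]


def propagate(
    batches: List[List[str]],
    acquire: Callable[[List[str], List[Shortcut]], List[Shortcut]],
    pattern: str = "successive",
) -> List[List[Shortcut]]:
    """Return the experience pool each batch inherits, in batch order."""
    if pattern not in ("successive", "cumulative"):
        raise ValueError(f"unknown pattern: {pattern!r}")

    acquired_per_batch: List[List[Shortcut]] = []
    inherited_pools: List[List[Shortcut]] = []
    for batch in batches:
        if pattern == "successive":
            # Inherit only what the previous batch produced (empty for the first batch).
            pool = list(acquired_per_batch[-1]) if acquired_per_batch else []
        else:
            # Inherit everything produced by all earlier batches.
            pool = [s for past in acquired_per_batch for s in past]
        inherited_pools.append(pool)
        acquired_per_batch.append(acquire(batch, pool))
    return inherited_pools


def fake_acquire(batch: List[str], pool: List[Shortcut]) -> List[Shortcut]:
    """Stand-in for an agent run: emits one made-up shortcut per task."""
    return [(task, f"solution for {task}") for task in batch]


batches = [["t1"], ["t2"], ["t3"]]
print(propagate(batches, fake_acquire, "successive")[2])
print(propagate(batches, fake_acquire, "cumulative")[2])
```

With three single-task batches, the third batch inherits one shortcut under the successive pattern and two under the cumulative pattern, which is the trade-off between a small, fresh pool and a larger, possibly diluted one.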

Key Findings

IER improves end-to-end software quality compared to prior experience-based methods on SRDD.

Numbers: Quality: IER-Successive 0.6372 vs ECL 0.5775 (+10.3% relative)

Practical Use: Add iterative experience refinement to multi-agent coding pipelines to gain about a 10% relative quality uplift on the evaluated benchmark.

Evidence Ref: Table 1

Heuristic elimination concentrates useful experiences and drastically reduces pool size.

Numbers: Experience pool reduced from 8,053 to 930 (11.54% retained)

Practical Use: Apply information-gain and retrieval-frequency filters to keep ~10–12% of experiences and cut storage and retrieval costs while keeping or improving performance.

Evidence Ref: Section 4.3
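A filter of this kind might look like the sketch below. It is illustrative only: the `Experience` fields, the threshold names `min_gain` and `min_uses`, and the choice to require both criteria at once are assumptions made here, not details confirmed by the summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    text: str
    info_gain: float       # hypothetical score: how much the shortcut improved a solution
    retrieval_count: int   # how often the shortcut was retrieved in past batches


def eliminate(pool: List[Experience], min_gain: float, min_uses: int) -> List[Experience]:
    """Keep experiences that are both informative and actually used.

    Assumption: an experience must pass BOTH thresholds to survive.
    """
    return [
        e for e in pool
        if e.info_gain >= min_gain and e.retrieval_count >= min_uses
    ]


pool = [
    Experience("add input validation", info_gain=0.40, retrieval_count=5),
    Experience("rename variable", info_gain=0.02, retrieval_count=9),
    Experience("fix import path", info_gain=0.35, retrieval_count=0),
]
kept = eliminate(pool, min_gain=0.10, min_uses=1)
print([e.text for e in kept])
```

Both thresholds would need tuning per dataset; logging the retained fraction after each batch makes it easy to see whether the pool is stabilising or being pruned too aggressively.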

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Quality (completeness × executability × consistency) | IER-Successive 0.6372 | ECL 0.5775 | +0.0597 (+10.3% rel) | SRDD (avg over dataset) | Table 1: Quality scores | Table 1 |
| Executability | IER-Successive 0.9146 | ECL 0.8643 | +0.0503 | SRDD (avg) | Table 1: Executability scores | Table 1 |
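The Delta column can be reproduced from the reported Value and Baseline figures. The short sketch below uses hypothetical helper names (`quality`, `deltas`); the product form of `quality` follows the metric label above, and whether it is applied per task or to averaged scores is not stated here.

```python
from typing import Tuple


def quality(completeness: float, executability: float, consistency: float) -> float:
    """Composite quality as the product of the three component scores."""
    return completeness * executability * consistency


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, relative delta) of value over baseline."""
    absolute = value - baseline
    return absolute, absolute / baseline


abs_q, rel_q = deltas(0.6372, 0.5775)       # Quality: IER-Successive vs ECL
abs_e, _ = deltas(0.9146, 0.8643)           # Executability: IER-Successive vs ECL
print(f"Quality: {abs_q:+.4f} ({rel_q:+.1%} rel)")   # Quality: +0.0597 (+10.3% rel)
print(f"Executability: {abs_e:+.4f}")                # Executability: +0.0503
```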

What To Try In 7 Days

Run a small task-batch pipeline and log solution→instruction shortcuts during runs.

Implement vector-based retrieval (embeddings + cosine similarity) to reuse shortcuts as few-shot examples.

Experiment with both propagation patterns, successive (last batch only) and cumulative (all history), and compare quality and stability over 3–6 batches.
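For the retrieval step, a minimal sketch in pure Python is shown below. It assumes the embeddings have already been computed by some embedding model and are passed in as plain lists of floats; the function names `cosine` and `retrieve`, the `min_sim` cutoff, and the toy two-dimensional vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    pool: List[Tuple[Sequence[float], str]],
    k: int = 2,
    min_sim: float = 0.0,
) -> List[Tuple[float, str]]:
    """Return up to k (similarity, shortcut) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), text) for vec, text in pool]
    scored = [pair for pair in scored if pair[0] >= min_sim]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" standing in for real model output.
pool = [
    ([1.0, 0.0], "shortcut about file parsing"),
    ([0.0, 1.0], "shortcut about GUI layout"),
    ([1.0, 1.0], "shortcut about parsing GUI config"),
]
for sim, text in retrieve([1.0, 0.0], pool, k=2):
    print(f"{sim:.3f}  {text}")
```

The retrieved shortcuts can then be placed into the prompt as few-shot examples. A linear scan like this is adequate for a pool of a few thousand entries; a dedicated vector index only becomes worthwhile at larger scale.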

Agent Features

Memory
experience pool of shortcuts (solution→instruction and instruction→solution); iterative update (successive or cumulative)
Planning
iterative refinement across batches
Tool Use
vector-based retrieval; external compiler for validation
Frameworks
ChatDev, ECL, ExpeL
Is Agentic

Yes

Architectures
multi-agent (instructive + responsive); batch-wise experience propagation
Collaboration
role-based agent communication (instructor and responder)

Optimization Features

System Optimization
experience elimination reduces pool size and retrieval load
Inference Optimization
vector retrieval to reduce search latency

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

SRDD (referenced from Qian et al. 2023a)

Risks & Boundaries

Limitations

Evaluation uses only ChatGPT-3.5; results may differ with other LLMs.

Benchmark set is SRDD only; domain diversity is limited.

When Not To Use

Tasks needing novel, non-repeated solutions where past shortcuts can mislead.

Safety-critical or auditable code where automated reuse of prior shortcuts is risky.

Failure Modes

Experience pool growth dilutes high-quality experiences (cumulative pattern).

Poor refinements in one batch can propagate and degrade future results (successive pattern).

Core Entities

Models

ChatGPT-3.5, text-embedding-ada-002

Metrics

Completeness, Executability, Consistency, Quality, Duration

Datasets

SRDD