Make LLM-based coding agents learn from and prune past shortcuts to improve code quality and stability

May 7, 2024 · 6 min

Overview

Production Readiness

0.45

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

2

Authors

Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

Iteratively refining and pruning agent experiences cuts noisy guidance, raises code quality by ~10% (relative) on the tested benchmark, and shrinks the stored experience set to ~11.5% of its original size, which lowers storage and retrieval costs.

Summary TLDR

This paper introduces Iterative Experience Refinement (IER), a framework that lets multi-agent, LLM-based software developers iteratively collect, reuse, and prune "shortcut" experiences (solution→instruction and instruction→solution pairs). Two propagation patterns are studied: successive (inherit only the previous batch) and cumulative (inherit all history). A heuristic elimination step keeps high-information and frequently used experiences, shrinking the pool to 11.54% of its original size while improving or maintaining software quality on the SRDD benchmark using ChatGPT-3.5.

Problem Statement

Current experience-enabled LLM agents use a fixed, heuristically collected set of past experiences. That static pool cannot be refined over time, which limits adaptability and lets low-quality or rarely used experiences accumulate and dilute useful guidance.

Main Contribution

Propose Iterative Experience Refinement (IER) to acquire, propagate, and refine agent experiences across task batches.

Define two propagation patterns: successive (inherit from previous batch) and cumulative (inherit from all past batches).

Introduce a heuristic elimination combining information gain and retrieval frequency to keep high-quality experiences and reduce pool size.
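The elimination step can be pictured with a minimal sketch. This summary does not say how the paper combines the two criteria or what values the thresholds take, so the conjunction below, the `Experience` fields, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Experience:
    key: str             # text used for retrieval (an instruction or a solution)
    value: str           # the paired shortcut target
    info_gain: float     # hypothetical precomputed information-gain score
    retrievals: int = 0  # how often this experience has been retrieved


def eliminate(pool, epsilon, theta):
    """Keep experiences with info_gain >= epsilon AND retrievals >= theta.

    Combining the criteria with a logical AND is an assumption of this sketch.
    """
    return [e for e in pool if e.info_gain >= epsilon and e.retrievals >= theta]


# Made-up pool: only the first entry passes both thresholds.
pool = [
    Experience("add login form", "code v2", info_gain=0.9, retrievals=5),
    Experience("fix import", "code v1", info_gain=0.9, retrievals=0),
    Experience("rename var", "code v3", info_gain=0.1, retrievals=7),
]
kept = eliminate(pool, epsilon=0.5, theta=1)
print([e.key for e in kept])
```

In practice the thresholds would likely need tuning per workload, since the Limitations section notes they were heuristic and fixed in the experiments.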

Key Findings

IER improves end-to-end software quality compared to prior experience-based methods on SRDD.

Numbers: Quality: IER-Successive 0.6372 vs ECL 0.5775 (+10.3% relative)

Heuristic elimination concentrates useful experiences and drastically reduces pool size.

Numbers: Experience pool reduced from 8,053 to 930 (11.54% retained)

Successive pattern reaches higher peaks but is less stable; cumulative pattern is more stable over batches.
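The difference between the two propagation patterns comes down to which past batches feed the next one. The following is a rough sketch under that reading; the function name `propagate` and the list-of-lists history layout are hypothetical, not taken from the paper.

```python
def propagate(history, pattern):
    """Return the experiences the next batch inherits.

    history: list of per-batch experience lists, oldest first (assumed layout).
    pattern: "successive" (previous batch only) or "cumulative" (all batches).
    """
    if not history:
        return []
    if pattern == "successive":
        return list(history[-1])
    if pattern == "cumulative":
        return [exp for batch in history for exp in batch]
    raise ValueError(f"unknown pattern: {pattern!r}")


history = [["exp_a"], ["exp_b", "exp_c"]]
print(propagate(history, "successive"))  # ['exp_b', 'exp_c']
print(propagate(history, "cumulative"))  # ['exp_a', 'exp_b', 'exp_c']
```

Seen this way, the stability trade-off is intuitive: the successive pool depends entirely on one batch, so one bad batch changes everything, while the cumulative pool changes slowly but keeps growing.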

Results

Quality (completeness × executability × consistency)

Value: IER-Successive 0.6372

Baseline: ECL 0.5775

Executability

Value: IER-Successive 0.9146

Baseline: ECL 0.8643

Completeness

Value: IER-Successive 0.8744

Baseline: ECL 0.8442

Duration (avg seconds or rounds)

Value: IER-Successive 179.444

Baseline: ECL 122.775
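To put the reported values on a common scale, the relative change of IER-Successive over the ECL baseline can be computed directly from the numbers above. The helper `relative_change` is an illustrative name; the inputs are the figures listed in this section.

```python
def relative_change(value, baseline):
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (IER-Successive, ECL) pairs as reported in the Results section.
results = {
    "Quality": (0.6372, 0.5775),
    "Executability": (0.9146, 0.8643),
    "Completeness": (0.8744, 0.8442),
    "Duration": (179.444, 122.775),
}

for metric, (ier, ecl) in results.items():
    print(f"{metric}: {relative_change(ier, ecl):+.1f}%")
# Quality: +10.3%
# Executability: +5.8%
# Completeness: +3.6%
# Duration: +46.2%
```

The quality gain therefore comes with a noticeably longer average duration, which is worth weighing when estimating cost.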

Who Should Care

What To Try In 7 Days

Run a small task-batch pipeline and log solution→instruction shortcuts during runs.

Implement vector-based retrieval (embeddings + cosine similarity) to reuse shortcuts as few-shot examples.

Experiment with two patterns: successive (only last batch) and cumulative (all history) and compare quality vs stability over 3–6 batches.
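For the retrieval step, a minimal sketch of cosine-similarity lookup might look like the following. The tiny hand-written vectors stand in for real embeddings (the paper uses text-embedding-ada-002), and `retrieve`, `k`, and `min_sim` are hypothetical names chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, pool, k=2, min_sim=0.0):
    """Return up to k (similarity, shortcut) pairs, most similar first.

    pool: list of (embedding, shortcut_text) pairs (assumed layout).
    min_sim: optional floor that drops weak matches before ranking.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for emb, text in pool]
    scored = [item for item in scored if item[0] >= min_sim]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings"; a real pipeline would call an embedding model instead.
pool = [([1.0, 0.0], "shortcut A"), ([0.0, 1.0], "shortcut B"), ([1.0, 1.0], "shortcut C")]
top = retrieve([1.0, 0.0], pool, k=2)
print([text for _, text in top])  # ['shortcut A', 'shortcut C']
```

The retrieved shortcuts would then be placed into the prompt as few-shot examples. A similarity floor such as `min_sim` is one simple guard against pulling in loosely related shortcuts.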

Agent Features

Memory

  • experience pool of shortcuts (solution→instruction and instruction→solution)
  • iterative update (successive or cumulative)

Planning

  • iterative refinement across batches

Tool Use

  • vector-based retrieval
  • external compiler for validation
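
As a rough illustration of compiler-based validation, generated Python source can be passed to the built-in `compile` function to catch syntax errors before anything is run. This is only a stand-in: the summary does not describe the paper's actual validation setup, and a syntax check is weaker than running the software. The function name `is_compilable` is hypothetical.

```python
def is_compilable(source, filename="<generated>"):
    """Return True if the source parses and compiles as a Python module.

    Only syntax is checked; the code is never executed here.
    """
    try:
        compile(source, filename, "exec")
        return True
    except (SyntaxError, ValueError):
        return False


print(is_compilable("def add(a, b):\n    return a + b\n"))  # True
print(is_compilable("def add(a, b:\n    return a + b\n"))   # False
```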

Frameworks

  • ChatDev
  • ECL
  • ExpeL

Is Agentic

true

Architectures

  • multi-agent (instructive + responsive)
  • batch-wise experience propagation

Collaboration

  • role-based agent communication (instructor and responder)

Optimization Features

System Optimization

  • experience elimination reduces pool size and retrieval load

Inference Optimization

  • vector retrieval to reduce search latency

Reproducibility

Data Urls

  • SRDD (referenced from Qian et al. 2023a)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation uses only ChatGPT-3.5; results may differ with other LLMs.
  • Benchmark set is SRDD only; domain diversity is limited.
  • Elimination thresholds (ϵ, θ) are heuristic and fixed in experiments.
  • Successive pattern can amplify poor refinements and become unstable.

When Not To Use

  • Tasks needing novel, non-repeated solutions where past shortcuts can mislead.
  • Safety-critical or auditable code where automated reuse of prior shortcuts is risky.
  • Environments without compute/storage for embeddings and a vector DB.

Failure Modes

  • Experience pool growth dilutes high-quality experiences (cumulative pattern).
  • Poor refinements in one batch can propagate and degrade future results (successive pattern).
  • Reliance on embedding similarity may retrieve semantically wrong shortcuts.

Core Entities

Models

  • ChatGPT-3.5
  • text-embedding-ada-002

Metrics

  • Completeness
  • Executability
  • Consistency
  • Quality
  • Duration

Datasets

  • SRDD