Overview
The method is practical and improves average software quality on SRDD with ChatGPT-3.5, but it is evaluated on a single dataset and a single model, so expect further engineering and validation before production use.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 2/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 45%
Production readiness: 45%
Novelty: 60%
Why It Matters For Business
Iteratively refining and pruning agent experiences cuts noisy guidance, raises the composite software-quality score by about 10% (relative) on the tested benchmark, and shrinks the stored experience set to about 11.5% of its original size, which can lower storage and retrieval costs.
Summary TLDR
This paper introduces Iterative Experience Refinement (IER), a framework that lets multi-agent, LLM-based software-development agents iteratively collect, reuse, and prune "shortcut" experiences (solution→instruction and instruction→solution pairs). Two propagation patterns are studied: successive (inherit only from the previous batch) and cumulative (inherit from all past batches). A heuristic elimination step keeps high-information and frequently used experiences, shrinking the pool to 11.54% of its original size while improving or maintaining software quality on the SRDD benchmark with ChatGPT-3.5.
Problem Statement
Current experience-enabled LLM agents use a fixed, heuristically collected set of past experiences. That static pool cannot be refined over time, which limits adaptability and lets low-quality or rarely used experiences accumulate and dilute useful guidance.
Main Contribution
Propose Iterative Experience Refinement (IER) to acquire, propagate, and refine agent experiences across task batches.
Define two propagation patterns: successive (inherit from previous batch) and cumulative (inherit from all past batches).
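The two propagation patterns differ only in which past batches feed the next one. A minimal sketch of that difference, assuming experiences are plain strings and that `inherited_pool` is a hypothetical helper (neither comes from the paper):

```python
from typing import List

# Hypothetical stand-in: a real experience would be a shortcut record,
# not a bare string.
Experience = str


def inherited_pool(batch_history: List[List[Experience]], pattern: str) -> List[Experience]:
    """Return the experiences the next task batch starts from.

    batch_history[i] holds the experiences collected during batch i.
    """
    if not batch_history:
        return []
    if pattern == "successive":
        # Inherit only from the most recent batch.
        return list(batch_history[-1])
    if pattern == "cumulative":
        # Inherit from every past batch, oldest first.
        return [exp for batch in batch_history for exp in batch]
    raise ValueError(f"unknown pattern: {pattern!r}")


history = [["exp_a", "exp_b"], ["exp_c"]]
successive = inherited_pool(history, "successive")
cumulative = inherited_pool(history, "cumulative")
```

With this toy history, the successive pattern sees only the last batch while the cumulative pattern sees all three experiences, which is why the cumulative pool grows with every batch.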
Key Findings
IER improves end-to-end software quality compared to prior experience-based methods on SRDD.
Heuristic elimination concentrates useful experiences and reduces the pool to 11.54% of its original size.
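The elimination step can be pictured as a filter over the experience pool. The sketch below is an assumption-laden illustration, not the paper's procedure: the `info_gain` score, the `use_count` field, both thresholds, and the choice to keep an experience when either criterion holds are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    key: str
    info_gain: float  # hypothetical information score for this shortcut
    use_count: int    # how often the shortcut was retrieved and used


def eliminate(pool: List[Experience], min_gain: float = 0.9, min_uses: int = 2) -> List[Experience]:
    """Keep experiences that are high-information or frequently used.

    Both thresholds are illustrative defaults, not values from the paper.
    """
    return [e for e in pool if e.info_gain >= min_gain or e.use_count >= min_uses]


pool = [
    Experience("informative", info_gain=0.95, use_count=0),
    Experience("popular", info_gain=0.40, use_count=5),
    Experience("noise", info_gain=0.10, use_count=0),
]
kept = eliminate(pool)
kept_keys = [e.key for e in kept]
```

In this toy pool the low-information, never-used entry is dropped and the other two survive. Real thresholds would need tuning against the quality metric.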
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Quality (completeness×executability×consistency) | IER-Successive 0.6372 | ECL 0.5775 | +0.0597 (+10.3% rel) | SRDD (avg over dataset) | Table 1: Quality scores | Table 1 |
| Executability | IER-Successive 0.9146 | ECL 0.8643 | +0.0503 | SRDD (avg) | Table 1: Executability scores | Table 1 |
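
The Delta column can be re-derived from the Value and Baseline columns. A short check using only the numbers in the table (the helper name `deltas` is illustrative):

```python
def deltas(value: float, baseline: float):
    """Return (absolute delta, relative delta) of value against baseline."""
    absolute = value - baseline
    relative = absolute / baseline
    return absolute, relative


# Quality: IER-Successive vs ECL
abs_q, rel_q = deltas(0.6372, 0.5775)
# Executability: IER-Successive vs ECL
abs_e, _ = deltas(0.9146, 0.8643)

print(f"Quality: {abs_q:+.4f} ({rel_q:+.1%} rel)")
print(f"Executability: {abs_e:+.4f}")
```

The relative figure divides the absolute gain by the baseline score, so the 10.3% is a relative improvement over ECL, not a 10-point jump in the score itself.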
What To Try In 7 Days
Run a small task-batch pipeline and log solution→instruction shortcuts during runs.
Implement vector-based retrieval (embeddings + cosine similarity) to reuse shortcuts as few-shot examples.
Experiment with two patterns: successive (only last batch) and cumulative (all history) and compare quality vs stability over 3–6 batches.
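For the retrieval step, a dependency-free sketch of cosine-similarity lookup is shown below. The toy two-dimensional vectors stand in for real embeddings, and `retrieve_top_k` and the pool layout are hypothetical names chosen for illustration. In practice the vectors would come from an embedding model of your choice.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: Sequence[float],
    pool: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[str]:
    """Return the k shortcut texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for vec, text in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy pool: (embedding, shortcut text). Vectors are made up for the example.
shortcut_pool = [
    ([1.0, 0.0], "shortcut about file parsing"),
    ([0.0, 1.0], "shortcut about GUI layout"),
    ([1.0, 1.0], "shortcut about parsing GUI config"),
]
few_shot = retrieve_top_k([1.0, 0.1], shortcut_pool, k=2)
```

The returned texts can then be placed in the prompt as few-shot examples. Swapping the linear scan for a vector index only matters once the pool is large.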
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Evaluation uses only ChatGPT-3.5; results may differ with other LLMs.
Benchmark set is SRDD only; domain diversity is limited.
When Not To Use
Tasks needing novel, non-repeated solutions where past shortcuts can mislead.
Safety-critical or auditable code where automated reuse of prior shortcuts is risky.
Failure Modes
Experience pool growth dilutes high-quality experiences (cumulative pattern).
Poor refinements in one batch can propagate and degrade future results (successive pattern).

