Teach coding agents from past runs: extract and reuse 'shortcuts' to speed multi-agent software development

Overview

Decision Snapshot: Ready For Pilot

The method gives a clear, practical recipe (task graph, shortcut extraction, retrieval) and shows strong empirical gains on SRDD, but real-world production readiness is limited by the narrow evaluation scope and by reliance on compile checks as the main correctness signal.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Reusing vetted past fixes reduces developer iteration time and increases the chance generated prototypes are runnable, cutting manual triage and speeding prototyping.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper introduces Experiential Co-Learning: a two-role (instructor, assistant) multi-agent framework that records multi-step agent interactions as task graphs, extracts high-value non-adjacent transitions called "shortcuts" using compile and similarity signals, and retrieves those experiences as few-shot examples during future reasoning. On the SRDD software-requirement dataset, this approach raises a holistic quality metric from 0.4267 to 0.7304 and shortens development time versus strong multi-agent baselines. Code and data are available at the project's GitHub.

Problem Statement

Multi-agent coding systems treat each new task independently, causing repeated mistakes and wasted iterations because past cross-task experience is not captured or reused. The paper tackles how to design, collect and apply reusable experiences to make agent collaboration faster and more reliable.

Main Contribution

Proposes Experiential Co-Learning: co-tracking, co-memorizing, co-reasoning to collect and reuse agent experiences.

Introduces task-execution graphs and heuristically extracts non-adjacent 'shortcuts', filtered by compile success and similarity signals, as the key experiences to store.
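The shortcut idea can be illustrated with a minimal sketch. It assumes a run is a linear list of code snapshots, that `compiles` and `gain` are caller-supplied stand-ins for the compile check and the similarity signal, and that `min_gain` is a hypothetical threshold. None of these names come from the paper, and the paper's actual filtering rule may differ.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def extract_shortcuts(
    snapshots: List[str],
    compiles: Callable[[str], bool],
    gain: Callable[[str, str], float],
    min_gain: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of non-adjacent snapshots worth keeping."""
    shortcuts = []
    for i, j in combinations(range(len(snapshots)), 2):
        if j - i < 2:
            continue  # adjacent snapshots are already linked by an ordinary edge
        if not compiles(snapshots[j]):
            continue  # assumed filter: the shortcut must end in compilable code
        if gain(snapshots[i], snapshots[j]) > min_gain:
            shortcuts.append((i, j))
    return shortcuts


# Toy run: the second snapshot is broken, the others are fine.
run = ["v0", "v1 broken", "v2", "v3"]
pairs = extract_shortcuts(
    run,
    compiles=lambda code: "broken" not in code,  # hypothetical compile check
    gain=lambda src, dst: 1.0,                   # hypothetical similarity gain
)
```

In this toy run every non-adjacent pair ends in a compilable snapshot, so all three candidates are kept; a real `gain` function would prune pairs that do not move the code closer to the requirement.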

Key Findings

Experience reuse raises the holistic software quality metric by roughly 71% relative to a strong multi-agent baseline (ChatDev).

Numbers: Quality 0.4267 -> 0.7304 (test set)

Practical Use: Add a small experience store and retrieval step to multi-agent pipelines to raise end-to-end software quality and reduce manual fixes.

Evidence Ref: Table 1

Completeness and executability improve substantially when agents reuse shortcuts.

Numbers: Completeness 0.6131 -> 0.9497; Executability 0.88 -> 0.965

Practical Use: Reusing past validated code fragments increases the chance that generated projects are complete and compile immediately.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Quality	0.7304	ChatDev 0.4267	+0.3037	SRDD test set	Co-Learning quality 0.7304 vs ChatDev 0.4267	Table 1
Completeness	0.9497	ChatDev 0.6131	+0.3366	SRDD test set	Higher percentage of code without TODOs	Table 1
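As a quick arithmetic check, the deltas in the table can be recomputed from the reported values and baselines. The sketch below uses only the two rows shown; the helper name `delta_and_relative` is illustrative, and the relative gain is a derived figure rather than one reported in the table.

```python
from typing import Tuple

# (metric, Co-Learning value, ChatDev baseline) copied from the rows above
rows = [
    ("Quality", 0.7304, 0.4267),
    ("Completeness", 0.9497, 0.6131),
]


def delta_and_relative(value: float, baseline: float) -> Tuple[float, float]:
    """Return the absolute delta and the gain relative to the baseline."""
    delta = value - baseline
    return delta, delta / baseline


for name, value, baseline in rows:
    delta, relative = delta_and_relative(value, baseline)
    print(f"{name}: delta={delta:+.4f}, relative gain={relative:.1%}")
```

Both recomputed deltas agree with the Delta column, and the Quality gain works out to roughly 71% over the baseline.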

What To Try In 7 Days

Log agent instruction/solution pairs during multi-turn runs.

Build a simple deduplicated task graph using a hash of code snapshots.

Keep shortcuts that compile and match requirements; store as key-value experiences (instruction->solution and solution->instruction).
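The three steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the class and function names are hypothetical, SHA-256 is one arbitrary choice of hash, and the built-in `compile()` is a syntax-only stand-in for a real build step that works for Python sources only. The "match requirements" filter is omitted because it needs an embedding or judge model.

```python
import hashlib
from typing import Dict, List, Tuple


def snapshot_id(code: str) -> str:
    """Hash a code snapshot so identical versions collapse into one node."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def compiles(code: str) -> bool:
    """Syntax-only check for Python source; a stand-in for a real build step."""
    try:
        compile(code, "<snapshot>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


class TaskGraph:
    """Deduplicated record of one multi-turn run."""

    def __init__(self) -> None:
        self.nodes: Dict[str, str] = {}              # node id -> code snapshot
        self.edges: List[Tuple[str, str, str]] = []  # (source id, instruction, target id)

    def log_step(self, before: str, instruction: str, after: str) -> None:
        src, dst = snapshot_id(before), snapshot_id(after)
        self.nodes.setdefault(src, before)
        self.nodes.setdefault(dst, after)
        if src != dst:  # ignore turns that left the code unchanged
            self.edges.append((src, instruction, dst))


class ExperiencePool:
    """Two key-value stores, one per lookup direction."""

    def __init__(self) -> None:
        self.instruction_to_solution: Dict[str, str] = {}
        self.solution_to_instruction: Dict[str, str] = {}

    def add(self, instruction: str, solution: str) -> bool:
        if not compiles(solution):
            return False  # drop shortcuts whose target does not compile
        self.instruction_to_solution[instruction] = solution
        self.solution_to_instruction[solution] = instruction
        return True


graph = TaskGraph()
graph.log_step("", "create a greeting script", "print('hi')")
graph.log_step("print('hi')", "greet by name", "name = 'Ada'\nprint('hi', name)")
graph.log_step("name = 'Ada'\nprint('hi', name)", "simplify", "print('hi')")

pool = ExperiencePool()
pool.add("create a greeting script", "print('hi')")
pool.add("broken attempt", "print('hi'")  # rejected: unbalanced parenthesis
```

Because the third step returns to an earlier snapshot, the graph holds three nodes rather than four; that revisit is exactly the kind of backtracking a shortcut is meant to skip.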

Agent Features

Memory

experience pools (key-value shortcut memories)

Planning

multi-turn planning via iterative instruction-solution cycles

Tool Use

external compiler and code checker

embedding-based retrieval

Frameworks

co-tracking

co-memorizing

co-reasoning

Is Agentic

Yes

Architectures

two-role instructor-assistant multi-agent

Collaboration

role-based multi-turn communication

few-shot example exchange
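One way to picture the few-shot exchange is a prompt builder that places retrieved experiences ahead of the current instruction. The template below is an assumption made for illustration; the function name and the exact wording of the prompt are hypothetical and do not reflect the prompts used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(task: str, examples: List[Tuple[str, str]]) -> str:
    """Prepend retrieved (instruction, solution) pairs to the current task."""
    parts = []
    for n, (instruction, solution) in enumerate(examples, start=1):
        parts.append(
            f"Example {n}\nInstruction: {instruction}\nSolution:\n{solution}"
        )
    # The current task comes last so the model completes its solution.
    parts.append(f"Current task\nInstruction: {task}\nSolution:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    "add input validation to the form",
    [("add a submit button", "button = Button(text='Submit')")],
)
```

Keeping the examples in a fixed, labeled layout makes it easy to cap how many are included and therefore how much context they consume.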

Optimization Features

Token Efficiency

use of single best code example (k_code=1) reduces context size

Training Optimization

heuristic shortcut selection to focus useful examples

Inference Optimization

retrieve top-k experiences to build in-context examples
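Top-k retrieval can be sketched with plain cosine similarity over precomputed embedding vectors. The sketch assumes embeddings are already available as lists of floats (for example, from an embedding model) and that the pool is small enough to scan linearly; the function names and the toy two-dimensional vectors are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: Sequence[float],
    pool: List[Tuple[Sequence[float], str]],
    k: int = 1,
) -> List[str]:
    """Return the values of the k pool entries most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query, item[0]), reverse=True)
    return [value for _, value in ranked[:k]]


# Toy pool of (embedding, stored experience) pairs.
pool = [
    ([1.0, 0.0], "experience a"),
    ([0.0, 1.0], "experience b"),
    ([1.0, 1.0], "experience c"),
]
best = top_k([1.0, 0.1], pool, k=2)
```

Setting `k=1` returns only the single closest experience, which keeps the added context as small as possible.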

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/OpenBMB/ChatDev

Data URLs

https://github.com/OpenBMB/ChatDev

Risks & Boundaries

Limitations

Agents tend to implement simple logic; suitable for prototypes not full production systems.

Evaluation uses SRDD and compile-based checks; lacks broad real-world validation.

When Not To Use

For safety-critical or production systems without human review.

When requirements are vague or require complex domain reasoning.

Failure Modes

Solution backtracking and correct-to-failure degeneration if shortcuts are noisy.

Over-reliance on past experiences can repeat past mistakes on novel tasks.

Core Entities

Models

GPT-3.5-Turbo

text-embedding-ada-002

GPT-4 (evaluator)

Metrics

Completeness

Executability

Consistency

Quality (product of three metrics)

Duration (s)
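Since Quality is described as the product of the three component metrics, the relationship can be written out directly. The sketch below assumes the product is exact and unweighted; under that assumption it also backs out the Consistency value implied by the reported Quality, Completeness, and Executability figures. Those implied values are derived here, not quoted from the summary.

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Holistic quality as the plain product of the three component metrics."""
    return completeness * executability * consistency


def implied_consistency(
    quality_score: float, completeness: float, executability: float
) -> float:
    """Solve quality = completeness * executability * consistency for consistency."""
    return quality_score / (completeness * executability)


# Figures reported in the Key Findings section of this summary.
co_learning = implied_consistency(0.7304, 0.9497, 0.965)
baseline = implied_consistency(0.4267, 0.6131, 0.88)
print(f"implied consistency: co-learning={co_learning:.3f}, baseline={baseline:.3f}")
```

Under this assumption the implied Consistency is close to 0.8 for both systems, which suggests that most of the Quality gain comes from Completeness and Executability.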
