Teach coding agents from past runs: extract and reuse 'shortcuts' to speed multi-agent software development

December 28, 2023 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Zihao Xie, Yifei Wang, Weize Chen, Cheng Yang, Xin Cong, Xiaoyin Che, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

Reusing vetted past fixes shortens developer iteration and raises the chance that generated prototypes run, which cuts manual triage and speeds prototyping.

Summary TLDR

The paper introduces Experiential Co-Learning: a two-role (instructor, assistant) multi-agent framework that records multi-step agent interactions as task graphs, extracts high-value non-adjacent transitions called "shortcuts" using compile and similarity signals, and retrieves those experiences as few-shot examples during future reasoning. On the SRDD software-requirement dataset, this approach raises a holistic quality metric from 0.4267 to 0.7304 and shortens development time versus strong multi-agent baselines. Code and data are available at the project's GitHub.

Problem Statement

Multi-agent coding systems treat each new task independently, causing repeated mistakes and wasted iterations because past cross-task experience is not captured or reused. The paper tackles how to design, collect and apply reusable experiences to make agent collaboration faster and more reliable.

Main Contribution

Proposes Experiential Co-Learning, which combines co-tracking, co-memorizing, and co-reasoning to collect and reuse agent experiences.

Introduces task-execution graphs and heuristically extracts non-adjacent 'shortcuts', filtered by compile success and similarity signals, as key experiences.

Shows empirical gains on SRDD tasks, with ablations that quantify the roles of instructor and assistant experience and a sensitivity analysis of retrieval hyperparameters.
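The shortcut extraction named above can be illustrated with a minimal sketch. It assumes a linear execution chain whose nodes already carry a compile result and a requirement-similarity score. The `Node` schema, the `node_score` heuristic, and the `min_gain` threshold are hypothetical choices for illustration, not the paper's exact formulation.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Node:
    """One solution state in a task-execution chain (hypothetical schema)."""
    index: int         # position in the chain, 0 = initial state
    compiles: bool     # result of an external compile check
    similarity: float  # similarity of this solution to the requirement, in [0, 1]


def node_score(node: Node) -> float:
    """Assumed heuristic: a solution that fails to compile scores zero."""
    return node.similarity if node.compiles else 0.0


def extract_shortcuts(chain: list[Node], min_gain: float = 0.1) -> list[tuple[int, int, float]]:
    """Return (source, target, gain) for non-adjacent pairs whose gain exceeds min_gain."""
    shortcuts = []
    for src, dst in combinations(chain, 2):
        if dst.index - src.index < 2:  # adjacent nodes are ordinary edges, not shortcuts
            continue
        gain = node_score(dst) - node_score(src)
        if gain > min_gain:
            shortcuts.append((src.index, dst.index, round(gain, 4)))
    return shortcuts


chain = [
    Node(0, compiles=False, similarity=0.20),
    Node(1, compiles=True, similarity=0.45),
    Node(2, compiles=False, similarity=0.50),
    Node(3, compiles=True, similarity=0.80),
]
shortcuts = extract_shortcuts(chain)
```

In this toy chain, the pair (0, 2) is dropped because node 2 does not compile, while (0, 3) and (1, 3) survive as shortcuts that skip the failed intermediate state.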

Key Findings

Experience reuse raises the holistic software quality metric by about 71% relative to a strong multi-agent baseline.

Numbers: Quality 0.4267 -> 0.7304 (test set)

Completeness and executability improve substantially when agents reuse shortcuts.

Numbers: Completeness 0.6131 -> 0.9497; Executability 0.88 -> 0.965

Assistant-side experience contributes more than instructor-side experience: removing it causes the larger quality drop.

Numbers: Remove assistant: Quality 0.5305; remove instructor: Quality 0.6840; remove both: Quality 0.4267

Reusing shortcuts reduces iterations and wall time compared to a strong multi-agent system.

Numbers: Avg duration 148.215 s -> 122.775 s; avg nodes 3.35 -> 2.31; avg edges 3.885 -> 3.01 (baseline -> Co-Learning)

Results

Quality: value 0.7304; baseline (ChatDev) 0.4267

Completeness: value 0.9497; baseline (ChatDev) 0.6131

Executability: value 0.9650; baseline (ChatDev) 0.88

Consistency: value 0.7970; baseline (ChatDev) 0.7909

Duration (s): value 122.775; baseline (ChatDev) 148.215

Who Should Care

What To Try In 7 Days

Log agent instruction/solution pairs during multi-turn runs.

Build a simple deduplicated task graph using a hash of code snapshots.

Keep shortcuts that compile and match requirements; store as key-value experiences (instruction->solution and solution->instruction).
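The first two steps above could be prototyped roughly as follows. This is a sketch under assumptions: the `TaskGraph` class and its field names are hypothetical, and MD5 is used only as a cheap content fingerprint for deduplication, not for security.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def snapshot_id(code: str) -> str:
    """Hash a code snapshot so identical solutions map to the same node."""
    return hashlib.md5(code.encode("utf-8")).hexdigest()


@dataclass
class TaskGraph:
    """Deduplicated graph of logged instruction/solution steps (hypothetical layout)."""
    nodes: dict[str, str] = field(default_factory=dict)              # node id -> code
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src id, instruction, dst id)

    def add_step(self, prev_code: str, instruction: str, new_code: str) -> None:
        src, dst = snapshot_id(prev_code), snapshot_id(new_code)
        self.nodes.setdefault(src, prev_code)  # no-op if this snapshot was seen before
        self.nodes.setdefault(dst, new_code)
        self.edges.append((src, instruction, dst))


graph = TaskGraph()
v0 = ""
v1 = "print('hello')"
v2 = "print('hello world')"
graph.add_step(v0, "Write a greeting script", v1)
graph.add_step(v1, "Greet the whole world", v2)
graph.add_step(v2, "Revert to the shorter greeting", v1)  # backtracks to an existing node
```

Because the third step returns to a snapshot that was already logged, the graph ends with three nodes and three edges rather than four nodes, which makes backtracking visible as a cycle.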

Agent Features

Memory

  • experience pools (key-value shortcut memories)
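A key-value experience pool can be as simple as two dictionaries filled from the same shortcut. The sketch below is illustrative only: the class name, method name, and the mapping of each direction to a role (assistant reads instruction->solution, instructor reads solution->instruction) are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExperiencePools:
    """Two key-value memories built from the same shortcuts (hypothetical layout)."""
    instruction_to_solution: dict[str, str] = field(default_factory=dict)
    solution_to_instruction: dict[str, str] = field(default_factory=dict)

    def memorize(self, source_solution: str, instruction: str, target_solution: str) -> None:
        # Assumed assistant-side view: given an instruction, which solution worked?
        self.instruction_to_solution[instruction] = target_solution
        # Assumed instructor-side view: given a solution state, which instruction helped?
        self.solution_to_instruction[source_solution] = instruction


pools = ExperiencePools()
pools.memorize(
    source_solution="def area(r): return r * r",
    instruction="Multiply by pi so the function returns a circle area",
    target_solution="import math\ndef area(r): return math.pi * r * r",
)
```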

Planning

  • multi-turn planning via iterative instruction-solution cycles

Tool Use

  • external compiler and code checker
  • embedding-based retrieval

Frameworks

  • co-tracking
  • co-memorizing
  • co-reasoning

Is Agentic

true

Architectures

  • two-role instructor-assistant multi-agent

Collaboration

  • role-based multi-turn communication
  • few-shot example exchange

Optimization Features

Token Efficiency

  • use of single best code example (k_code=1) reduces context size

Training Optimization

  • heuristic shortcut selection to focus on useful examples

Inference Optimization

  • retrieve top-k experiences to build in-context examples
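Top-k retrieval for in-context examples can be sketched as a similarity ranking over stored keys. To stay self-contained, the example below uses a toy bag-of-words `embed` function as a stand-in for a real embedding model; `retrieve_top_k`, the pool contents, and the prompt layout are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, pool: dict[str, str], k: int = 1) -> list[tuple[str, str]]:
    """Return the k (key, value) experiences whose keys are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool.items(), key=lambda kv: cosine(q, embed(kv[0])), reverse=True)
    return ranked[:k]


pool = {
    "add a restart button to the game": "def restart(): ...",
    "fix the import error in main": "import tkinter as tk",
    "save high scores to a file": "def save_scores(scores): ...",
}
examples = retrieve_top_k("add a pause button to the game", pool, k=1)
prompt = "\n\n".join(f"Instruction: {i}\nSolution: {s}" for i, s in examples)
```

With `k=1`, only the closest stored experience (the restart-button instruction) is placed in the prompt, which keeps the added context small.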

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Agents tend to implement simple logic; suitable for prototypes, not full production systems.
  • Evaluation uses SRDD and compile-based checks; lacks broad real-world validation.
  • Consistency metric depends on coarse embeddings and may miss subtle requirement mismatches.
  • Manual verification remains necessary for general-purpose software.

When Not To Use

  • For safety-critical or production systems without human review.
  • When requirements are vague or require complex domain reasoning.
  • When software behavior depends on external nondeterministic services.

Failure Modes

  • Solution backtracking and correct-to-failure degeneration if shortcuts are noisy.
  • Over-reliance on past experiences can repeat past mistakes on novel tasks.
  • Retrieval mismatch: an irrelevant retrieved example degrades reasoning.

Core Entities

Models

  • GPT-3.5-Turbo
  • text-embedding-ada-002
  • GPT-4 (evaluator)

Metrics

  • Completeness
  • Executability
  • Consistency
  • Quality (product of three metrics)
  • Duration (s)
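Since Quality is listed as the product of the three other metrics, the reported figures can be cross-checked with a few lines of arithmetic. The function name below is a hypothetical helper; the inputs are the values reported in the Results section.

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Holistic quality as the product of the three component metrics."""
    return completeness * executability * consistency


co_learning = quality(0.9497, 0.9650, 0.7970)
chatdev = quality(0.6131, 0.88, 0.7909)
print(round(co_learning, 4), round(chatdev, 4))  # prints: 0.7304 0.4267
```

Both products match the reported Quality values, and their ratio is roughly 1.71. A product-style metric also means a weak score on any single component pulls the overall quality down sharply.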

Datasets

  • SRDD (1,200 software requirements)

Context Entities

Algorithms

  • MD5 (hashing for deduplication)