Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Reusing vetted past fixes reduces developer iteration time and increases the chance generated prototypes are runnable, cutting manual triage and speeding prototyping.
Summary TLDR
The paper introduces Experiential Co-Learning: a two-role (instructor, assistant) multi-agent framework that records multi-step agent interactions as task graphs, extracts high-value non-adjacent transitions called "shortcuts" using compile and similarity signals, and retrieves those experiences as few-shot examples during future reasoning. On the SRDD software-requirement dataset, this approach raises a holistic quality metric from 0.4267 to 0.7304 and shortens development time versus strong multi-agent baselines. Code and data are available at the project's GitHub.
Problem Statement
Multi-agent coding systems treat each new task independently, causing repeated mistakes and wasted iterations because past cross-task experience is not captured or reused. The paper tackles how to design, collect and apply reusable experiences to make agent collaboration faster and more reliable.
Main Contribution
Proposes Experiential Co-Learning: co-tracking, co-memorizing, co-reasoning to collect and reuse agent experiences.
Introduces task-execution graphs and extracts heuristic non-adjacent 'shortcuts' (compile + similarity filtered) as key experiences.
Shows empirical gains on SRDD tasks, with ablations that quantify the roles of instructor versus assistant experience and the sensitivity to retrieval hyperparameters.
Key Findings
Experience reuse raises the holistic software quality metric from 0.4267 to 0.7304, roughly a 1.7x gain over a strong multi-agent baseline.
Completeness and executability improve substantially when agents reuse shortcuts.
Assistant experience matters more than instructor-only experience.
Reusing shortcuts reduces iterations and wall time compared to a strong multi-agent system.
Results
Quality
Completeness
Executability
Consistency
Duration (s)
Who Should Care
What To Try In 7 Days
Log agent instruction/solution pairs during multi-turn runs.
Build a simple deduplicated task graph using a hash of code snapshots.
Keep shortcuts that compile and match requirements; store as key-value experiences (instruction->solution and solution->instruction).
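The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`snapshot_id`, `build_task_graph`, `candidate_shortcuts`) are hypothetical, and `is_acceptable` is a caller-supplied stand-in for the compile and requirement-match checks. Generating the instruction text that labels each shortcut is not shown.

```python
import hashlib
from itertools import combinations


def snapshot_id(code: str) -> str:
    """MD5 digest of a code snapshot, used only to deduplicate graph nodes."""
    return hashlib.md5(code.encode("utf-8")).hexdigest()


def build_task_graph(steps):
    """Build a deduplicated task graph from logged (instruction, code) pairs.

    Returns (nodes, edges): nodes maps snapshot id -> code in first-seen order,
    edges is a list of (source id, instruction, destination id).
    """
    start = snapshot_id("")  # assumption: runs start from an empty codebase
    nodes = {start: ""}
    edges = []
    prev = start
    for instruction, code in steps:
        cur = snapshot_id(code)
        nodes.setdefault(cur, code)  # identical snapshots collapse to one node
        if cur != prev:
            edges.append((prev, instruction, cur))
        prev = cur
    return nodes, edges


def candidate_shortcuts(nodes, edges, is_acceptable):
    """Return (source code, target code) pairs for non-adjacent nodes.

    A pair is kept only if the later node passes is_acceptable, a hypothetical
    placeholder for "compiles and matches the requirement".
    """
    adjacent = {(s, d) for s, _, d in edges} | {(d, s) for s, _, d in edges}
    order = list(nodes)  # dicts preserve first-seen order
    return [
        (nodes[a], nodes[b])
        for a, b in combinations(order, 2)
        if (a, b) not in adjacent and is_acceptable(nodes[b])
    ]


# Toy run: the third step reverts to snapshot "A", so "A" appears once in the graph.
steps = [("write main", "A"), ("add feature", "B"), ("revert", "A"), ("fix bug", "C")]
nodes, edges = build_task_graph(steps)
shortcuts = candidate_shortcuts(nodes, edges, is_acceptable=lambda code: code == "C")
```

Each surviving pair could then be stored twice, once keyed by instruction and once keyed by solution, to give the two lookup directions named above.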
Agent Features
Memory
- experience pools (key-value shortcut memories)
Planning
- multi-turn planning via iterative instruction-solution cycles
Tool Use
- external compiler and code checker
- embedding-based retrieval
Frameworks
- co-tracking
- co-memorizing
- co-reasoning
Is Agentic
true
Architectures
- two-role instructor-assistant multi-agent
Collaboration
- role-based multi-turn communication
- few-shot example exchange
Optimization Features
Token Efficiency
- retrieving only the single best code example (k_code=1) keeps the context small
Training Optimization
- heuristic shortcut selection to focus useful examples
Inference Optimization
- retrieve top-k experiences to build in-context examples
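Top-k retrieval over an experience pool can be illustrated with plain cosine similarity. The sketch below assumes each stored experience is a (key vector, value) pair whose key has already been embedded; the helper names and the toy two-dimensional vectors are hypothetical, and a real setup would use vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec, pool, k=1):
    """Return the values of the k pool entries whose keys best match the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [value for _, value in ranked[:k]]


# Toy pool: keys are made-up embeddings, values are the stored experiences.
pool = [([1.0, 0.0], "a"), ([0.0, 1.0], "b"), ([1.0, 1.0], "c")]
examples = retrieve_top_k([1.0, 0.1], pool, k=2)
```

The retrieved values would then be placed in the prompt as few-shot examples; a small k keeps the added context short.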
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Agents tend to implement simple logic; suitable for prototypes not full production systems.
- Evaluation uses SRDD and compile-based checks; lacks broad real-world validation.
- Consistency metric depends on coarse embeddings and may miss subtle requirement mismatches.
- Manual verification remains necessary for general-purpose software.
When Not To Use
- For safety-critical or production systems without human review.
- When requirements are vague or require complex domain reasoning.
- When software behavior depends on external nondeterministic services.
Failure Modes
- Solution backtracking and correct-to-failure degeneration if shortcuts are noisy.
- Over-reliance on past experiences can repeat past mistakes on novel tasks.
- Retrieval mismatch: an irrelevant retrieved example degrades reasoning.
Core Entities
Models
- GPT-3.5-Turbo
- text-embedding-ada-002
- GPT-4 (evaluator)
Metrics
- Completeness
- Executability
- Consistency
- Quality (product of Completeness, Executability, and Consistency)
- Duration (s)
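Because Quality is a product, a low score on any one component pulls the overall score down sharply. A small sketch, assuming each component score lies in [0, 1] (the function name and the input values are illustrative, not results from the paper):

```python
def quality(completeness: float, executability: float, consistency: float) -> float:
    """Holistic quality as the product of the three component scores."""
    for name, value in (
        ("completeness", completeness),
        ("executability", executability),
        ("consistency", consistency),
    ):
        # Assumption: each component is normalized to the range [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return completeness * executability * consistency


# Illustrative inputs only: perfect completeness cannot offset weak executability.
score = quality(1.0, 0.5, 0.8)
```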
Datasets
- SRDD (1,200 software requirements)
Context Entities
Models
- MD5 (hashing for deduplication)

