Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Co-Saving can cut token bills and developer compute costs by reusing prior multi-agent transitions, while keeping or improving code quality on similar tasks, so teams can scale automated software generation under a fixed budget.
Summary TLDR
Co-Saving adds a small memory of past successful agent interactions (called "shortcuts") to multi-agent software-development systems. It ranks shortcuts by value vs cost (time and token usage), applies a dynamic emergency factor tied to remaining budget, and forces termination when interaction cost hits reference limits. On the SRDD software tasks, Co-Saving reports a large cut in token use and higher overall code quality versus prior multi-agent systems, while ablations show shortcut selection and the emergency factor materially affect success and budget completion.
Problem Statement
Multi-agent systems for software development produce good results but often waste tokens and time through redundant interactions. The paper aims to make multi-agent collaboration resource-aware so agents can reuse prior successful transitions to save tokens/time while keeping or improving code quality.
Main Contribution
Introduce "shortcuts": instruction fragments mined from historical multi-agent trajectories that connect non-adjacent solution states and can bypass redundant reasoning steps.
Design a value-versus-cost scoring and filtering pipeline (using normalized time and token costs and a harmonic mean), plus an "emergency factor" that weights cost more heavily as the budget depletes.
Integrate shortcut retrieval into an existing multi-agent software-dev pipeline and show empirical gains on the SRDD dataset versus single- and multi-agent baselines.
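To make the value-versus-cost ranking concrete, here is a minimal sketch. It is not the paper's formula, which this summary does not give. It assumes that time and token costs are min-max normalized across candidates, that the two "cheapness" terms are combined with a harmonic mean, and that the emergency factor is simply the fraction of budget already spent. The names `Shortcut`, `emergency_factor`, and `rank_shortcuts` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shortcut:
    name: str
    value: float   # assumed usefulness score in [0, 1]
    time_s: float  # assumed time cost in seconds
    tokens: int    # assumed token cost


def normalize(xs: List[float]) -> List[float]:
    """Min-max normalize to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def emergency_factor(remaining: float, total: float) -> float:
    """Assumed form: fraction of budget spent, 0 when full, 1 when empty.
    Requires total > 0."""
    return 1.0 - max(0.0, min(1.0, remaining / total))


def rank_shortcuts(cands: List[Shortcut], remaining: float, total: float) -> List[Shortcut]:
    """Sort candidates best-first; cost matters more as the budget depletes."""
    t_norm = normalize([c.time_s for c in cands])
    k_norm = normalize([float(c.tokens) for c in cands])
    alpha = emergency_factor(remaining, total)
    scored = []
    for c, tn, kn in zip(cands, t_norm, k_norm):
        cheapness = harmonic_mean(1 - tn, 1 - kn)  # high when both costs are low
        score = (1 - alpha) * c.value + alpha * cheapness
        scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


if __name__ == "__main__":
    cands = [Shortcut("thorough", 0.9, 100.0, 5000), Shortcut("quick", 0.6, 20.0, 1000)]
    print([c.name for c in rank_shortcuts(cands, remaining=1000, total=1000)])
    print([c.name for c in rank_shortcuts(cands, remaining=100, total=1000)])
```

With a full budget the higher-value shortcut ranks first. With 10% of the budget left, the cheaper one overtakes it, which is the behavior the emergency factor is meant to produce.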
Key Findings
Co-Saving reduces token usage versus ChatDev.
Co-Saving improves measured overall code quality versus ChatDev.
Shortcut selection and emergency weighting materially affect budgeted completion and quality.
Results
Token usage reduction vs ChatDev
Overall code quality improvement vs ChatDev
Quality (Co-Saving) - Table 1
BCR (Budgeted Completion Rate) - Table 1
Ablation - selection removed (BCR / Quality)
Who Should Care
What To Try In 7 Days
Log agent interactions as (state, instruction, next state) triples and build a small shortcut index from past successful tasks.
Implement a cheap embedding retrieval (text-embedding-ada-002 or similar) to find reference tasks for new requirements.
Add simple cost filters: estimate token/time cost for candidate shortcuts and drop those exceeding remaining budget; test forced termination thresholds.
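The three steps above can be prototyped in a few dozen lines. The sketch below is illustrative only: `Transition`, `ShortcutIndex`, and `toy_embed` are hypothetical names, and the bag-of-words `toy_embed` is a stand-in for a real embedding model such as text-embedding-ada-002, so the example runs without any external service.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Transition:
    state: str        # solution state before the instruction
    instruction: str  # instruction exchanged between agents
    next_state: str   # solution state after the instruction
    tokens: int       # estimated token cost of replaying this step


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


class ShortcutIndex:
    """Maps past task descriptions to their logged transitions."""

    def __init__(self) -> None:
        self.tasks: Dict[str, List[Transition]] = {}

    def log(self, task: str, transition: Transition) -> None:
        self.tasks.setdefault(task, []).append(transition)

    def nearest_task(self, requirement: str) -> Optional[str]:
        """Return the most similar past task, or None if the index is empty."""
        if not self.tasks:
            return None
        q = toy_embed(requirement)
        return max(self.tasks, key=lambda t: cosine(q, toy_embed(t)))

    def affordable(self, requirement: str, remaining_tokens: int) -> List[Transition]:
        """Shortcuts from the nearest task that fit in the remaining budget."""
        task = self.nearest_task(requirement)
        if task is None:
            return []
        return [t for t in self.tasks[task] if t.tokens <= remaining_tokens]


if __name__ == "__main__":
    idx = ShortcutIndex()
    idx.log("snake game with score board", Transition("empty", "write game loop", "loop done", 800))
    idx.log("snake game with score board", Transition("loop done", "add score board", "complete", 3000))
    idx.log("expense tracker with csv export", Transition("empty", "define data model", "model done", 500))
    for t in idx.affordable("a simple snake game", remaining_tokens=1000):
        print(t.instruction)
```

Filtering on a per-step token estimate is the simplest possible cost filter. A time estimate could be added to `Transition` and checked the same way.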
Agent Features
Memory
- reference task retrieval (shortcut memory)
Planning
- task decomposition
- reference-guided plan shortcuts
Tool Use
- external code compilation/execution environment
- semantic embeddings for retrieval
Frameworks
- ChatDev (used as base for experiments)
- MetaGPT (baseline)
Is Agentic
true
Architectures
- multi-agent system (role-based agents)
Collaboration
- iterative instruction-exchange (chat chain)
- role assignment (programmer/reviewer)
Optimization Features
Token Efficiency
- token-aware shortcut filtering
- normalization and ranking of token/time cost
System Optimization
- budget-aware emergency factor to shift priorities
Inference Optimization
- interaction pruning via shortcuts
- forced termination when path length exceeds reference
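Forced termination can be illustrated with a small loop guard. This is a sketch under stated assumptions, not the paper's procedure. It assumes the reference task supplies a path length and a token total, and that a hypothetical `slack` multiplier loosens or tightens both limits. `run_with_forced_termination` is an illustrative name.

```python
from typing import Iterable, List, Tuple


def run_with_forced_termination(
    steps: Iterable[Tuple[str, int]],
    ref_path_len: int,
    ref_tokens: int,
    slack: float = 1.0,
) -> Tuple[List[str], int, bool]:
    """Execute (instruction, token_cost) steps until a reference limit is hit.

    Returns (executed instructions, tokens spent, whether termination was forced).
    """
    executed: List[str] = []
    spent = 0
    for instruction, cost in steps:
        over_length = len(executed) >= ref_path_len * slack
        over_tokens = spent + cost > ref_tokens * slack
        if over_length or over_tokens:
            return executed, spent, True  # stop before exceeding the reference
        executed.append(instruction)
        spent += cost
    return executed, spent, False  # finished within the reference limits


if __name__ == "__main__":
    plan = [("design", 300), ("code", 900), ("review", 400), ("fix", 500)]
    print(run_with_forced_termination(plan, ref_path_len=3, ref_tokens=2000))
```

Checking the limit before executing a step keeps spending at or below the reference. The trade-off noted under Limitations still applies: a forced stop may leave the implementation incomplete.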
Reproducibility
Data Urls
- SRDD dataset referenced via [9] (ChatDev paper)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on finding similar historical tasks; cold-start tasks get no shortcut benefit.
- Embedding-based similarity may miss fine-grained code semantics and produce imperfect matches.
- Forced termination can trade completeness for budget adherence, reducing implementation detail on hard tasks.
When Not To Use
- For novel tasks without historical analogs in the shortcut store.
- When the budget is large enough that cost is irrelevant and extra reasoning can still improve quality.
- For safety-critical code, where any shortcut-derived change must be human-reviewed.
Failure Modes
- Applying an incorrect shortcut that produces semantically wrong code despite compiling.
- Over-pruning useful interactions and returning incomplete implementations.
- Embedding retrieval bias causing repeated reuse of suboptimal historical fixes.
Core Entities
Models
- GPT-3.5-Turbo
- GPT-4
- LLaMA 3 70B
- GPT-Engineer
- ReAct
- MetaGPT
- ChatDev
- Co-Saving (this work)
Metrics
- Completeness
- Executability
- Consistency
- Granularity
- Quality
- BCR (Budgeted Completion Rate)
Datasets
- SRDD (subset used for training shortcuts and testing)

