Use mined "shortcuts" from past multi-agent runs to cut tokens and speed up code generation

May 28, 2025 | 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Links

Abstract / PDF

Why It Matters For Business

Co-Saving can cut token bills and developer compute costs by reusing prior multi-agent transitions, while keeping or improving code quality on similar tasks, so teams can scale automated software generation under a fixed budget.

Summary TLDR

Co-Saving adds a small memory of past successful agent interactions (called "shortcuts") to multi-agent software-development systems. It ranks shortcuts by value vs cost (time and token usage), applies a dynamic emergency factor tied to remaining budget, and forces termination when interaction cost hits reference limits. On the SRDD software tasks, Co-Saving reports a large cut in token use and higher overall code quality versus prior multi-agent systems, while ablations show shortcut selection and the emergency factor materially affect success and budget completion.

Problem Statement

Multi-agent systems for software development produce good results but often waste tokens and time through redundant interactions. The paper aims to make multi-agent collaboration resource-aware so agents can reuse prior successful transitions to save tokens/time while keeping or improving code quality.

Main Contribution

Introduce "shortcuts": instruction fragments mined from historical multi-agent trajectories that connect non-adjacent solution states and can bypass redundant reasoning steps.
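The idea of connecting non-adjacent solution states can be sketched as follows. This is a minimal illustration, not the paper's mining procedure: it only enumerates candidate (start, end) pairs within one recorded trajectory and counts the interactions each would skip. The `Shortcut` class, the `mine_shortcuts` function, and the `max_span` cap are hypothetical names introduced here, and how the instruction text for each shortcut is produced is not covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shortcut:
    start: int        # index of the source state in the trajectory
    end: int          # index of the target state (always > start + 1)
    steps_saved: int  # intermediate interactions the shortcut would skip


def mine_shortcuts(num_states: int, max_span: int = 4) -> List[Shortcut]:
    """Enumerate candidate shortcuts between non-adjacent states.

    max_span is an assumed cap on how far ahead a shortcut may jump.
    """
    shortcuts = []
    for i in range(num_states):
        last = min(i + max_span, num_states - 1)
        for j in range(i + 2, last + 1):  # j = i + 1 is an ordinary step
            shortcuts.append(Shortcut(start=i, end=j, steps_saved=j - i - 1))
    return shortcuts


# A trajectory with states s0..s3 has three non-adjacent pairs.
candidates = mine_shortcuts(4)
```

In practice each candidate would also need a value estimate and a cost estimate before it is worth storing.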

Design a value-vs-cost scoring and filtering pipeline (time and token costs are normalized and combined via a harmonic mean) plus an "emergency factor" that weights cost more heavily as the budget depletes.
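A rough sketch of such a scoring pipeline is shown below. The exact formulas are not given in this summary, so everything here is an assumption: min-max normalization, a harmonic mean over the two normalized costs, an emergency factor defined as the fraction of budget already spent, and a linear blend of value and cost. All function names and dictionary keys are hypothetical.

```python
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; identical values all map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def emergency_factor(used: float, budget: float) -> float:
    """Assumed form: fraction of the budget already spent, clipped to [0, 1]."""
    return min(1.0, max(0.0, used / budget))


def rank_shortcuts(candidates: List[Dict], used: float, budget: float) -> List[str]:
    """Return candidate names, best first, weighting cost more as budget depletes."""
    times = min_max([c["time"] for c in candidates])
    tokens = min_max([c["tokens"] for c in candidates])
    alpha = emergency_factor(used, budget)
    scored = []
    for c, t, k in zip(candidates, times, tokens):
        cost = harmonic_mean(t, k)
        # Assumed blend: value dominates early, low cost dominates late.
        score = (1 - alpha) * c["value"] + alpha * (1 - cost)
        scored.append((score, c["name"]))
    return [name for _, name in sorted(scored, reverse=True)]


pool = [
    {"name": "A", "value": 0.9, "time": 10.0, "tokens": 1000.0},
    {"name": "B", "value": 0.6, "time": 2.0, "tokens": 200.0},
]
early = rank_shortcuts(pool, used=0, budget=100)    # high-value A ranks first
late = rank_shortcuts(pool, used=100, budget=100)   # cheap B ranks first
```

The useful property to preserve, whatever the real formula is, is that the same candidate pool gets reordered toward cheaper shortcuts as spending approaches the budget.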

Integrate shortcut retrieval into an existing multi-agent software-dev pipeline and show empirical gains on the SRDD dataset versus single- and multi-agent baselines.

Key Findings

Co-Saving reduces token usage versus ChatDev.

Numbers: 50.85% average reduction in tokens (paper abstract).

Co-Saving improves measured overall code quality versus ChatDev.

Numbers: Paper reports a 10.06% improvement in overall code quality (abstract).

Shortcut selection and emergency weighting materially affect budgeted completion and quality.

Numbers: Ablation: full model BCR 0.8 vs selection-removed BCR 0.6; full Quality 0.5453 vs selection-removed 0.4826 (Table 2).
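To put the ablation numbers in relative terms, the drop caused by removing shortcut selection can be computed directly from the reported values. This is plain arithmetic on the figures quoted above; the helper name `relative_drop` is introduced here for illustration.

```python
def relative_drop(full: float, ablated: float) -> float:
    """Fractional decrease from the full model to the ablated variant."""
    return (full - ablated) / full


bcr_drop = relative_drop(0.8, 0.6)            # BCR: full vs selection-removed
quality_drop = relative_drop(0.5453, 0.4826)  # Quality: full vs selection-removed

print(f"BCR drop: {bcr_drop:.1%}")          # BCR drop: 25.0%
print(f"Quality drop: {quality_drop:.1%}")  # Quality drop: 11.5%
```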

Results

Token usage reduction vs ChatDev

Value: 50.85% reduction

Baseline: ChatDev

Overall code quality improvement vs ChatDev

Value: 10.06% improvement

Baseline: ChatDev

Quality (Co-Saving) - Table 1

Value: 0.2515

Baseline: ChatDev 0.151

BCR (Budgeted Completion Rate) - Table 1

Value: 0.728

Baseline: ChatDev 0.016

Ablation - selection removed (BCR / Quality)

Value: BCR 0.6; Quality 0.4826

Baseline: Full model BCR 0.8; Quality 0.5453

Who Should Care

What To Try In 7 Days

Log agent interactions as (state, instruction, next state) triples and build a small shortcut index from past successful tasks.
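One possible shape for that log and index is sketched below. The `Transition` record, its field names, and `build_index` are hypothetical; the only idea taken from the text is storing (state, instruction, next state) triples and keeping those from successful tasks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Transition:
    task_id: str      # which past task this interaction belongs to
    state: str        # solution state before the instruction (e.g. code snapshot)
    instruction: str  # what one agent told the other
    next_state: str   # solution state after the instruction was applied


def build_index(log: List[Transition], successful: Set[str]) -> Dict[str, List[Transition]]:
    """Group logged transitions by task, keeping only tasks that succeeded."""
    index: Dict[str, List[Transition]] = defaultdict(list)
    for t in log:
        if t.task_id in successful:
            index[t.task_id].append(t)
    return dict(index)


log = [
    Transition("todo_app", "v0", "add a task list view", "v1"),
    Transition("todo_app", "v1", "persist tasks to disk", "v2"),
    Transition("broken_game", "v0", "add a game loop", "v1"),
]
index = build_index(log, successful={"todo_app"})
```

Keeping the per-task order intact matters, because shortcuts are defined over positions in a trajectory.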

Implement a cheap embedding retrieval (text-embedding-ada-002 or similar) to find reference tasks for new requirements.
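A minimal retrieval sketch, assuming the embedding vectors have already been computed by whichever model you choose and are held in memory. It uses plain cosine similarity with no external libraries; `cosine`, `top_k`, and the toy two-dimensional vectors are illustrative only.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 1) -> List[Tuple[str, float]]:
    """Return the k stored tasks most similar to the query embedding."""
    scored = [(name, cosine(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real requirement embeddings.
store = {"todo_app": [1.0, 0.0], "calculator": [0.0, 1.0]}
best = top_k([0.9, 0.1], store, k=1)
```

For a small shortcut store, a linear scan like this is usually enough; a vector index only becomes worthwhile once the store grows large.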

Add simple cost filters: estimate token/time cost for candidate shortcuts and drop those exceeding remaining budget; test forced termination thresholds.
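The two checks described above could look roughly like this. The `est_tokens` field, the `slack` multiplier, and both function names are assumptions made for the sketch; the paper's actual thresholds are not given in this summary.

```python
from typing import Dict, List


def filter_by_budget(candidates: List[Dict], remaining_tokens: int) -> List[Dict]:
    """Drop shortcuts whose estimated token cost exceeds what is left."""
    return [c for c in candidates if c["est_tokens"] <= remaining_tokens]


def should_terminate(steps_taken: int, reference_steps: int, slack: float = 1.0) -> bool:
    """Force a stop once the path is longer than the reference path.

    slack > 1.0 tolerates some overshoot; it is a knob to tune, not a
    value taken from the paper.
    """
    return steps_taken > reference_steps * slack


candidates = [
    {"name": "reuse_ui_scaffold", "est_tokens": 500},
    {"name": "regenerate_module", "est_tokens": 1500},
]
affordable = filter_by_budget(candidates, remaining_tokens=1000)
stop_now = should_terminate(steps_taken=6, reference_steps=5)
```

Sweeping `slack` over a few values (for example 1.0, 1.25, 1.5) on held-out tasks is one way to see how much completeness is traded away for budget adherence.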

Agent Features

Memory

  • reference task retrieval (shortcut memory)

Planning

  • task decomposition
  • reference-guided plan shortcuts

Tool Use

  • external code compilation/execution environment
  • semantic embeddings for retrieval

Frameworks

  • ChatDev (used as base for experiments)
  • MetaGPT (baseline)

Is Agentic

true

Architectures

  • multi-agent system (role-based agents)

Collaboration

  • iterative instruction-exchange (chat chain)
  • role assignment (programmer/reviewer)

Optimization Features

Token Efficiency

  • token-aware shortcut filtering
  • normalization and ranking of token/time cost

System Optimization

  • budget-aware emergency factor to shift priorities

Inference Optimization

  • interaction pruning via shortcuts
  • forced termination when path length exceeds reference

Reproducibility

Data Urls

  • SRDD dataset referenced via [9] (ChatDev paper)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on finding similar historical tasks; cold-start tasks get no shortcut benefit.
  • Embedding-based similarity may miss fine-grained code semantics and produce imperfect matches.
  • Forced termination can trade completeness for budget adherence, reducing implementation detail on hard tasks.

When Not To Use

  • For novel tasks without historical analogs in the shortcut store.
  • When the budget is large enough that cost is irrelevant and extra reasoning still improves quality.
  • For safety-critical code where any shortcut-derived change must be human-reviewed.

Failure Modes

  • Applying an incorrect shortcut that produces semantically wrong code despite compiling.
  • Over-pruning useful interactions and returning incomplete implementations.
  • Embedding retrieval bias causing repeated reuse of suboptimal historical fixes.

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-4
  • LLaMA 3 70B
  • GPT-Engineer
  • ReAct
  • MetaGPT
  • ChatDev
  • Co-Saving (this work)

Metrics

  • Completeness
  • Executability
  • Consistency
  • Granularity
  • Quality
  • BCR (Budgeted Completion Rate)

Datasets

  • SRDD (subset used for training shortcuts and testing)