Croto: Orchestrating multiple LLM agent teams to jointly propose, prune, and synthesize better code and stories

June 13, 2024 · 6 min

Overview

Production Readiness

0.35

Novelty Score

0.6

Cost Impact Score

0.45

Citation Count

3

Authors

Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, YiFei Wang, Rennai Qiu, Yufan Dang, Weize Chen, Cheng Yang, Ye Tian, Xuantang Xiong, Lei Han

Why It Matters For Business

Croto shows you can run multiple independent LLM teams, share and merge their intermediate outputs, and get measurably better code or narrative drafts. This is useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.

Summary TLDR

This paper introduces Croto, a framework that runs many independent LLM-driven agent teams on the same task, pauses them at key phases to share outputs, groups proposals, prunes low-quality ones, and greedily aggregates their strengths into a single improved solution. On 15 software tasks (SRDD) and 10 story tasks (ROCStories), Croto improves generation quality over state-of-the-art multi-agent baselines. The key knobs are the number of teams, per-team temperature to induce diversity, hierarchical partitioning to limit aggregation load, and greedy pruning to scale.
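The phase loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Team` type, `make_stub_team`, `run_phases`, and the `score` and `merge` callables are all hypothetical names, and the stub teams return formatted strings where a real system would call an LLM.

```python
from typing import Callable, List

# Hypothetical type: a team maps (phase, shared context) to one proposal.
Team = Callable[[str, str], str]


def make_stub_team(tag: str) -> Team:
    """Stand-in for an LLM agent team; a real team would call a model here."""
    def team(phase: str, context: str) -> str:
        return f"{context} [{tag}:{phase}]"
    return team


def run_phases(
    teams: List[Team],
    phases: List[str],
    task: str,
    score: Callable[[str], float],
    merge: Callable[[List[str]], str],
    keep: int = 2,
) -> str:
    """Run every team on each phase, then prune and merge before moving on."""
    context = task
    for phase in phases:
        # Teams work independently on the same shared context.
        proposals = [team(phase, context) for team in teams]
        # Pause point: keep only the highest-scoring proposals.
        survivors = sorted(proposals, key=score, reverse=True)[:keep]
        # The merged result becomes the shared context for the next phase.
        context = merge(survivors)
    return context


# Placeholder scoring (string length) and merging (take the top proposal).
teams = [make_stub_team(tag) for tag in ("a", "bb", "ccc")]
result = run_phases(teams, ["design", "coding"], "task",
                    score=len, merge=lambda ps: ps[0], keep=1)
```

In practice `score` would be an automatic quality metric and `merge` an aggregator agent. The placeholders here exist only to make the control flow visible.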

Problem Statement

Single-team LLM agent pipelines commit to one decision path per task and miss alternative, potentially better solution paths. Running many teams independently wastes opportunity for mutual insight and can overload aggregation. The paper asks how to let multiple independent LLM teams share intermediate results and synthesize them into superior final outputs without heavy task-specific customization.

Main Contribution

Croto: a multi-team orchestration framework that enables teams to exchange intermediate solutions at key phases and jointly aggregate them.

Hierarchy Partitioning and Greedy Aggregation (with a role-assigned aggregator) to group, prune, and synthesize diverse proposals while controlling context size.

Empirical results on code (SRDD) and story (ROCStories) tasks showing measurable quality gains and an analysis of team size, temperature diversity, and pruning effects.
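To make the partition-then-aggregate idea concrete, here is a hedged sketch of one way to do it. It assumes proposals are split into fixed-size groups, each group is merged by an aggregator, and the merged outputs are regrouped until one remains. The function names and the `aggregate` callable are hypothetical, and the paper's exact grouping rule may differ.

```python
from typing import Callable, List


def partition(items: List[str], group_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most group_size."""
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def hierarchical_aggregate(
    proposals: List[str],
    aggregate: Callable[[List[str]], str],
    group_size: int = 2,
) -> str:
    """Merge proposals level by level so no call sees more than group_size."""
    if not proposals:
        raise ValueError("need at least one proposal")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 to make progress")
    level = list(proposals)
    while len(level) > 1:
        # A leftover singleton passes through without an aggregator call.
        level = [
            group[0] if len(group) == 1 else aggregate(group)
            for group in partition(level, group_size)
        ]
    return level[0]


def join_aggregator(group: List[str]) -> str:
    """Placeholder for the role-assigned aggregator agent (assumption)."""
    return "(" + "+".join(group) + ")"


merged = hierarchical_aggregate(["p1", "p2", "p3", "p4", "p5"], join_aggregator)
```

The design point is that the aggregator's input is bounded by `group_size` regardless of how many teams run, at the price of more aggregation calls.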

Key Findings

Croto raises overall software quality over a strong multi-agent baseline (ChatDev).

Numbers: Quality: Croto 0.840 vs ChatDev 0.779

Greedy pruning makes large team counts (8 teams) better and faster by removing low-quality proposals before aggregation.

Numbers: 8-team + Prune: ΔQuality +0.065 (0.775 → 0.840); Executability +0.100

Croto generalizes to story generation and improves narrative quality across metrics.

Numbers: Story Quality: 8-team Croto+Prune 3.642 vs Single-Team 2.358

Results

Quality (software)

Value: 0.840

Baseline: ChatDev 0.779

Executability (software)

Value: 0.928

Baseline: ChatDev 0.813

Pruning effect on 8-team Croto (software)

Value: Quality 0.840 (after prune)

Baseline: Vanilla 8-team Croto 0.775

Quality (stories)

Value: 3.642

Baseline: Single-Team 2.358
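The value and baseline pairs above can be turned into absolute and relative gains with a few lines of arithmetic. The helper name `improvement` is illustrative; the numbers are the ones reported in this section.

```python
from typing import Tuple


def improvement(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent) over a baseline."""
    delta = value - baseline
    return delta, 100.0 * delta / baseline


# (value, baseline) pairs copied from the results listed above.
reported = {
    "Quality (software) vs ChatDev": (0.840, 0.779),
    "Executability (software) vs ChatDev": (0.928, 0.813),
    "Quality, 8-team pruned vs vanilla": (0.840, 0.775),
    "Quality (stories) vs Single-Team": (3.642, 2.358),
}

for name, (value, baseline) in reported.items():
    delta, pct = improvement(value, baseline)
    print(f"{name}: +{delta:.3f} absolute, +{pct:.1f}% relative")
```

This works out to roughly +0.061 (7.8%) for software quality, +0.115 (14.1%) for executability, +0.065 (8.4%) from pruning, and +1.284 (54.5%) for story quality.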

Who Should Care

What To Try In 7 Days

Run a 4-team Croto prototype on a small coding task to assess quality vs your current agent pipeline.

Tune per-team temperature to mix conservative and creative settings, then compare outputs.

Add greedy pruning before aggregation to reduce bad proposals and speed up synthesis.
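As a starting point for the steps above, the sketch below builds a 4-team configuration with temperatures spread from conservative to creative, then keeps the top-scoring teams before aggregation. The names `TeamConfig`, `spread_temperatures`, and `prune`, the 0.2 to 0.8 temperature range, and the example scores are all assumptions for illustration, not values from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TeamConfig:
    name: str
    temperature: float


def spread_temperatures(n_teams: int, low: float = 0.2,
                        high: float = 0.8) -> List[TeamConfig]:
    """Assign evenly spaced temperatures from low to high (assumed range)."""
    if n_teams < 1:
        raise ValueError("need at least one team")
    if n_teams == 1:
        return [TeamConfig("team-0", low)]
    step = (high - low) / (n_teams - 1)
    return [
        TeamConfig(f"team-{i}", round(low + i * step, 2))
        for i in range(n_teams)
    ]


def prune(scores: Dict[str, float], keep: int) -> List[str]:
    """Return the names of the `keep` highest-scoring teams, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


configs = spread_temperatures(4)

# Hypothetical quality scores for each team's proposal.
example_scores = {"team-0": 0.7, "team-1": 0.4, "team-2": 0.9, "team-3": 0.5}
kept = prune(example_scores, keep=2)
```

Each `TeamConfig.temperature` would be passed to whatever model client your pipeline already uses. Comparing the final output with and without the `prune` step gives a quick read on whether pruning helps on your task.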

Agent Features

Memory

  • short-term intra-phase exchange

Planning

  • phase-based planning

Tool Use

  • LLM-driven agents

Frameworks

  • Greedy Aggregation
  • Hierarchy Partitioning
  • Pruning

Is Agentic

true

Architectures

  • chain-as-team
  • multi-team orchestration

Collaboration

  • cross-team interaction
  • intra-team sequential dialog

Optimization Features

Token Efficiency

  • pruning reduces tokens by eliminating low-quality proposals

System Optimization

  • hierarchy partitioning to limit aggregation context
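A rough way to reason about the aggregation context is to compare the largest aggregator prompt with and without partitioning. The sketch below makes two simplifying assumptions: every proposal has about the same token count, and a merged output is about as long as one proposal. The function names and the 1,500-token figure are hypothetical.

```python
import math
from typing import List, Optional


def aggregation_levels(n_proposals: int, group_size: int) -> List[int]:
    """Number of groups at each level until a single output remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = []
    n = n_proposals
    while n > 1:
        n = math.ceil(n / group_size)
        levels.append(n)
    return levels


def peak_prompt_tokens(n_proposals: int, tokens_each: int,
                       group_size: Optional[int] = None) -> int:
    """Largest proposal payload a single aggregator call must read."""
    per_call = n_proposals if group_size is None else min(n_proposals, group_size)
    return per_call * tokens_each


flat_peak = peak_prompt_tokens(8, 1500)              # all 8 proposals at once
grouped_peak = peak_prompt_tokens(8, 1500, group_size=2)
levels = aggregation_levels(8, group_size=2)
```

Under these assumptions, grouping 8 proposals in pairs cuts the peak payload from 8 proposals to 2 per call, in exchange for three aggregation levels instead of one.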

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Greedy pruning may discard creative but useful solutions because automatic metrics are imperfect.
  • Agents often choose simple, hard-coded implementations without detailed requirements; precise specs remain crucial.
  • Coordination and compute costs grow with team size; scaling needs pruning and partitioning to stay practical.

When Not To Use

  • When you require production-grade, safety-critical code without human review.
  • In low-resource settings where running multiple teams is too expensive.
  • For tasks that cannot be meaningfully decomposed into phase-wise proposals or that lack clear key phases.

Failure Modes

  • Pruning removes promising long-tail proposals, reducing final innovation.
  • Aggregator becomes overwhelmed by too many diverse proposals and synthesizes a worse solution.
  • Agents default to trivial or hard-coded designs if task requirements are underspecified.

Core Entities

Models

  • GPT-3.5-Turbo

Metrics

  • Completeness
  • Executability
  • Consistency
  • Quality
  • Grammar and Fluency
  • Context Relevance
  • Logic Consistency
  • Quality (stories)

Datasets

  • SRDD
  • ROCStories