Croto: Orchestrating multiple LLM agent teams to jointly propose, prune, and synthesize better code and stories

June 13, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent quality gains across two domains (software and story generation) and in ablations. However, it relies on GPT-3.5 and on heuristics (pruning, aggregation), so validate it on your own workloads before adopting it.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 35%

Novelty: 60%

Authors

Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, YiFei Wang, Rennai Qiu, Yufan Dang, Weize Chen, Cheng Yang, Ye Tian, Xuantang Xiong, Lei Han

Links

Abstract / PDF / Code

Why It Matters For Business

Croto shows you can run multiple independent LLM teams, share and merge their intermediate outputs, and get measurably better code or narrative drafts—useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.

Summary TLDR

This paper introduces Croto, a framework that runs many independent LLM-driven agent teams on the same task, pauses them at key phases to share outputs, groups proposals, prunes low-quality ones, and greedily aggregates strengths into a single improved solution. On 15 software tasks (SRDD) and 10 story tasks (ROCStories), Croto improves generation quality vs state-of-the-art multi-agent baselines. Key knobs: number of teams, per-team temperature to induce diversity, hierarchical partitioning to limit aggregation load, and greedy pruning to scale.
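
The phase-by-phase loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_multi_team`, `team_step`, and `aggregate` are hypothetical names, the LLM calls are replaced by plain callables, and the sketch assumes every team continues from the aggregated result of the previous phase.

```python
from typing import Callable, List, Sequence

# Hypothetical signatures (assumptions, not taken from the paper or its code):
#   team_step(phase, shared_input, temperature) -> one team's proposal
#   aggregate(proposals) -> a single merged proposal
TeamStep = Callable[[str, str, float], str]
Aggregate = Callable[[List[str]], str]


def run_multi_team(
    task: str,
    phases: Sequence[str],
    temperatures: Sequence[float],
    team_step: TeamStep,
    aggregate: Aggregate,
) -> str:
    """Run one team per temperature, merging proposals at the end of each phase."""
    shared = task
    for phase in phases:
        # Each team works independently on the same shared input.
        proposals = [team_step(phase, shared, t) for t in temperatures]
        # Teams pause here; their outputs are merged into one shared result.
        shared = aggregate(proposals)
    return shared


# Stand-ins for LLM calls so the sketch runs without any model.
def fake_team_step(phase: str, shared: str, temperature: float) -> str:
    return f"{shared}>{phase}[t={temperature}]"


def fake_aggregate(proposals: List[str]) -> str:
    return sorted(proposals)[0]


final = run_multi_team("task", ["design", "code"], [0.2, 0.8],
                       fake_team_step, fake_aggregate)
```

In a real setup the two stand-in functions would wrap model calls; the loop structure is the only part this sketch is meant to show.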

Problem Statement

Single-team LLM agent pipelines commit to one decision path per task and miss alternative, potentially better solution paths. Running many teams independently wastes opportunity for mutual insight and can overload aggregation. The paper asks how to let multiple independent LLM teams share intermediate results and synthesize them into superior final outputs without heavy task-specific customization.

Main Contribution

Croto: a multi-team orchestration framework that enables teams to exchange intermediate solutions at key phases and jointly aggregate them.

Hierarchy Partitioning and Greedy Aggregation (with a role-assigned aggregator) to group, prune, and synthesize diverse proposals while controlling context size.
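
One way to picture how partitioning bounds the aggregator's context is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: `partition`, `hierarchical_aggregate`, and the `group_size` parameter are hypothetical, and the aggregator is a plain callable standing in for a role-assigned LLM agent.

```python
from typing import Callable, List, Sequence


def partition(items: Sequence[str], group_size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most group_size."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    return [list(items[i:i + group_size]) for i in range(0, len(items), group_size)]


def hierarchical_aggregate(
    proposals: Sequence[str],
    group_size: int,
    aggregate: Callable[[List[str]], str],
) -> str:
    """Merge proposals level by level so no call sees more than group_size items."""
    if not proposals:
        raise ValueError("at least one proposal is required")
    level = list(proposals)
    while len(level) > 1:
        groups = partition(level, group_size)
        # A leftover group of one is passed up unchanged.
        level = [aggregate(g) if len(g) > 1 else g[0] for g in groups]
    return level[0]


# Stand-in aggregator that records how proposals were combined.
def show_merge(group: List[str]) -> str:
    return "(" + "+".join(group) + ")"


merged = hierarchical_aggregate(["a", "b", "c", "d", "e"], 2, show_merge)
```

With five proposals and a group size of two, the aggregator is called four times and never receives more than two proposals at once, which is the context-limiting effect the partitioning is meant to provide.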

Key Findings

Croto raises overall software quality over a strong multi-agent baseline (ChatDev).

Numbers: Quality: Croto 0.840 vs ChatDev 0.779

Practical Use: Use cross-team interactions and aggregation to improve generated code quality; expect modest to meaningful gains over single-team pipelines.

Evidence Ref: Table 1

Greedy pruning makes large team counts (8 teams) both better and faster by removing low-quality proposals before aggregation.

Numbers: 8-team + Prune: ΔQuality +0.065 (0.775 → 0.840); Executability +0.100

Practical Use: If you scale to many teams, add a pruning step to drop weak proposals so aggregators aren't overwhelmed.

Evidence Ref: Table 2
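
A pruning step of this kind might look like the sketch below. It is one plausible reading rather than the paper's procedure: `greedy_prune`, the `keep` parameter, and the scoring function are hypothetical, and the scores are made-up placeholders for whatever quality metric you use.

```python
from typing import Callable, List


def greedy_prune(
    proposals: List[str],
    score: Callable[[str], float],
    keep: int,
) -> List[str]:
    """Keep the `keep` highest-scoring proposals, preserving their original order."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Rank proposal indices by score, best first.
    ranked = sorted(range(len(proposals)),
                    key=lambda i: score(proposals[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [proposals[i] for i in kept]


# Placeholder scores for eight team proposals (illustrative values only).
example_scores = {
    "team1": 0.41, "team2": 0.88, "team3": 0.15, "team4": 0.73,
    "team5": 0.52, "team6": 0.09, "team7": 0.67, "team8": 0.30,
}
survivors = greedy_prune(list(example_scores), example_scores.__getitem__, keep=4)
```

Only the surviving proposals would then be passed to the aggregator. The cutoff trades diversity for aggregation load, so it is worth tuning on your own tasks.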

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Quality (software) | 0.840 | ChatDev 0.779 | +0.061 | Average across 15 SRDD tasks | Croto vs ChatDev reported in Table 1 | Table 1
Executability (software) | 0.928 | ChatDev 0.813 | +0.115 | Average across 15 SRDD tasks | Reported in Table 1 | Table 1
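
To put the deltas in context, the short calculation below recomputes them from the reported values and adds the relative gain over the baseline. The relative figures are derived here for convenience and are not reported in the source; `gains` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (Croto value, ChatDev baseline) as listed in the results above.
reported: Dict[str, Tuple[float, float]] = {
    "Quality": (0.840, 0.779),
    "Executability": (0.928, 0.813),
}


def gains(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta, relative gain in percent) versus the baseline."""
    delta = value - baseline
    return round(delta, 3), round(100 * delta / baseline, 1)


summary = {name: gains(v, b) for name, (v, b) in reported.items()}
for name, (delta, pct) in summary.items():
    print(f"{name}: +{delta:.3f} absolute, +{pct:.1f}% relative")
```

This gives roughly a 7.8% relative gain in quality and a 14.1% relative gain in executability over the ChatDev baseline.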

What To Try In 7 Days

Run a 4-team Croto prototype on a small coding task to assess quality vs your current agent pipeline.

Tune per-team temperature to mix conservative and creative settings, then compare outputs.

Add greedy pruning before aggregation to reduce bad proposals and speed up synthesis.
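
For the temperature step, a simple starting point is to spread teams evenly between a conservative and a creative setting, as sketched below. The function name and the 0.2 to 1.0 range are assumptions chosen for illustration; the source does not state which temperatures were used.

```python
from typing import List


def team_temperatures(n_teams: int, low: float = 0.2, high: float = 1.0) -> List[float]:
    """Return one sampling temperature per team, evenly spaced from low to high."""
    if n_teams < 1:
        raise ValueError("n_teams must be at least 1")
    if low > high:
        raise ValueError("low must not exceed high")
    if n_teams == 1:
        return [low]
    step = (high - low) / (n_teams - 1)
    return [round(low + i * step, 2) for i in range(n_teams)]


# Four teams, from conservative to creative.
temps = team_temperatures(4)
```

Each value would be passed as the sampling temperature for one team's model calls, so outputs can be compared per team before aggregation.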

Agent Features

Memory
short-term intra-phase exchange
Planning
phase-based planning
Tool Use
LLM-driven agents
Frameworks
Greedy Aggregation, Hierarchy Partitioning, Pruning
Is Agentic

Yes

Architectures
chain-as-team, multi-team orchestration
Collaboration
cross-team interaction, intra-team sequential dialog

Optimization Features

Token Efficiency
pruning reduces tokens by eliminating low-quality proposals
System Optimization
hierarchy partitioning to limit aggregation context

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Greedy pruning may discard creative but useful solutions because automatic metrics are imperfect.

Agents often choose simple, hard-coded implementations without detailed requirements; precise specs remain crucial.

When Not To Use

When you require production-grade, safety-critical code without human review.

In low-resource settings where running multiple teams is too expensive.

Failure Modes

Pruning removes promising long-tail proposals, reducing final innovation.

Aggregator becomes overwhelmed by too many diverse proposals and synthesizes a worse solution.

Core Entities

Models

GPT-3.5-Turbo

Metrics

Completeness, Executability, Consistency, Quality, Grammar and Fluency, Context Relevance, Logic Consistency, Quality (stories)

Datasets

SRDD, ROCStories