Croto: Orchestrating multiple LLM agent teams to jointly propose, prune, and synthesize better code and stories

Overview

Decision Snapshot: Needs Validation

The method shows consistent quality gains on two domains and in ablations. However, it relies on a single LLM (GPT-3.5-Turbo) and on heuristics (pruning, aggregation), so validate it on your own workloads before adopting it.

Citations: 3

Evidence Strength: 0.75

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 35%

Novelty: 60%

Authors

Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, YiFei Wang, Rennai Qiu, Yufan Dang, Weize Chen, Cheng Yang, Ye Tian, Xuantang Xiong, Lei Han

Why It Matters For Business

Croto shows you can run multiple independent LLM teams, share and merge their intermediate outputs, and get measurably better code or narrative drafts—useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper introduces Croto, a framework that runs many independent LLM-driven agent teams on the same task, pauses them at key phases to share outputs, groups proposals, prunes low-quality ones, and greedily aggregates strengths into a single improved solution. On 15 software tasks (SRDD) and 10 story tasks (ROCStories), Croto improves generation quality vs state-of-the-art multi-agent baselines. Key knobs: number of teams, per-team temperature to induce diversity, hierarchical partitioning to limit aggregation load, and greedy pruning to scale.
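The phase loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_multi_team`, the callables `run_phase`, `score`, and `aggregate`, and the phase names are all hypothetical, and feeding the aggregated result back to every team as shared context is an assumption about how the exchange works.

```python
from typing import Callable, List, Sequence


def run_multi_team(
    task: str,
    phases: Sequence[str],
    temperatures: Sequence[float],
    run_phase: Callable[[str, str, float], str],
    score: Callable[[str], float],
    aggregate: Callable[[List[str]], str],
    keep: int,
) -> str:
    """Run one team per temperature, pausing after each phase to prune and merge."""
    shared = task  # context every team starts the next phase from (assumption)
    for phase in phases:
        # Each team produces its own proposal for this phase.
        proposals = [run_phase(phase, shared, t) for t in temperatures]
        # Prune: keep only the highest-scoring proposals.
        survivors = sorted(proposals, key=score, reverse=True)[:keep]
        # Aggregate the survivors into one solution shared with all teams.
        shared = aggregate(survivors)
    return shared


# Toy stand-ins for the LLM calls, so the sketch runs without any API.
def toy_run_phase(phase: str, context: str, temperature: float) -> str:
    return f"{context}>{phase}@{temperature}"


def toy_score(proposal: str) -> float:
    # Arbitrary placeholder metric: reads back the temperature tag.
    return float(proposal.rsplit("@", 1)[1])


def toy_aggregate(survivors: List[str]) -> str:
    # Placeholder: a real aggregator would synthesize, not just pick one.
    return survivors[0]


result = run_multi_team(
    "T", ["design", "code"], [0.2, 0.8],
    toy_run_phase, toy_score, toy_aggregate, keep=1,
)
```

In a real setup the three toy functions would be replaced by LLM-backed agents and an automatic quality metric.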

Problem Statement

Single-team LLM agent pipelines commit to one decision path per task and miss alternative, potentially better solution paths. Running many teams independently wastes opportunity for mutual insight and can overload aggregation. The paper asks how to let multiple independent LLM teams share intermediate results and synthesize them into superior final outputs without heavy task-specific customization.

Main Contribution

Croto: a multi-team orchestration framework that enables teams to exchange intermediate solutions at key phases and jointly aggregate them.

Hierarchy Partitioning and Greedy Aggregation (with a role-assigned aggregator) to group, prune, and synthesize diverse proposals while controlling context size.
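One way to picture hierarchy partitioning is as a tournament: proposals are split into small groups, each group is merged by an aggregator, and the merged results are grouped again until one solution remains. The sketch below assumes this reading; `partition`, `hierarchical_aggregate`, and the `group_size` parameter are hypothetical names, and the string-joining aggregator is only a stand-in for the role-assigned LLM aggregator.

```python
from typing import Callable, List


def partition(proposals: List[str], group_size: int) -> List[List[str]]:
    """Split proposals into consecutive groups of at most group_size items."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    return [proposals[i:i + group_size] for i in range(0, len(proposals), group_size)]


def hierarchical_aggregate(
    proposals: List[str],
    aggregate: Callable[[List[str]], str],
    group_size: int = 2,
) -> str:
    """Merge proposals level by level so no aggregator call sees more than group_size inputs."""
    if not proposals:
        raise ValueError("need at least one proposal")
    level = list(proposals)
    while len(level) > 1:
        level = [
            aggregate(group) if len(group) > 1 else group[0]
            for group in partition(level, group_size)
        ]
    return level[0]


def toy_aggregate(group: List[str]) -> str:
    # Stand-in for an LLM aggregator: records which inputs were merged.
    return "(" + " + ".join(group) + ")"


merged = hierarchical_aggregate(["A", "B", "C", "D", "E"], toy_aggregate, group_size=2)
print(merged)  # (((A + B) + (C + D)) + E)
```

The point of the grouping is that the aggregator's context grows with `group_size`, not with the total number of teams.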

Key Findings

Croto raises overall software quality over a strong multi-agent baseline (ChatDev).

Numbers: Quality: Croto 0.840 vs ChatDev 0.779

Practical Use: Use cross-team interactions and aggregation to improve generated code quality; expect modest to meaningful gains over single-team pipelines.

Evidence Ref: Table 1

Greedy pruning makes large team counts (8 teams) better and faster by removing low-quality proposals before aggregation.

Numbers: 8-team + Prune ΔQuality +0.065 (0.775 → 0.840); Executability +0.100

Practical Use: If you scale to many teams, add a pruning step to drop weak proposals so aggregators aren't overwhelmed.

Evidence Ref: Table 2
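A pruning step of the kind described in the second finding can be as simple as ranking proposals by an automatic score and keeping the top few. The sketch below is an assumption-laden illustration: the `Proposal` type, the `prune` function, the `keep` parameter, and the example scores are all hypothetical, and the actual scoring metric used in the paper is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    team: str     # which team produced it
    text: str     # the proposed code or story draft
    score: float  # automatic quality estimate (hypothetical metric)


def prune(proposals: List[Proposal], keep: int) -> List[Proposal]:
    """Keep the `keep` highest-scoring proposals; ties preserve input order."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    return sorted(proposals, key=lambda p: p.score, reverse=True)[:keep]


# Made-up scores for eight teams, purely for illustration.
candidates = [
    Proposal(f"team-{i}", f"draft {i}", s)
    for i, s in enumerate([0.41, 0.78, 0.55, 0.90, 0.12, 0.67, 0.83, 0.30])
]
survivors = prune(candidates, keep=4)
print([p.team for p in survivors])  # ['team-3', 'team-6', 'team-1', 'team-5']
```

Because the score is only an estimate, a hard cutoff like this can drop an unusual but useful proposal, so `keep` is worth tuning rather than fixing in advance.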

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Quality (software)	0.840	ChatDev 0.779	+0.061	Average across 15 SRDD tasks	Croto vs ChatDev reported in Table 1	Table 1
Executability (software)	0.928	ChatDev 0.813	+0.115	Average across 15 SRDD tasks	Reported in Table 1	Table 1
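As a quick sanity check, the Delta column can be recomputed from the Value and Baseline columns. The snippet below only uses the numbers in the table above; the relative-gain figure is derived arithmetic, not a number reported in the source.

```python
# (metric, Croto value, ChatDev baseline, reported delta) from the Results table
rows = [
    ("Quality", 0.840, 0.779, 0.061),
    ("Executability", 0.928, 0.813, 0.115),
]

checked = []
for metric, value, baseline, reported in rows:
    delta = round(value - baseline, 3)              # absolute gain
    relative = round(100 * delta / baseline, 1)     # gain as % of baseline
    checked.append((metric, delta, relative, delta == reported))
    print(f"{metric}: delta={delta:+.3f}, relative={relative:.1f}%, matches={delta == reported}")
```

Both recomputed deltas match the table, and they correspond to relative gains of roughly 7.8% (quality) and 14.1% (executability) over the ChatDev baseline.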

What To Try In 7 Days

Run a 4-team Croto prototype on a small coding task to assess quality vs your current agent pipeline.

Tune per-team temperature to mix conservative and creative settings, then compare outputs.

Add greedy pruning before aggregation to reduce bad proposals and speed up synthesis.
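For the temperature step, one simple starting point is to spread teams evenly between a conservative and a creative setting. The helper below is a hypothetical sketch: `team_temperatures` and the default bounds of 0.2 and 1.0 are illustrative assumptions, not values taken from the paper.

```python
from typing import List


def team_temperatures(n_teams: int, low: float = 0.2, high: float = 1.0) -> List[float]:
    """Evenly spaced sampling temperatures, one per team, from low to high."""
    if n_teams < 1:
        raise ValueError("n_teams must be at least 1")
    if low > high:
        raise ValueError("low must not exceed high")
    if n_teams == 1:
        return [low]
    step = (high - low) / (n_teams - 1)
    return [round(low + i * step, 3) for i in range(n_teams)]


temps = team_temperatures(4)
print(temps)  # [0.2, 0.467, 0.733, 1.0]
```

Each value would then be passed as the sampling temperature for one team's LLM calls, so the comparison in the second step covers both cautious and exploratory teams.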

Agent Features

Memory

short-term intra-phase exchange

Planning

phase-based planning

Tool Use

LLM-driven agents

Frameworks

Greedy Aggregation, Hierarchy Partitioning, Pruning

Is Agentic

Yes

Architectures

chain-as-team, multi-team orchestration

Collaboration

cross-team interaction, intra-team sequential dialog

Optimization Features

Token Efficiency

pruning reduces tokens by eliminating low-quality proposals

System Optimization

hierarchy partitioning to limit aggregation context

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/OpenBMB/ChatDev/tree/macnet

Risks & Boundaries

Limitations

Greedy pruning may discard creative but useful solutions because automatic metrics are imperfect.

Agents often choose simple, hard-coded implementations without detailed requirements; precise specs remain crucial.

When Not To Use

When you require production-grade, safety-critical code without human review.

In low-resource settings where running multiple teams is too expensive.

Failure Modes

Pruning removes promising long-tail proposals, reducing final innovation.

Aggregator becomes overwhelmed by too many diverse proposals and synthesizes a worse solution.

Core Entities

Models

GPT-3.5-Turbo

Metrics

Completeness, Executability, Consistency, Quality, Grammar and Fluency, Context Relevance, Logic Consistency, Quality (stories)

Datasets

SRDD, ROCStories
