Overview
The method shows consistent quality gains across two domains (software and story generation) and in ablations. However, it relies on LLMs (GPT-3.5) and heuristics (pruning, aggregation), so apply it carefully and validate it on your own workloads.
Citations: 3
Evidence Strength: 0.75
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 35%
Novelty: 60%
Why It Matters For Business
Croto shows that you can run multiple independent LLM teams, share and merge their intermediate outputs, and get measurably better code or narrative drafts. This is useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.
Who Should Care
Summary TLDR
This paper introduces Croto, a framework that runs many independent LLM-driven agent teams on the same task, pauses them at key phases to share outputs, groups proposals, prunes low-quality ones, and greedily aggregates their strengths into a single improved solution. On 15 software tasks (SRDD) and 10 story tasks (ROCStories), Croto improves generation quality over state-of-the-art multi-agent baselines. Key knobs: number of teams, per-team temperature to induce diversity, hierarchical partitioning to limit aggregation load, and greedy pruning to scale.
Problem Statement
Single-team LLM agent pipelines commit to one decision path per task and miss alternative, potentially better solution paths. Running many teams independently wastes the opportunity for teams to learn from each other, and merging all their outputs at once can overload the aggregation step. The paper asks how to let multiple independent LLM teams share intermediate results and synthesize them into superior final outputs without heavy task-specific customization.
Main Contribution
Croto: a multi-team orchestration framework that enables teams to exchange intermediate solutions at key phases and jointly aggregate them.
Hierarchy Partitioning and Greedy Aggregation (with a role-assigned aggregator) to group, prune, and synthesize diverse proposals while controlling context size.
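A minimal sketch of the partition-then-aggregate idea described above, assuming proposals are plain strings and the aggregator is any callable that merges a small group into one proposal. The names `partition`, `hierarchical_aggregate`, and `toy_aggregate` are hypothetical and not taken from the paper; in a real system the aggregator would be an LLM call with an assigned role.

```python
from typing import Callable, List


def partition(items: List[str], group_size: int) -> List[List[str]]:
    """Split proposals into consecutive groups of at most `group_size`."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def hierarchical_aggregate(
    proposals: List[str],
    aggregate: Callable[[List[str]], str],
    group_size: int = 2,
) -> str:
    """Merge proposals level by level so the aggregator never sees
    more than `group_size` proposals in one call."""
    if not proposals:
        raise ValueError("need at least one proposal")
    level = list(proposals)
    while len(level) > 1:
        # A leftover group of one is passed up unchanged.
        level = [
            aggregate(group) if len(group) > 1 else group[0]
            for group in partition(level, group_size)
        ]
    return level[0]


def toy_aggregate(group: List[str]) -> str:
    """Stand-in for an LLM aggregator (assumption): joins proposals."""
    return "+".join(group)
```

The point of the grouping is that context size per aggregation call is bounded by `group_size` rather than by the total number of teams.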
Key Findings
Croto raises overall software quality over a strong multi-agent baseline (ChatDev): 0.840 vs 0.779 (+0.061), averaged across 15 SRDD tasks (Table 1).
Greedy pruning makes runs with large team counts (8 teams) both better and faster by removing low-quality proposals before aggregation.
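The pruning step can be sketched as a simple rank-and-keep filter. This is an illustration only: the `prune` helper, the `keep` parameter, and the scoring function are assumptions, and the paper's actual quality metric and pruning rule may differ.

```python
from typing import Callable, List


def prune(
    proposals: List[str],
    score: Callable[[str], float],
    keep: int,
) -> List[str]:
    """Keep the `keep` highest-scoring proposals, best first.

    `score` is a hypothetical automatic quality metric; ties keep
    their original order because Python's sort is stable.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ranked = sorted(proposals, key=score, reverse=True)
    return ranked[:keep]
```

Because the filter trusts `score` completely, any weakness in the metric translates directly into discarded proposals, which is worth keeping in mind when choosing `keep`.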
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Quality (software) | 0.840 | ChatDev 0.779 | +0.061 | Average across 15 SRDD tasks | Croto vs ChatDev reported in Table 1 | Table 1 |
| Executability (software) | 0.928 | ChatDev 0.813 | +0.115 | Average across 15 SRDD tasks | Reported in Table 1 | Table 1 |
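The Delta column is the absolute difference between the Croto value and the ChatDev baseline. A short sketch that recomputes it from the table values and also derives the relative gain; the relative figure is not reported in the table and is added here only as a derived illustration, and the `deltas` helper is a hypothetical name.

```python
def deltas(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute delta, relative delta) of value over baseline."""
    absolute = value - baseline
    relative = absolute / baseline
    return absolute, relative


# Values copied from the results table.
rows = {
    "Quality (software)": (0.840, 0.779),
    "Executability (software)": (0.928, 0.813),
}

for metric, (value, baseline) in rows.items():
    absolute, relative = deltas(value, baseline)
    print(f"{metric}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```

Under these numbers the relative gains come out at roughly 7.8% for quality and 14.1% for executability.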
What To Try In 7 Days
Run a 4-team Croto prototype on a small coding task to assess quality vs your current agent pipeline.
Tune per-team temperature to mix conservative and creative settings, then compare outputs.
Add greedy pruning before aggregation to reduce bad proposals and speed up synthesis.
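For the temperature step above, one simple way to mix conservative and creative teams is to spread temperatures evenly across a range. The helper name `team_temperatures` and the default bounds of 0.2 and 1.0 are assumptions chosen for illustration, not settings reported in the paper.

```python
from typing import List


def team_temperatures(n_teams: int, low: float = 0.2, high: float = 1.0) -> List[float]:
    """Assign one sampling temperature per team, evenly spaced from
    `low` (conservative) to `high` (creative)."""
    if n_teams < 1:
        raise ValueError("n_teams must be at least 1")
    if n_teams == 1:
        # A single team gets the midpoint of the range.
        return [round((low + high) / 2, 2)]
    step = (high - low) / (n_teams - 1)
    return [round(low + i * step, 2) for i in range(n_teams)]
```

For a 4-team prototype, `team_temperatures(4)` gives one low-temperature team, one high-temperature team, and two in between, which makes it easy to compare outputs across settings.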
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Reproducibility
Risks & Boundaries
Limitations
Greedy pruning may discard creative but useful solutions because automatic metrics are imperfect.
When requirements lack detail, agents often fall back to simple, hard-coded implementations, so precise specs remain crucial.
When Not To Use
When you require production-grade, safety-critical code without human review.
In low-resource settings where running multiple teams is too expensive.
Failure Modes
Pruning removes promising long-tail proposals, reducing final innovation.
Aggregator becomes overwhelmed by too many diverse proposals and synthesizes a worse solution.

