Overview
Production Readiness
0.35
Novelty Score
0.6
Cost Impact Score
0.45
Citation Count
3
Why It Matters For Business
Croto shows that running multiple independent LLM teams, then sharing and merging their intermediate outputs, yields measurably better code and narrative drafts. This is useful for prototyping, product ideation, and automating complex content that benefits from diverse perspectives.
Summary TLDR
This paper introduces Croto, a framework that runs many independent LLM-driven agent teams on the same task, pauses them at key phases to share outputs, groups proposals, prunes low-quality ones, and greedily aggregates strengths into a single improved solution. On 15 software tasks (SRDD) and 10 story tasks (ROCStories), Croto improves generation quality vs state-of-the-art multi-agent baselines. Key knobs: number of teams, per-team temperature to induce diversity, hierarchical partitioning to limit aggregation load, and greedy pruning to scale.
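The prune-then-aggregate step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in Croto the aggregator is a role-assigned LLM agent, whereas here `score` and `merge` are hypothetical plain callables supplied by the caller, and the rule "accept a merge only if the score improves" is an assumption made for this sketch.

```python
from typing import Callable, List


def prune(proposals: List[str], score: Callable[[str], float], keep: int) -> List[str]:
    """Keep the `keep` highest-scoring proposals.

    `score` is a hypothetical quality proxy, not a metric defined by the paper.
    """
    return sorted(proposals, key=score, reverse=True)[:keep]


def greedy_aggregate(
    proposals: List[str],
    score: Callable[[str], float],
    merge: Callable[[str, str], str],
) -> str:
    """Fold proposals best-first into a single solution.

    Assumption for this sketch: a merge is accepted only when it scores
    strictly higher than the current best solution.
    """
    if not proposals:
        raise ValueError("need at least one proposal")
    ranked = sorted(proposals, key=score, reverse=True)
    best = ranked[0]
    for candidate in ranked[1:]:
        merged = merge(best, candidate)  # stand-in for the LLM aggregator
        if score(merged) > score(best):
            best = merged
    return best
```

In a toy setting where each proposal is a space-separated feature list, `score` counts distinct features, and `merge` takes the union, pruning drops the weakest proposal and aggregation keeps only merges that add a new feature.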
Problem Statement
Single-team LLM agent pipelines commit to one decision path per task and miss alternative, potentially better solution paths. Running many teams independently wastes opportunity for mutual insight and can overload aggregation. The paper asks how to let multiple independent LLM teams share intermediate results and synthesize them into superior final outputs without heavy task-specific customization.
Main Contribution
Croto: a multi-team orchestration framework that enables teams to exchange intermediate solutions at key phases and jointly aggregate them.
Hierarchy Partitioning and Greedy Aggregation (with a role-assigned aggregator) to group, prune, and synthesize diverse proposals while controlling context size.
Empirical results on code (SRDD) and story (ROCStories) tasks showing measurable quality gains and an analysis of team size, temperature diversity, and pruning effects.
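Hierarchy partitioning, as listed in the contributions above, can be illustrated with a short sketch. The idea shown is that proposals are split into small groups, each group is aggregated separately, and the group results are aggregated again until one solution remains, so the aggregator never sees more than `group_size` proposals at once. The names `partition`, `hierarchical_aggregate`, and `group_size` are hypothetical, and the exact grouping rule used in the paper may differ.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def partition(items: List[T], group_size: int) -> List[List[T]]:
    """Split items into consecutive groups of at most `group_size`."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


def hierarchical_aggregate(
    items: List[T],
    group_size: int,
    aggregate: Callable[[List[T]], T],
) -> T:
    """Aggregate group by group, level by level, until one item is left.

    `aggregate` is a stand-in for the aggregator agent; it only ever
    receives a list of at most `group_size` items.
    """
    if not items:
        raise ValueError("need at least one item")
    level = list(items)
    while len(level) > 1:
        level = [aggregate(group) for group in partition(level, group_size)]
    return level[0]
```

With eight proposals and a `group_size` of 3, the first level produces three partial solutions and the second level merges those into the final one.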
Key Findings
Croto raises overall software quality over a strong multi-agent baseline (ChatDev).
Greedy pruning makes large team counts (8 teams) both higher-quality and faster by removing low-quality proposals before aggregation.
Croto generalizes to story generation and improves narrative quality across metrics.
Results
Quality (software)
Executability (software)
Pruning effect on 8-team Croto (software)
Quality (stories)
Who Should Care
What To Try In 7 Days
Run a 4-team Croto prototype on a small coding task to assess quality vs your current agent pipeline.
Tune per-team temperature to mix conservative and creative settings, then compare outputs.
Add greedy pruning before aggregation to reduce bad proposals and speed up synthesis.
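For the temperature step above, one simple way to mix conservative and creative teams is to spread temperatures evenly across the teams. The sketch below is only an illustration: the function name `team_temperatures` and the default range of 0.2 to 1.0 are assumptions, not values reported in the paper.

```python
from typing import List


def team_temperatures(n_teams: int, low: float = 0.2, high: float = 1.0) -> List[float]:
    """Return one sampling temperature per team, evenly spaced from low to high.

    The default range is an assumed starting point for experimentation.
    """
    if n_teams < 1:
        raise ValueError("n_teams must be at least 1")
    if n_teams == 1:
        return [round((low + high) / 2, 2)]
    step = (high - low) / (n_teams - 1)
    return [round(low + i * step, 2) for i in range(n_teams)]


# Hypothetical per-team configuration for a 4-team prototype.
team_configs = [
    {"team_id": i, "temperature": t}
    for i, t in enumerate(team_temperatures(4))
]
```

Each entry in `team_configs` can then be passed to whatever agent pipeline is already in use, so that only the sampling temperature differs between teams.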
Agent Features
Memory
- short-term intra-phase exchange
Planning
- phase-based planning
Tool Use
- LLM-driven agents
Frameworks
- Greedy Aggregation
- Hierarchy Partitioning
- Pruning
Is Agentic
true
Architectures
- chain-as-team
- multi-team orchestration
Collaboration
- cross-team interaction
- intra-team sequential dialog
Optimization Features
Token Efficiency
- pruning reduces tokens by eliminating low-quality proposals
System Optimization
- hierarchy partitioning to limit aggregation context
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Greedy pruning may discard creative but useful solutions because automatic metrics are imperfect.
- When requirements lack detail, agents often choose simple, hard-coded implementations; precise specs remain crucial.
- Coordination and compute costs grow with team size; scaling needs pruning and partitioning to stay practical.
When Not To Use
- When you require production-grade, safety-critical code without human review.
- In low-resource settings where running multiple teams is too expensive.
- For tasks that cannot be meaningfully decomposed into phase-wise proposals or that lack clear key phases.
Failure Modes
- Pruning removes promising long-tail proposals, reducing final innovation.
- Aggregator becomes overwhelmed by too many diverse proposals and synthesizes a worse solution.
- Agents default to trivial or hard-coded designs if task requirements are underspecified.
Core Entities
Models
- GPT-3.5-Turbo
Metrics
- Completeness
- Executability
- Consistency
- Quality
- Grammar and Fluency
- Context Relevance
- Logic Consistency
- Quality (stories)
Datasets
- SRDD
- ROCStories

