Overview
The single-agent baseline is practically useful when agents share the same base model and tools; evidence spans multiple datasets and both closed- and open-weight models.
Citations2
Evidence Strength0.80
Confidence0.80
Risk Signals9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
If your agent pipeline uses the same base LLM, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying stack.
Who Should Care
Summary TLDR
Most multi-agent LLM workflows are homogeneous (same base model, different prompts/tools). The authors show a single LLM can role-play those agents in a multi-turn conversation, reuse the model's KV cache (attention state), match or slightly exceed multi-agent accuracy across 7 benchmarks, and reduce inference cost. They introduce OneFlow, an MCTS + dual-meta-LLM method that finds compact workflows optimized for single-agent execution. Limitation: true heterogeneous workflows (different base models) still cannot be simulated because KV caches cannot be shared.
Problem Statement
Are homogeneous multi-agent workflows (several agents built on the same base LLM) actually necessary, or can a single LLM simulate their behavior via multi-turn conversations and shared KV cache to keep accuracy while cutting inference cost?
Main Contribution
Empirical finding that a single LLM role-playing multiple homogeneous agents matches or slightly improves multi-agent performance on seven diverse benchmarks.
OneFlow: an automatic workflow search algorithm (MCTS + two meta-LLMs) that finds compact, cost-efficient workflows suited for single-agent execution.
Key Findings
A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.
Single-agent execution reduces inference cost substantially by reusing KV cache.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| pass@1 (HumanEval) | OneFlow (single-agent) 92.1% ±0.4 | IO 89.1% ±0.4 | +3.0 pts vs IO | HumanEval | Table 1: OneFlow (single-agent) 92.1% vs IO 89.1% | Table 1 |
| F1 (DROP) | OneFlow (Claude 3.5) 87.5% ±0.0 | AFlow (heterogeneous) 85.5% ±0.5 | +2.0 pts vs heterogeneous AFlow | DROP | Table 3: OneFlow (Claude 3.5) 87.5% vs AFlow hetero 85.5% | Table 3 |
What To Try In 7 Days
Replace homogeneous multi-agent invocations with a single multi-turn LLM and compare accuracy and token costs on a held-out sample.
Run OneFlow (or a lightweight MCTS search) to compress long workflows into fewer, stronger agent roles and re-evaluate cost.
If using open-weight models, enable KV cache (vLLM) and measure latency/throughput trade-offs on multi-turn execution.
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Single-agent simulation assumes agents share the same base LLM; it cannot simulate true heterogeneity because KV caches are model-specific.
KV-cache cost estimates for closed APIs are simulated using final message lists; real API runtimes may differ.
When Not To Use
When agent roles require different base models with genuinely distinct capabilities.
When tool side-effects are nondeterministic and break the single-agent simulation assumptions.
Failure Modes
Context bloat and prompt interference from very long multi-turn histories.
Non-deterministic tools or external side-effects violate the simulation proof assumptions and can change behavior.

