Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
If your agent pipeline uses the same base LLM, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying stack.
Summary TLDR
Most multi-agent LLM workflows are homogeneous (same base model, different prompts/tools). The authors show a single LLM can role-play those agents in a multi-turn conversation, reuse the model's KV cache (attention state), match or slightly exceed multi-agent accuracy across 7 benchmarks, and reduce inference cost. They introduce OneFlow, an MCTS + dual-meta-LLM method that finds compact workflows optimized for single-agent execution. Limitation: true heterogeneous workflows (different base models) still cannot be simulated because KV caches cannot be shared.
Problem Statement
Are homogeneous multi-agent workflows (several agents built on the same base LLM) actually necessary, or can a single LLM simulate their behavior via multi-turn conversations and shared KV cache to keep accuracy while cutting inference cost?
Main Contribution
Empirical finding that a single LLM role-playing multiple homogeneous agents matches or slightly improves multi-agent performance on seven diverse benchmarks.
OneFlow: an automatic workflow search algorithm (MCTS + two meta-LLMs) that finds compact, cost-efficient workflows suited for single-agent execution.
Analysis showing single-agent execution gives substantial inference cost savings via KV-cache reuse, and clarifying the boundary where heterogeneity still matters.
Key Findings
A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.
Single-agent execution reduces inference cost substantially by reusing KV cache.
KV-cache reuse helps maintain or improve performance and keeps latency/throughput stable on open models.
Heterogeneous workflows remain a distinct regime; single-agent simulation cannot capture inter-model KV sharing.
Results
pass@1 (HumanEval)
F1 (DROP)
Inference cost (USD, GSM8K)
pass@1 (Qwen-3 8B, HumanEval)
Who Should Care
What To Try In 7 Days
Replace homogeneous multi-agent invocations with a single multi-turn LLM and compare accuracy and token costs on a held-out sample.
Run OneFlow (or a lightweight MCTS search) to compress long workflows into fewer, stronger agent roles and re-evaluate cost.
If using open-weight models, enable KV cache (vLLM) and measure latency/throughput trade-offs on multi-turn execution.
Agent Features
Memory
- KV cache reuse (attention state caching)
Planning
- task decomposition via workflow graph
- MCTS-based workflow search
Tool Use
- sandboxed Python operators
- external tool calls routed by workflow
Frameworks
- OneFlow
- AFlow
- vLLM
Is Agentic
true
Architectures
- homogeneous multi-agent workflow
- single-LLM multi-turn simulation
Collaboration
- role-playing through multi-turn conversation
- designer + critic meta-LLMs for workflow search
Optimization Features
Token Efficiency
- reduced input tokens by compressing workflow prompts and roles
- reuse of cached attention states to avoid re-encoding prefixes
Infra Optimization
- batching via vLLM
- long-context config (16k) for open-weight models
System Optimization
- use vLLM for open-weight KV cache experiments
- simulate KV cost for closed APIs
Inference Optimization
- KV cache reuse across turns
- compact workflows with fewer agent turns
- compaction via deterministic summarization to limit context growth
Reproducibility
Data Urls
- HumanEval
- MBPP
- GSM8K
- MATH
- HotpotQA
- DROP
- TravelPlanner
- Shopping-MMLU
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-agent simulation assumes agents share the same base LLM; it cannot simulate true heterogeneity because KV caches are model-specific.
- KV-cache cost estimates for closed APIs are simulated using final message lists; real API runtimes may differ.
- Automatic heterogeneous workflow experiments were a pilot and may not reflect perfectly optimized heterogeneous designs.
When Not To Use
- When agent roles require different base models with genuinely distinct capabilities.
- When tool side-effects are nondeterministic and break the single-agent simulation assumptions.
- When strict per-turn process isolation or independent model state is required for auditing or security.
Failure Modes
- Context bloat and prompt interference from very long multi-turn histories.
- Non-deterministic tools or external side-effects violate the simulation proof assumptions and can change behavior.
- Cost estimates optimistic when using simulated KV-cache for closed APIs; real-world latency may increase.
Core Entities
Models
- GPT-4o-mini
- Claude-3.5-Haiku
- Claude-4.0-Sonnet
- Qwen-3-8B
Metrics
- pass@1
- F1
- solve rate (%)
- Accuracy
- task success rate (%)
- inference cost (USD)
Datasets
- HumanEval
- MBPP
- GSM8K
- MATH
- HotpotQA
- DROP
- TravelPlanner
- Shopping-MMLU
Benchmarks
- HumanEval
- MBPP
- GSM8K
- MATH
- HotpotQA
- DROP
- TravelPlanner
- Shopping-MMLU

