A single LLM can role-play homogeneous multi-agent workflows and cut inference cost via KV-cache reuse

January 18, 20268 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding

Links

Abstract / PDF

Why It Matters For Business

If your agent pipeline uses the same base LLM, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying stack.

Summary TLDR

Most multi-agent LLM workflows are homogeneous (same base model, different prompts/tools). The authors show a single LLM can role-play those agents in a multi-turn conversation, reuse the model's KV cache (attention state), match or slightly exceed multi-agent accuracy across 7 benchmarks, and reduce inference cost. They introduce OneFlow, an MCTS + dual-meta-LLM method that finds compact workflows optimized for single-agent execution. Limitation: true heterogeneous workflows (different base models) still cannot be simulated because KV caches cannot be shared.

Problem Statement

Are homogeneous multi-agent workflows (several agents built on the same base LLM) actually necessary, or can a single LLM simulate their behavior via multi-turn conversations and shared KV cache to keep accuracy while cutting inference cost?

Main Contribution

Empirical finding that a single LLM role-playing multiple homogeneous agents matches or slightly improves multi-agent performance on seven diverse benchmarks.

OneFlow: an automatic workflow search algorithm (MCTS + two meta-LLMs) that finds compact, cost-efficient workflows suited for single-agent execution.

Analysis showing single-agent execution gives substantial inference cost savings via KV-cache reuse, and clarifying the boundary where heterogeneity still matters.

Key Findings

A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.

NumbersHumanEval: OneFlow multi-agent 91.6% → OneFlow single-agent 92.1% (Table 1)

Single-agent execution reduces inference cost substantially by reusing KV cache.

NumbersGSM8K cost: OneFlow multi $0.623 → OneFlow single-agent $0.387 (≈38% lower) (Table 2)

KV-cache reuse helps maintain or improve performance and keeps latency/throughput stable on open models.

NumbersQwen-3 8B HumanEval: AFlow stateless 86.8% → AFlow single-agent 90.5% (Table 4)

Heterogeneous workflows remain a distinct regime; single-agent simulation cannot capture inter-model KV sharing.

NumbersDROP: Heterogeneous AFlow (GPT-4o mini + Claude 3.5) F1 85.5 vs OneFlow (Claude 3.5) F1 87.5 (Table 3)

Results

pass@1 (HumanEval)

ValueOneFlow (single-agent) 92.1% ±0.4

BaselineIO 89.1% ±0.4

F1 (DROP)

ValueOneFlow (Claude 3.5) 87.5% ±0.0

BaselineAFlow (heterogeneous) 85.5% ±0.5

Inference cost (USD, GSM8K)

ValueOneFlow (single-agent) $0.387 ±0.000

BaselineOneFlow multi-agent $0.623 ±0.001

pass@1 (Qwen-3 8B, HumanEval)

ValueAFlow (single-agent) 90.5%

BaselineAFlow (stateless calls) 86.8%

Who Should Care

What To Try In 7 Days

Replace homogeneous multi-agent invocations with a single multi-turn LLM and compare accuracy and token costs on a held-out sample.

Run OneFlow (or a lightweight MCTS search) to compress long workflows into fewer, stronger agent roles and re-evaluate cost.

If using open-weight models, enable KV cache (vLLM) and measure latency/throughput trade-offs on multi-turn execution.

Agent Features

Memory

  • KV cache reuse (attention state caching)

Planning

  • task decomposition via workflow graph
  • MCTS-based workflow search

Tool Use

  • sandboxed Python operators
  • external tool calls routed by workflow

Frameworks

  • OneFlow
  • AFlow
  • vLLM

Is Agentic

true

Architectures

  • homogeneous multi-agent workflow
  • single-LLM multi-turn simulation

Collaboration

  • role-playing through multi-turn conversation
  • designer + critic meta-LLMs for workflow search

Optimization Features

Token Efficiency

  • reduced input tokens by compressing workflow prompts and roles
  • reuse of cached attention states to avoid re-encoding prefixes

Infra Optimization

  • batching via vLLM
  • long-context config (16k) for open-weight models

System Optimization

  • use vLLM for open-weight KV cache experiments
  • simulate KV cost for closed APIs

Inference Optimization

  • KV cache reuse across turns
  • compact workflows with fewer agent turns
  • compaction via deterministic summarization to limit context growth

Reproducibility

Data Urls

  • HumanEval
  • MBPP
  • GSM8K
  • MATH
  • HotpotQA
  • DROP
  • TravelPlanner
  • Shopping-MMLU

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-agent simulation assumes agents share the same base LLM; it cannot simulate true heterogeneity because KV caches are model-specific.
  • KV-cache cost estimates for closed APIs are simulated using final message lists; real API runtimes may differ.
  • Automatic heterogeneous workflow experiments were a pilot and may not reflect perfectly optimized heterogeneous designs.

When Not To Use

  • When agent roles require different base models with genuinely distinct capabilities.
  • When tool side-effects are nondeterministic and break the single-agent simulation assumptions.
  • When strict per-turn process isolation or independent model state is required for auditing or security.

Failure Modes

  • Context bloat and prompt interference from very long multi-turn histories.
  • Non-deterministic tools or external side-effects violate the simulation proof assumptions and can change behavior.
  • Cost estimates optimistic when using simulated KV-cache for closed APIs; real-world latency may increase.

Core Entities

Models

  • GPT-4o-mini
  • Claude-3.5-Haiku
  • Claude-4.0-Sonnet
  • Qwen-3-8B

Metrics

  • pass@1
  • F1
  • solve rate (%)
  • Accuracy
  • task success rate (%)
  • inference cost (USD)

Datasets

  • HumanEval
  • MBPP
  • GSM8K
  • MATH
  • HotpotQA
  • DROP
  • TravelPlanner
  • Shopping-MMLU

Benchmarks

  • HumanEval
  • MBPP
  • GSM8K
  • MATH
  • HotpotQA
  • DROP
  • TravelPlanner
  • Shopping-MMLU