A single LLM can role-play homogeneous multi-agent workflows and cut inference cost via KV-cache reuse

Overview

Decision SnapshotReady For Pilot

The single-agent baseline is practically useful when agents share the same base model and tools; evidence spans multiple datasets and both closed- and open-weight models.

Citations2

Evidence Strength0.80

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding

Links

Abstract / PDF / Data

Why It Matters For Business

If your agent pipeline uses the same base LLM, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying stack.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Data Scientist

Summary TLDR

Most multi-agent LLM workflows are homogeneous (same base model, different prompts/tools). The authors show a single LLM can role-play those agents in a multi-turn conversation, reuse the model's KV cache (attention state), match or slightly exceed multi-agent accuracy across 7 benchmarks, and reduce inference cost. They introduce OneFlow, an MCTS + dual-meta-LLM method that finds compact workflows optimized for single-agent execution. Limitation: true heterogeneous workflows (different base models) still cannot be simulated because KV caches cannot be shared.

Problem Statement

Are homogeneous multi-agent workflows (several agents built on the same base LLM) actually necessary, or can a single LLM simulate their behavior via multi-turn conversations and shared KV cache to keep accuracy while cutting inference cost?

Main Contribution

Empirical finding that a single LLM role-playing multiple homogeneous agents matches or slightly improves multi-agent performance on seven diverse benchmarks.

OneFlow: an automatic workflow search algorithm (MCTS + two meta-LLMs) that finds compact, cost-efficient workflows suited for single-agent execution.

Key Findings

A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.

NumbersHumanEval: OneFlow multi-agent 91.6% → OneFlow single-agent 92.1% (Table 1)

Practical UseTry implementing homogeneous multi-agent workflows as a single multi-turn agent first; you will likely retain accuracy and simplify deployment.

Evidence RefTable 1 (Main results with GPT-4o-mini)

Single-agent execution reduces inference cost substantially by reusing KV cache.

NumbersGSM8K cost: OneFlow multi $0.623 → OneFlow single-agent $0.387 (≈38% lower) (Table 2)

Practical UseIf agents use the same base model, switch to single-agent multi-turn execution to lower API spend and token expenditure.

Evidence RefTable 2 (Inference cost with GPT-4o-mini)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
pass@1 (HumanEval)	OneFlow (single-agent) 92.1% ±0.4	IO 89.1% ±0.4	+3.0 pts vs IO	HumanEval	Table 1: OneFlow (single-agent) 92.1% vs IO 89.1%	Table 1
F1 (DROP)	OneFlow (Claude 3.5) 87.5% ±0.0	AFlow (heterogeneous) 85.5% ±0.5	+2.0 pts vs heterogeneous AFlow	DROP	Table 3: OneFlow (Claude 3.5) 87.5% vs AFlow hetero 85.5%	Table 3

What To Try In 7 Days

Replace homogeneous multi-agent invocations with a single multi-turn LLM and compare accuracy and token costs on a held-out sample.

Run OneFlow (or a lightweight MCTS search) to compress long workflows into fewer, stronger agent roles and re-evaluate cost.

If using open-weight models, enable KV cache (vLLM) and measure latency/throughput trade-offs on multi-turn execution.

Agent Features

Memory

KV cache reuse (attention state caching)

Planning

task decomposition via workflow graphMCTS-based workflow search

Tool Use

sandboxed Python operatorsexternal tool calls routed by workflow

Frameworks

OneFlowAFlowvLLM

Is Agentic

Yes

Architectures

homogeneous multi-agent workflowsingle-LLM multi-turn simulation

Collaboration

role-playing through multi-turn conversationdesigner + critic meta-LLMs for workflow search

Optimization Features

Token Efficiency

reduced input tokens by compressing workflow prompts and rolesreuse of cached attention states to avoid re-encoding prefixes

Infra Optimization

batching via vLLMlong-context config (16k) for open-weight models

System Optimization

use vLLM for open-weight KV cache experimentssimulate KV cost for closed APIs

Inference Optimization

KV cache reuse across turnscompact workflows with fewer agent turnscompaction via deterministic summarization to limit context growth

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU

Risks & Boundaries

Limitations

Single-agent simulation assumes agents share the same base LLM; it cannot simulate true heterogeneity because KV caches are model-specific.

KV-cache cost estimates for closed APIs are simulated using final message lists; real API runtimes may differ.

When Not To Use

When agent roles require different base models with genuinely distinct capabilities.

When tool side-effects are nondeterministic and break the single-agent simulation assumptions.

Failure Modes

Context bloat and prompt interference from very long multi-turn histories.

Non-deterministic tools or external side-effects violate the simulation proof assumptions and can change behavior.

Core Entities

Models

GPT-4o-miniClaude-3.5-HaikuClaude-4.0-SonnetQwen-3-8B

Metrics

pass@1F1solve rate (%)Accuracytask success rate (%)inference cost (USD)

Datasets

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU

Benchmarks

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.

Single-agent execution reduces inference cost substantially by reusing KV cache.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

AgentAuditor: memory‑augmented RAG + CoT that makes LLM evaluators reach human-level accuracy on agent safety

Key finding

Metamorphic tests show many LLM agents give different answers to the same problem when phrased differently

Key finding

R-Judge: a human-curated benchmark (569 agent logs) that tests whether LLMs spot safety risks in agent interactions

Key finding

DeceptGuard: detect agent deception by reading CoT text and activation probes

Key finding

Assigning demographic personas to LLM agents can change decisions and cut task success by up to 26%

Key finding