A single LLM can role-play homogeneous multi-agent workflows and cut inference cost via KV-cache reuse

January 18, 20268 min

Overview

Decision SnapshotReady For Pilot

The single-agent baseline is practically useful when agents share the same base model and tools; evidence spans multiple datasets and both closed- and open-weight models.

Citations2

Evidence Strength0.80

Confidence0.80

Risk Signals9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding

Links

Abstract / PDF / Data

Why It Matters For Business

If your agent pipeline uses the same base LLM, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying stack.

Who Should Care

Summary TLDR

Most multi-agent LLM workflows are homogeneous (same base model, different prompts/tools). The authors show a single LLM can role-play those agents in a multi-turn conversation, reuse the model's KV cache (attention state), match or slightly exceed multi-agent accuracy across 7 benchmarks, and reduce inference cost. They introduce OneFlow, an MCTS + dual-meta-LLM method that finds compact workflows optimized for single-agent execution. Limitation: true heterogeneous workflows (different base models) still cannot be simulated because KV caches cannot be shared.

Problem Statement

Are homogeneous multi-agent workflows (several agents built on the same base LLM) actually necessary, or can a single LLM simulate their behavior via multi-turn conversations and shared KV cache to keep accuracy while cutting inference cost?

Main Contribution

Empirical finding that a single LLM role-playing multiple homogeneous agents matches or slightly improves multi-agent performance on seven diverse benchmarks.

OneFlow: an automatic workflow search algorithm (MCTS + two meta-LLMs) that finds compact, cost-efficient workflows suited for single-agent execution.

Key Findings

A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.

NumbersHumanEval: OneFlow multi-agent 91.6% → OneFlow single-agent 92.1% (Table 1)

Practical UseTry implementing homogeneous multi-agent workflows as a single multi-turn agent first; you will likely retain accuracy and simplify deployment.

Evidence RefTable 1 (Main results with GPT-4o-mini)

Single-agent execution reduces inference cost substantially by reusing KV cache.

NumbersGSM8K cost: OneFlow multi $0.623 → OneFlow single-agent $0.387 (≈38% lower) (Table 2)

Practical UseIf agents use the same base model, switch to single-agent multi-turn execution to lower API spend and token expenditure.

Evidence RefTable 2 (Inference cost with GPT-4o-mini)

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
pass@1 (HumanEval)OneFlow (single-agent) 92.1% ±0.4IO 89.1% ±0.4+3.0 pts vs IOHumanEvalTable 1: OneFlow (single-agent) 92.1% vs IO 89.1%Table 1
F1 (DROP)OneFlow (Claude 3.5) 87.5% ±0.0AFlow (heterogeneous) 85.5% ±0.5+2.0 pts vs heterogeneous AFlowDROPTable 3: OneFlow (Claude 3.5) 87.5% vs AFlow hetero 85.5%Table 3

What To Try In 7 Days

Replace homogeneous multi-agent invocations with a single multi-turn LLM and compare accuracy and token costs on a held-out sample.

Run OneFlow (or a lightweight MCTS search) to compress long workflows into fewer, stronger agent roles and re-evaluate cost.

If using open-weight models, enable KV cache (vLLM) and measure latency/throughput trade-offs on multi-turn execution.

Agent Features

Memory
KV cache reuse (attention state caching)
Planning
task decomposition via workflow graphMCTS-based workflow search
Tool Use
sandboxed Python operatorsexternal tool calls routed by workflow
Frameworks
OneFlowAFlowvLLM
Is Agentic

Yes

Architectures
homogeneous multi-agent workflowsingle-LLM multi-turn simulation
Collaboration
role-playing through multi-turn conversationdesigner + critic meta-LLMs for workflow search

Optimization Features

Token Efficiency
reduced input tokens by compressing workflow prompts and rolesreuse of cached attention states to avoid re-encoding prefixes
Infra Optimization
batching via vLLMlong-context config (16k) for open-weight models
System Optimization
use vLLM for open-weight KV cache experimentssimulate KV cost for closed APIs
Inference Optimization
KV cache reuse across turnscompact workflows with fewer agent turnscompaction via deterministic summarization to limit context growth

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU

Risks & Boundaries

Limitations

Single-agent simulation assumes agents share the same base LLM; it cannot simulate true heterogeneity because KV caches are model-specific.

KV-cache cost estimates for closed APIs are simulated using final message lists; real API runtimes may differ.

When Not To Use

When agent roles require different base models with genuinely distinct capabilities.

When tool side-effects are nondeterministic and break the single-agent simulation assumptions.

Failure Modes

Context bloat and prompt interference from very long multi-turn histories.

Non-deterministic tools or external side-effects violate the simulation proof assumptions and can change behavior.

Core Entities

Models

GPT-4o-miniClaude-3.5-HaikuClaude-4.0-SonnetQwen-3-8B

Metrics

pass@1F1solve rate (%)Accuracytask success rate (%)inference cost (USD)

Datasets

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU

Benchmarks

HumanEvalMBPPGSM8KMATHHotpotQADROPTravelPlannerShopping-MMLU