Plan compute at inference time: reusable multi-agent modules + short/long-horizon planning to spend a fixed budget smarter.

Overview

Decision Snapshot: Needs Validation

The paper reports thorough experiments on two agent benchmarks, with ablations of the planning components, but its cost estimates come from simulated self-play and it evaluates on only two benchmarks. Expect moderate engineering work to adapt it to other domains.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 70%

Authors

Dongwon Jung, Peng Shi, Yi Zhang

Why It Matters For Business

FutureWeaver helps you spend inference budget where it matters across cooperating agents, raising task success per dollar. It automates reusable multi-agent patterns and avoids leaving budget unused, which is useful for cost-sensitive production agents that combine search, browsing, and reasoning.

Who Should Care

CTO, Product Manager, ML Engineer, Founder

Summary TLDR

FutureWeaver is a system that plans how to spend inference-time compute across multiple cooperating LLM agents under a fixed monetary/token budget. It (1) extracts reusable "collaboration modules" from self-play, (2) uses a dual-level planner (short-term self-consistency + long-term speculative feasibility) to pick actions, and (3) consistently raises accuracy on two agent benchmarks while using budget more effectively than baselines.

Problem Statement

Existing test-time scaling methods (more sampling, verification) help single LLMs but do not tell a multi-agent system how to split a fixed inference budget across agents and interactions. That causes underused budgets or wasted compute and poor coordination.

Main Contribution

Formalized budget-constrained test-time compute allocation for orchestrator-worker multi-agent systems.

Introduced collaboration modules: callable, reusable multi-agent workflows derived by LLM-based self-play reflection.
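One way to picture a collaboration module is as a plain function that chains worker agents and that the orchestrator can call like any other tool. The sketch below is illustrative only: the `Module` dataclass, the worker names (`search`, `reader`, `reasoner`), the cost figures, and the stub workers are all assumptions made for this example. The paper induces its modules from self-play reflection rather than writing them by hand.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical worker interface: a worker maps a prompt string to a reply string.
Worker = Callable[[str], str]


@dataclass(frozen=True)
class Module:
    """A named, reusable multi-agent workflow the orchestrator can call."""
    name: str
    est_cost_usd: float  # assumed average cost, e.g. measured during self-play
    run: Callable[[str, Dict[str, Worker]], str]


def search_then_read(query: str, workers: Dict[str, Worker]) -> str:
    """Run the search worker, then pass its results to the reader worker."""
    hits = workers["search"](query)
    return workers["reader"](f"Question: {query}\nSources: {hits}")


def ensemble_reasoning(query: str, workers: Dict[str, Worker], n: int = 3) -> str:
    """Query the reasoner n times and return the most common answer."""
    answers = [workers["reasoner"](query) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Registry the orchestrator would expose as callable tools (costs are made up).
MODULES = {
    m.name: m
    for m in (
        Module("search_then_read", 0.08, search_then_read),
        Module("ensemble_reasoning", 0.15, ensemble_reasoning),
    )
}

# Stub workers so the sketch runs without any LLM or network access.
stub_workers: Dict[str, Worker] = {
    "search": lambda q: "doc-1",
    "reader": lambda p: "read " + p.split("Sources: ")[1],
    "reasoner": lambda q: "42",
}

result = MODULES["search_then_read"].run("capital of France?", stub_workers)
```

Keeping an estimated cost next to each module is what lets a budget-aware planner reason about a whole workflow as a single priced action.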

Key Findings

FUTUREWEAVER improves accuracy on GAIA with Claude models at low budget.

Numbers: Acc@0.2 is 38.89% for FUTUREWEAVER vs 35.80% for ReAct (+3.09 pp)

Practical Use: If you run Claude-family orchestrators on web/QA tasks with a small budget, replacing vanilla ReAct with FutureWeaver can yield a few percentage points higher accuracy.

Evidence Ref: Table 1 / Table 4

Gains grow at higher budgets on GAIA with Claude.

Numbers: Acc@0.5 is 48.15% for FUTUREWEAVER vs 35.80% for ReAct (+12.35 pp)

Practical Use: When more budget is available, FutureWeaver's planning pays off more, so plan for a higher return as the budget grows.

Evidence Ref: Table 1 / Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Acc@0.2	38.89%	ReAct (no modules)	+3.09 pp	GAIA (Claude family)	Table 1 / Table 4	Table 1 / Table 4
Acc@0.5	48.15%	ReAct (no modules)	+12.35 pp	GAIA (Claude family)	Table 1 / Table 4	Table 1 / Table 4

What To Try In 7 Days

Run self-play on a small validation set to collect token-cost statistics for your agents (the paper uses 30 queries).

Implement simple collaboration modules (e.g., 'search then read', 'ensemble reasoning') as callable functions and plug them into your orchestrator.

Add a lightweight two-stage planner: score immediate proposals by self-consistency and perform cheap symbolic lookahead over modules to filter budget-infeasible plans.
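The two-stage planner in the last step could start out as small as the sketch below. It scores sampled next-action proposals by how often they agree (self-consistency), then drops any action that would leave too little budget to finish. The feasibility test here is a one-step bound standing in for the paper's long-horizon speculative rollouts, and the action names, costs, and the `answer` finishing action are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def self_consistency_scores(proposals: List[str]) -> Dict[str, float]:
    """Score each proposed action by the fraction of samples that chose it."""
    counts = Counter(proposals)
    return {action: n / len(proposals) for action, n in counts.items()}


def is_feasible(action: str, remaining_usd: float, costs: Dict[str, float],
                finish_action: str = "answer") -> bool:
    """Cheap lookahead: after paying for `action`, can we still afford to finish?"""
    left = remaining_usd - costs[action]
    if action == finish_action:
        return left >= 0
    return left >= costs[finish_action]


def pick_action(proposals: List[str], remaining_usd: float,
                costs: Dict[str, float]) -> Optional[str]:
    """Return the highest-scoring budget-feasible action, or None if none fits."""
    scores = self_consistency_scores(proposals)
    feasible = {a: s for a, s in scores.items()
                if is_feasible(a, remaining_usd, costs)}
    if not feasible:
        return None
    return max(feasible, key=feasible.get)


# Hypothetical average costs per action, in USD.
costs = {"search_then_read": 0.08, "ensemble_reasoning": 0.15, "answer": 0.02}
# Three sampled proposals from the orchestrator for the next step.
proposals = ["ensemble_reasoning", "ensemble_reasoning", "search_then_read"]

choice_large = pick_action(proposals, remaining_usd=0.20, costs=costs)
choice_small = pick_action(proposals, remaining_usd=0.12, costs=costs)
choice_none = pick_action(proposals, remaining_usd=0.01, costs=costs)
```

With 0.20 left the majority proposal fits, with 0.12 left the planner falls back to the cheaper module, and with 0.01 left nothing is affordable and the caller has to decide what to do (for example, answer from what it already has).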

Agent Features

Memory

self-play trajectory logs (cost and patterns)

Planning

short-horizon self-consistency scoring, long-horizon speculative rollouts (A*-like)

Tool Use

collaboration modules (function calls), self-play reflection to induce modules

Frameworks

FUTUREWEAVER

Is Agentic

Yes

Architectures

orchestrator-worker, dual-level planning (short + long horizon)

Collaboration

ensemble modules, interactive search/browse pipelines, critic-inserted refinement

Optimization Features

Token Efficiency

token-based monetary cost estimation, average cost per action from self-play
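As a rough illustration of these two ideas, the sketch below converts token counts into dollars and averages the result per action over a self-play log. The per-million-token prices, the log format, and the action names are assumptions for this example; substitute your provider's actual rates and your own trajectory schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token-based monetary cost of one model call."""
    total = (input_tokens * PRICE_PER_MTOK["input"]
             + output_tokens * PRICE_PER_MTOK["output"])
    return total / 1_000_000


def average_action_costs(
    log: Iterable[Tuple[str, int, int]]
) -> Dict[str, float]:
    """Average USD cost per action from (action, input_tokens, output_tokens) rows."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for action, tokens_in, tokens_out in log:
        totals[action] += call_cost_usd(tokens_in, tokens_out)
        counts[action] += 1
    return {action: totals[action] / counts[action] for action in totals}


# Made-up self-play log rows: (action, input tokens, output tokens).
self_play_log = [
    ("search", 1000, 200),
    ("search", 3000, 200),
    ("reason", 2000, 1000),
]
avg_costs = average_action_costs(self_play_log)
```

These averages are only as good as the log they come from, so the estimates can drift when the deployed task mix differs from the validation queries.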

System Optimization

module reuse to amortize coordination cost, dual-level planning to avoid premature compute spending

Inference Optimization

budget-aware compute allocation, speculative long-horizon feasibility to steer spending, short-term self-consistency to pick promising actions

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

GAIA (Mialon et al., 2023); BrowseComp-Plus (Chen et al., 2025b)

Risks & Boundaries

Limitations

Cost estimates are averaged across observed subtasks and assumed transferable, which may misestimate cost on new task distributions.

Collaboration modules are induced from self-play on a small validation set (30 queries); limited data may miss rarer but useful patterns.

When Not To Use

If you have no reliable validation trajectories to estimate action costs.

When per-query budgets are extremely tiny and only single-agent quick replies are feasible.

Failure Modes

Wrong cost estimates lead the long-horizon planner to prune useful plans or overspend early.

Self-play may induce modules that overfit validation workflows and fail on different tasks.

Core Entities

Models

Claude-3.7-Sonnet, Claude-3.5-Haiku, Qwen3-32B

Metrics

Acc@B, Average total cost (token $)
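This summary does not define the metrics precisely, so the sketch below rests on an assumption: Acc@B is read as the fraction of queries answered correctly while spending at most B dollars, and average total cost as the mean spend per query. The paper may instead rerun the system separately at each budget; the run records here are made up for illustration.

```python
from typing import List, Tuple

# Each run is (answered_correctly, total_cost_usd) for one query.
Run = Tuple[bool, float]


def acc_at_budget(runs: List[Run], budget_usd: float) -> float:
    """Assumed Acc@B: share of runs that are correct and within budget_usd."""
    if not runs:
        raise ValueError("no runs to score")
    hits = sum(1 for correct, cost in runs if correct and cost <= budget_usd)
    return hits / len(runs)


def average_total_cost(runs: List[Run]) -> float:
    """Mean USD spent per query across all runs."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(cost for _, cost in runs) / len(runs)


# Made-up results for four queries.
runs = [(True, 0.18), (True, 0.35), (False, 0.10), (True, 0.45)]

acc_02 = acc_at_budget(runs, 0.2)
acc_05 = acc_at_budget(runs, 0.5)
mean_cost = average_total_cost(runs)
```

Reporting accuracy and average cost together shows whether a method actually uses the budget it is given or stops early and leaves it unspent.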

Datasets

GAIA, BrowseComp-Plus

Benchmarks

GAIA, BrowseComp-Plus
