Plan compute at inference time: reusable multi-agent modules + short/long-horizon planning to spend a fixed budget smarter.

December 12, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper reports thorough experiments on two agent benchmarks, with ablations of the planning components, but its cost estimates come from simulated self-play and the evaluation covers only those two benchmarks. Expect moderate engineering work to adapt it to other domains.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 40%

Novelty: 70%

Authors

Dongwon Jung, Peng Shi, Yi Zhang


Why It Matters For Business

FutureWeaver helps you spend inference budget where it matters across cooperating agents, raising task success per dollar. It automates reusable multi-agent patterns and avoids leaving budget unused, which is useful for cost-sensitive production agents that combine search, browsing, and reasoning.

Who Should Care

Summary TLDR

FutureWeaver is a system that plans how to spend inference-time compute across multiple cooperating LLM agents under a fixed monetary/token budget. It (1) extracts reusable "collaboration modules" from self-play, (2) uses a dual-level planner (short-term self-consistency + long-term speculative feasibility) to pick actions, and (3) consistently raises accuracy on two agent benchmarks while using budget more effectively than baselines.

Problem Statement

Existing test-time scaling methods (more sampling, verification) help single LLMs but do not tell a multi-agent system how to split a fixed inference budget across agents and interactions. That causes underused budgets or wasted compute and poor coordination.

Main Contribution

Formalized budget-constrained test-time compute allocation for orchestrator-worker multi-agent systems.

Introduced collaboration modules: callable, reusable multi-agent workflows derived by LLM-based self-play reflection.
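
To make "callable, reusable multi-agent workflows" concrete, here is a minimal sketch of two collaboration modules wrapped as plain functions. It is not the paper's implementation: the `Module` dataclass, the `Worker` signature, the stub workers, and the cost figures are all hypothetical and stand in for real LLM agents and measured costs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical worker signature: text in, text out (a real worker would call an LLM).
Worker = Callable[[str], str]


@dataclass
class Module:
    """A named, reusable multi-agent workflow with an estimated cost."""
    name: str
    run: Callable[[str], str]
    est_cost_usd: float  # assumed to come from self-play statistics


def search_then_read(search: Worker, read: Worker) -> Callable[[str], str]:
    """Pipeline module: the reader consumes whatever the searcher returns."""
    def run(query: str) -> str:
        return read(search(query))
    return run


def ensemble_reasoning(reasoners: List[Worker]) -> Callable[[str], str]:
    """Ensemble module: majority vote over the reasoners' answers."""
    def run(query: str) -> str:
        answers = [reason(query) for reason in reasoners]
        return Counter(answers).most_common(1)[0][0]
    return run


# Stub workers so the sketch runs without any model calls.
def fake_search(query: str) -> str:
    return f"docs about {query}"


def fake_read(docs: str) -> str:
    return f"summary of {docs}"


# The orchestrator sees modules as ordinary callables in a registry.
REGISTRY: Dict[str, Module] = {
    "search_then_read": Module(
        "search_then_read", search_then_read(fake_search, fake_read), 0.04),
    "ensemble_reasoning": Module(
        "ensemble_reasoning",
        ensemble_reasoning([lambda q: "A", lambda q: "B", lambda q: "A"]),
        0.12),
}

if __name__ == "__main__":
    print(REGISTRY["search_then_read"].run("GAIA"))
    print(REGISTRY["ensemble_reasoning"].run("which option?"))
```

Keeping the estimated cost next to the callable lets a planner reason about a module's price without running it.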

Key Findings

FutureWeaver improves accuracy on GAIA with Claude models at low budget.

Numbers: Acc@0.2: FutureWeaver 38.89% vs ReAct 35.80% (+3.09 pp)

Practical Use: If you run Claude-family orchestrators on web/QA tasks with a small budget, replacing vanilla ReAct with FutureWeaver can yield a few percentage points higher accuracy.

Evidence Ref: Table 1 / Table 4

Gains grow at higher budgets on GAIA with Claude.

Numbers: Acc@0.5: FutureWeaver 48.15% vs ReAct 35.80% (+12.35 pp)

Practical Use: When more budget is available, FutureWeaver's planning pays off more: the gain over ReAct rises from +3.09 pp at Acc@0.2 to +12.35 pp at Acc@0.5, so expect ROI to grow with budget.

Evidence Ref: Table 1 / Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Acc@0.2 | 38.89% | ReAct (no modules) | +3.09 pp | GAIA (Claude family) | Table 1 / Table 4 | Table 1 / Table 4
Acc@0.5 | 48.15% | ReAct (no modules) | +12.35 pp | GAIA (Claude family) | Table 1 / Table 4 | Table 1 / Table 4
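
The deltas above are absolute percentage-point differences. As a quick check, the sketch below recomputes them from the reported accuracies and also derives the relative improvement over the baseline. The relative figure is not stated in the source and is computed here only for illustration; the function names are hypothetical.

```python
def pp_delta(method_acc: float, baseline_acc: float) -> float:
    """Absolute gain in percentage points."""
    return method_acc - baseline_acc


def relative_gain_pct(method_acc: float, baseline_acc: float) -> float:
    """Gain expressed as a percentage of the baseline accuracy."""
    return 100.0 * (method_acc - baseline_acc) / baseline_acc


# (FutureWeaver accuracy, ReAct accuracy) on GAIA with Claude models, as reported above.
RESULTS = {
    "Acc@0.2": (38.89, 35.80),
    "Acc@0.5": (48.15, 35.80),
}

if __name__ == "__main__":
    for metric, (fw, react) in RESULTS.items():
        print(f"{metric}: +{pp_delta(fw, react):.2f} pp absolute, "
              f"{relative_gain_pct(fw, react):.1f}% relative")
    # Acc@0.2: +3.09 pp absolute, 8.6% relative
    # Acc@0.5: +12.35 pp absolute, 34.5% relative
```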

What To Try In 7 Days

Run self-play on a small validation set to collect token-cost statistics for your agents (the paper uses 30 queries).

Implement simple collaboration modules (e.g., 'search then read', 'ensemble reasoning') as callable functions and plug them into your orchestrator.

Add a lightweight two-stage planner: score immediate proposals by self-consistency and perform cheap symbolic lookahead over modules to filter budget-infeasible plans.
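
The last step can be prototyped in a few lines. The sketch below is a simplified stand-in, not the paper's planner: it scores sampled action proposals by agreement (self-consistency), then drops any action that would leave too little budget for the cheapest known way to finish. The function names, `min_finish_cost`, and the example costs are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def self_consistency_scores(proposals: List[str]) -> Dict[str, float]:
    """Score each distinct action by the fraction of samples that proposed it."""
    counts = Counter(proposals)
    return {action: n / len(proposals) for action, n in counts.items()}


def is_feasible(action: str, remaining_budget: float,
                est_cost: Dict[str, float], min_finish_cost: float) -> bool:
    """Cheap lookahead: after paying for `action`, can the task still be finished?"""
    return est_cost[action] + min_finish_cost <= remaining_budget


def pick_action(proposals: List[str], remaining_budget: float,
                est_cost: Dict[str, float],
                min_finish_cost: float) -> Optional[str]:
    """Return the most agreed-upon budget-feasible action, or None if none fits."""
    scores = self_consistency_scores(proposals)
    feasible = {a: s for a, s in scores.items()
                if is_feasible(a, remaining_budget, est_cost, min_finish_cost)}
    if not feasible:
        return None
    # Highest agreement wins; ties go to the cheaper action.
    return max(feasible, key=lambda a: (feasible[a], -est_cost[a]))


if __name__ == "__main__":
    # Hypothetical per-module cost estimates in USD.
    est_cost = {"search_then_read": 0.04, "ensemble_reasoning": 0.12}
    samples = ["ensemble_reasoning", "ensemble_reasoning", "search_then_read"]
    print(pick_action(samples, 0.20, est_cost, min_finish_cost=0.02))
    print(pick_action(samples, 0.10, est_cost, min_finish_cost=0.02))
    print(pick_action(samples, 0.05, est_cost, min_finish_cost=0.02))
```

With a 0.20 budget the majority proposal is affordable and wins; at 0.10 it is filtered out and the cheaper module is chosen; at 0.05 nothing fits and the caller has to fall back, for example by answering directly.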

Agent Features

Memory
self-play trajectory logs (cost and patterns)
Planning
short-horizon self-consistency scoring
long-horizon speculative rollouts (A*-like)
Tool Use
collaboration modules (function calls)
self-play reflection to induce modules
Frameworks
FutureWeaver
Is Agentic

Yes

Architectures
orchestrator-worker
dual-level planning (short + long horizon)
Collaboration
ensemble modules
interactive search/browse pipelines
critic-inserted refinement

Optimization Features

Token Efficiency
token-based monetary cost estimation
average cost per action from self-play
System Optimization
module reuse to amortize coordination cost
dual-level planning to avoid premature compute spending
Inference Optimization
budget-aware compute allocation
speculative long-horizon feasibility to steer spending
short-term self-consistency to pick promising actions
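
The token-efficiency items above amount to converting logged token counts into dollars and averaging per action type. A minimal sketch follows, assuming a log of `(action, input_tokens, output_tokens)` records from self-play. The record layout, function names, and per-million-token prices are hypothetical placeholders; substitute your provider's actual rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per million tokens; replace with real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def call_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Monetary cost of one model call from its token counts."""
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000


def average_cost_per_action(
        log: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Average USD cost per action type over a self-play log."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for action, in_tok, out_tok in log:
        totals[action] += call_cost_usd(in_tok, out_tok)
        counts[action] += 1
    return {action: totals[action] / counts[action] for action in totals}


if __name__ == "__main__":
    # Toy self-play log: (action, input tokens, output tokens).
    self_play_log = [
        ("search", 1000, 200),
        ("search", 3000, 200),
        ("reason", 2000, 1000),
    ]
    for action, cost in average_cost_per_action(self_play_log).items():
        print(f"{action}: ${cost:.4f} per call")
```

These averages are what a budget-aware planner would consume as its per-action cost estimates, which is also why they can mislead when the deployment task mix differs from the validation set.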

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

GAIA (Mialon et al., 2023)
BrowseComp-Plus (Chen et al., 2025b)

Risks & Boundaries

Limitations

Cost estimates are averaged across observed subtasks and assumed to transfer; they may be inaccurate on new task distributions.

Collaboration modules are induced from self-play on a small validation set (30 queries); limited data may miss rarer but useful patterns.

When Not To Use

If you have no reliable validation trajectories to estimate action costs.

When per-query budgets are extremely tiny and only single-agent quick replies are feasible.

Failure Modes

Wrong cost estimates lead the long-horizon planner to prune useful plans or overspend early.

Self-play may induce modules that overfit validation workflows and fail on different tasks.

Core Entities

Models

Claude-3.7-Sonnet
Claude-3.5-Haiku
Qwen3-32B

Metrics

Acc@B
Average total cost (token $)

Datasets

GAIA
BrowseComp-Plus

Benchmarks

GAIA
BrowseComp-Plus