Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.85
Citation Count
0
Why It Matters For Business
BAVT cuts expensive external tool calls and tokens by making step-level, budget-aware choices at inference; this often matches higher-budget accuracy and reduces API costs.
Summary TLDR
This paper introduces BAVT, a training-free inference framework that models multi-step agent reasoning as a dynamic search tree. A prompt-based critic scores step-level progress (residual value), and a budget-conditioned exponent shifts node sampling from wide exploration to greedy exploitation as resources run out. Evaluated on four multi-hop QA benchmarks with two LLM families, BAVT consistently improves Exact Match and F1 under strict token and tool-call budgets. Under tight budgets (5 tool calls) BAVT on a reasoning model matches or outperforms a baseline that uses 4× more resources. The method is practical (no fine-tuning), has an explicit budget backstop, and includes a theoretical PAC
Problem Statement
Current LLM agents assume abundant compute and waste tokens or costly tool calls on dead ends. Existing budget-aware fixes either need expensive fine-tuning or only adjust at the whole-trajectory level and cannot abandon failing paths mid-execution. The question: how to improve agent correctness under strict token and tool-call budgets by making step-level, budget-aware decisions at inference time?
Main Contribution
Budget-Aware Value Tree (BAVT): a training-free, inference-time tree search that uses a single LLM as both generator and prompt-based critic to guide multi-hop agent reasoning.
Residual step-level value critic: predicts marginal information gain (delta) to reliably prune uninformative or redundant actions and reduce overconfidence.
Budget-conditioned node selection: exponentiates node values by 1/(remaining budget ratio) to shift from exploration to exploitation as resources deplete, with a provable convergence guarantee.
Extensive evaluation on four multi-hop QA datasets and two model families showing superior performance-efficiency trade-offs, including beating baselines that use 4× more resources.
Key Findings
BAVT on OSS-20B with Low budget (5 tool calls) achieves higher Exact Match than the baseline running with High budget (20 calls).
Full BAVT (tree + step-value + budget-aware selection) raises average EM from baseline 0.268 to 0.388 on evaluated datasets.
Ablation shows step-value alone improves EM from 0.215 to 0.309; adding budget-aware selection pushes EM to 0.388.
External search queries dominate money costs; >90% of estimated per-sample cost is from tool calls.
Results
Exact Match (EM)
Exact Match (EM)
Tool-call cost fraction
Who Should Care
What To Try In 7 Days
Prototype BAVT prompts in your agent: add a step-level critic prompt that outputs a small delta score after each tool call.
Implement a simple budget ratio and amplify node values by 1/r to favor high-value branches as budget drops.
Measure cost per sample and rerun key workloads with and without step pruning to quantify tool-call savings.
Agent Features
Memory
- short-term context appended to nodes
Planning
- tree-structured planning
- budget-conditioned node selection
Tool Use
- retrieval/web search (external tool calls)
Frameworks
- Inspect AI
Is Agentic
true
Architectures
- single-LM actor-critic (generator + prompt critic)
Optimization Features
Token Efficiency
- step-level pruning to reduce tool calls and output tokens
- budget backstop to force deterministic final answer when resources near exhaustion
System Optimization
- single-LM generator/critic to avoid fine-tuning
- global backpropagation of values after first terminal answer
Inference Optimization
- test-time tree search
- budget-conditioned sampling exponent
Reproducibility
Data Urls
- HotpotQA (public)
- 2WikiMultihopQA (public)
- MuSiQue (public)
- Bamboogle (public)
- 2018 Wikipedia dump (used for retrieval)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Prompt-based critic causes extra inference overhead and consumes part of the token budget.
- Evaluations focus on web search as a single external tool with uniform cost; real deployments have heterogeneous, asymmetric tool costs.
- Scope limited to multi-hop QA; extension to long-horizon interactive tasks requires more sophisticated temporal credit assignment.
When Not To Use
- If external tools are extremely cheap and tool-call cost is negligible versus model latency.
- For irreversible, long-horizon control tasks without adapting the value function for delayed rewards.
- If you cannot afford the extra token overhead from the prompt-based critic and lack a lightweight value model.
Failure Modes
- Critic inference cost can offset savings when tasks are extremely cheap or trivial.
- Over-pruning: aggressive budget-driven exploitation may discard rare but correct exploratory paths if critic is miscalibrated.
- Instruct models hit a reasoning capacity ceiling; BAVT cannot overcome base-model limits once evidence volume exceeds model synthesis ability.
Core Entities
Models
- GPT-OSS-20B
- Qwen3-30B-A3B-Instruct-2507
- E5 (dense retriever)
Metrics
- Exact Match (EM)
- F1
Datasets
- HotpotQA
- 2WikiMultihopQA
- MuSiQue
- Bamboogle
- 2018 Wikipedia dump
Benchmarks
- multi-hop QA (HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle)

