BAVT: a training-free tree search that spends fewer tokens and tool calls to match or beat brute-force scaling

March 13, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

BAVT is practical now for multi-hop QA agents: it avoids fine-tuning, includes reproducible prompts and hyperparameters, and shows consistent gains across models and budgets.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 85%

Production readiness: 70%

Novelty: 65%

Authors

Yushu Li, Wenlong Deng, Jiajin Li, Xiaoxiao Li

Why It Matters For Business

BAVT cuts expensive external tool calls and tokens by making step-level, budget-aware choices at inference; this often matches higher-budget accuracy and reduces API costs.

Who Should Care

Summary TLDR

This paper introduces BAVT, a training-free inference framework that models multi-step agent reasoning as a dynamic search tree. A prompt-based critic scores step-level progress (residual value), and a budget-conditioned exponent shifts node sampling from wide exploration to greedy exploitation as resources run out. Evaluated on four multi-hop QA benchmarks with two LLM families, BAVT consistently improves Exact Match and F1 under strict token and tool-call budgets. Under tight budgets (5 tool calls), BAVT on a reasoning model matches or outperforms a baseline that uses 4× the resources. The method is practical (no fine-tuning), has an explicit budget backstop, and includes a theoretical PAC

Problem Statement

Current LLM agents assume abundant compute and waste tokens or costly tool calls on dead ends. Existing budget-aware fixes either need expensive fine-tuning or only adjust at the whole-trajectory level and cannot abandon failing paths mid-execution. The question: how to improve agent correctness under strict token and tool-call budgets by making step-level, budget-aware decisions at inference time?

Main Contribution

Budget-Aware Value Tree (BAVT): a training-free, inference-time tree search that uses a single LLM as both generator and prompt-based critic to guide multi-hop agent reasoning.

Residual step-level value critic: predicts marginal information gain (delta) to reliably prune uninformative or redundant actions and reduce overconfidence.
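The residual critic can be pictured as follows. This is a minimal sketch, not the paper's implementation: the `Node` class, the `expand` helper, the clipping of values to [0, 1], the `prune_below` threshold, and the `toy_critic` stand-in for the LLM prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    step: str                       # action or observation text for this step
    value: float                    # running progress estimate, kept in [0, 1]
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


# Hypothetical critic signature: (parent node, proposed step) -> delta score.
Critic = Callable[[Node, str], float]


def expand(parent: Node, step: str, critic: Critic,
           prune_below: float = 0.0) -> Optional[Node]:
    """Score a step by its marginal gain; return None if it is pruned."""
    delta = critic(parent, step)
    if delta <= prune_below:
        return None  # no new information: do not spend more budget here
    # Residual update: the child's value is the parent's value plus the delta.
    value = min(1.0, max(0.0, parent.value + delta))
    child = Node(step=step, value=value, parent=parent)
    parent.children.append(child)
    return child


def toy_critic(parent: Node, step: str) -> float:
    """Stand-in for the LLM critic: a step already on the path earns no gain."""
    node: Optional[Node] = parent
    while node is not None:
        if node.step == step:
            return 0.0
        node = node.parent
    return 0.2


root = Node(step="question", value=0.1)
kept = expand(root, "search: author of the novel", toy_critic)
dropped = expand(kept, "search: author of the novel", toy_critic)
```

Here the first search is kept because it adds information, while the repeated search receives a zero delta and is pruned. Scoring the change rather than the absolute state is what lets redundant actions be dropped without trusting an inflated absolute score.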

Key Findings

BAVT on OSS-20B with Low budget (5 tool calls) achieves higher Exact Match than the baseline running with High budget (20 calls).

Numbers: OSS-20B Low (BAVT) EM 0.338 vs baseline High EM 0.334

Practical Use: Use value-guided tree search to reduce tool calls and tokens: you can often match high-budget accuracy with a quarter of the resources.

Evidence Ref: Section 4.2, Figure 3; main results

Full BAVT (tree + step-value + budget-aware selection) raises average EM from baseline 0.268 to 0.388 on evaluated datasets.

Numbers: Baseline AVG EM 0.268 → BAVT AVG EM 0.388

Practical Use: Combine step-level verification and budget-aware sampling; both are needed to get meaningful accuracy gains under tight budgets.

Evidence Ref: Section 4.3 Ablation Table 1 (AVG EM row)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match (EM) | BAVT OSS-20B Low: 0.338 | Parallel sampling baseline OSS-20B High: 0.334 | +0.004 vs baseline high | Average across four multi-hop QA benchmarks | Section 4.2 main results; Figure 3 | Figure 3, Section 4.2
Exact Match (EM) | BAVT AVG EM 0.388 | Baseline AVG EM 0.268 | +0.120 absolute (avg) | Ablation average (OSS-20B, Middle budget) | Section 4.3 Ablation Table 1 | Table 1
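The deltas in the results above can be re-derived from the reported scores. The short check below uses only numbers quoted on this page; the helper names are illustrative, and the relative gain is a derived figure rather than one reported in the source.

```python
def absolute_delta(new: float, base: float) -> float:
    """Absolute EM difference, rounded to the three decimals reported."""
    return round(new - base, 3)


def relative_gain(new: float, base: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    return (new - base) / base


low_vs_high = absolute_delta(0.338, 0.334)   # BAVT Low vs baseline High
ablation_abs = absolute_delta(0.388, 0.268)  # full BAVT vs baseline (AVG EM)
ablation_rel = relative_gain(0.388, 0.268)   # same comparison, relative
budget_fraction = 5 / 20                     # Low budget calls / High budget calls
```

The ablation gain of 0.120 absolute corresponds to roughly a 45% relative improvement over the 0.268 baseline, and the Low budget is one quarter of the High budget's tool calls.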

What To Try In 7 Days

Prototype BAVT prompts in your agent: add a step-level critic prompt that outputs a small delta score after each tool call.

Implement a simple budget ratio r (remaining budget divided by total budget) and raise node values to the power 1/r when sampling, so high-value branches are favored more strongly as the budget drops.

Measure cost per sample and rerun key workloads with and without step pruning to quantify tool-call savings.
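The budget-ratio step in the list above can be sketched as follows. This is one reading, not the paper's code: it treats r as the fraction of tool calls remaining and uses 1/r as an exponent on node values, in line with the budget-conditioned exponent described in the summary. The function names, the normalization by the best value, and the `eps` floor are assumptions.

```python
import random
from typing import List, Sequence


def budget_ratio(used_calls: int, max_calls: int) -> float:
    """Fraction of the tool-call budget still available, in [0, 1]."""
    return max(0.0, (max_calls - used_calls) / max_calls)


def selection_weights(values: Sequence[float], r: float,
                      eps: float = 1e-6) -> List[float]:
    """Sampling probabilities proportional to value ** (1 / r)."""
    exponent = 1.0 / max(r, eps)  # grows as the budget shrinks
    top = max(values)
    if top <= 0.0:
        return [1.0 / len(values)] * len(values)  # nothing to prefer yet
    # Dividing by the best value keeps the power numerically stable.
    scaled = [(max(v, 0.0) / top) ** exponent for v in values]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_node(values: Sequence[float], r: float,
                rng: random.Random) -> int:
    """Draw the index of the next node to expand."""
    weights = selection_weights(values, r)
    return rng.choices(range(len(values)), weights=weights, k=1)[0]


values = [0.2, 0.4, 0.8]
wide = selection_weights(values, budget_ratio(0, 20))     # full budget
narrow = selection_weights(values, budget_ratio(15, 20))  # a quarter left
greedy = selection_weights(values, budget_ratio(20, 20))  # exhausted
```

With a full budget the weights are simply proportional to the values, so weaker branches still get explored. With a quarter of the budget left the exponent is 4 and most of the probability mass moves to the best node; at exhaustion the choice is effectively greedy.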

Agent Features

Memory
short-term context appended to nodes
Planning
tree-structured planning; budget-conditioned node selection
Tool Use
retrieval/web search (external tool calls)
Frameworks
Inspect AI
Is Agentic

Yes

Architectures
single-LM actor-critic (generator + prompt critic)

Optimization Features

Token Efficiency
step-level pruning to reduce tool calls and output tokens; budget backstop to force deterministic final answer when resources near exhaustion
System Optimization
single-LM generator/critic to avoid fine-tuning; global backpropagation of values after first terminal answer
Inference Optimization
test-time tree search; budget-conditioned sampling exponent
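A budget backstop of the kind listed above could look like the sketch below. The source only states that a deterministic final answer is forced when resources near exhaustion; the `Budget` class, the 10% `reserve_ratio`, the rule of taking the tighter of the token and tool-call ratios, and the `"final_answer"` action name are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_tool_calls: int
    max_tokens: int
    used_tool_calls: int = 0
    used_tokens: int = 0

    def ratio(self) -> float:
        """Remaining fraction of whichever resource is closer to running out."""
        calls = (self.max_tool_calls - self.used_tool_calls) / self.max_tool_calls
        tokens = (self.max_tokens - self.used_tokens) / self.max_tokens
        return max(0.0, min(calls, tokens))


def must_answer_now(budget: Budget, reserve_ratio: float = 0.1) -> bool:
    """True once the remaining budget falls to the reserved margin."""
    return budget.ratio() <= reserve_ratio


def next_action(budget: Budget, proposed: str) -> str:
    """Override the proposed action with a final answer near exhaustion."""
    return "final_answer" if must_answer_now(budget) else proposed


early = Budget(max_tool_calls=5, max_tokens=4000, used_tool_calls=2, used_tokens=1000)
late = Budget(max_tool_calls=5, max_tokens=4000, used_tool_calls=5, used_tokens=1000)
```

In this sketch `next_action(early, "search")` passes the search through, while `next_action(late, "search")` is overridden because no tool calls remain. Keeping a small reserve means the agent always has enough budget left to emit an answer instead of ending mid-search.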

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

HotpotQA (public), 2WikiMultihopQA (public), MuSiQue (public), Bamboogle (public), 2018 Wikipedia dump (used for retrieval)

Risks & Boundaries

Limitations

Prompt-based critic causes extra inference overhead and consumes part of the token budget.

Evaluations focus on web search as a single external tool with uniform cost; real deployments have heterogeneous, asymmetric tool costs.

When Not To Use

If external tools are extremely cheap and tool-call cost is negligible versus model latency.

For irreversible, long-horizon control tasks without adapting the value function for delayed rewards.

Failure Modes

Critic inference cost can offset savings when tasks are extremely cheap or trivial.

Over-pruning: aggressive budget-driven exploitation may discard rare but correct exploratory paths if critic is miscalibrated.

Core Entities

Models

GPT-OSS-20B, Qwen3-30B-A3B-Instruct-2507, E5 (dense retriever)

Metrics

Exact Match (EM), F1

Datasets

HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle, 2018 Wikipedia dump

Benchmarks

multi-hop QA (HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle)