BAVT: a training-free tree search that spends fewer tokens and tool calls to match or beat brute-force scaling

March 13, 20268 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.85

Citation Count

0

Authors

Yushu Li, Wenlong Deng, Jiajin Li, Xiaoxiao Li

Links

Abstract / PDF

Why It Matters For Business

BAVT cuts expensive external tool calls and tokens by making step-level, budget-aware choices at inference; this often matches higher-budget accuracy and reduces API costs.

Summary TLDR

This paper introduces BAVT, a training-free inference framework that models multi-step agent reasoning as a dynamic search tree. A prompt-based critic scores step-level progress (residual value), and a budget-conditioned exponent shifts node sampling from wide exploration to greedy exploitation as resources run out. Evaluated on four multi-hop QA benchmarks with two LLM families, BAVT consistently improves Exact Match and F1 under strict token and tool-call budgets. Under tight budgets (5 tool calls) BAVT on a reasoning model matches or outperforms a baseline that uses 4× more resources. The method is practical (no fine-tuning), has an explicit budget backstop, and includes a theoretical PAC

Problem Statement

Current LLM agents assume abundant compute and waste tokens or costly tool calls on dead ends. Existing budget-aware fixes either need expensive fine-tuning or only adjust at the whole-trajectory level and cannot abandon failing paths mid-execution. The question: how to improve agent correctness under strict token and tool-call budgets by making step-level, budget-aware decisions at inference time?

Main Contribution

Budget-Aware Value Tree (BAVT): a training-free, inference-time tree search that uses a single LLM as both generator and prompt-based critic to guide multi-hop agent reasoning.

Residual step-level value critic: predicts marginal information gain (delta) to reliably prune uninformative or redundant actions and reduce overconfidence.

Budget-conditioned node selection: exponentiates node values by 1/(remaining budget ratio) to shift from exploration to exploitation as resources deplete, with a provable convergence guarantee.

Extensive evaluation on four multi-hop QA datasets and two model families showing superior performance-efficiency trade-offs, including beating baselines that use 4× more resources.

Key Findings

BAVT on OSS-20B with Low budget (5 tool calls) achieves higher Exact Match than the baseline running with High budget (20 calls).

NumbersOSS-20B Low (BAVT) EM 0.338 vs baseline High EM 0.334

Full BAVT (tree + step-value + budget-aware selection) raises average EM from baseline 0.268 to 0.388 on evaluated datasets.

NumbersBaseline AVG EM 0.268 → BAVT AVG EM 0.388

Ablation shows step-value alone improves EM from 0.215 to 0.309; adding budget-aware selection pushes EM to 0.388.

NumbersTree-only 0.215 → +Value 0.309 → +Budget 0.388 (AVG EM)

External search queries dominate money costs; >90% of estimated per-sample cost is from tool calls.

NumbersTable 3: search cost accounts for over 90% of total estimated cost per sample

Results

Exact Match (EM)

ValueBAVT OSS-20B Low: 0.338

BaselineParallel sampling baseline OSS-20B High: 0.334

Exact Match (EM)

ValueBAVT AVG EM 0.388

BaselineBaseline AVG EM 0.268

Tool-call cost fraction

Value>90%

Who Should Care

What To Try In 7 Days

Prototype BAVT prompts in your agent: add a step-level critic prompt that outputs a small delta score after each tool call.

Implement a simple budget ratio and amplify node values by 1/r to favor high-value branches as budget drops.

Measure cost per sample and rerun key workloads with and without step pruning to quantify tool-call savings.

Agent Features

Memory

  • short-term context appended to nodes

Planning

  • tree-structured planning
  • budget-conditioned node selection

Tool Use

  • retrieval/web search (external tool calls)

Frameworks

  • Inspect AI

Is Agentic

true

Architectures

  • single-LM actor-critic (generator + prompt critic)

Optimization Features

Token Efficiency

  • step-level pruning to reduce tool calls and output tokens
  • budget backstop to force deterministic final answer when resources near exhaustion

System Optimization

  • single-LM generator/critic to avoid fine-tuning
  • global backpropagation of values after first terminal answer

Inference Optimization

  • test-time tree search
  • budget-conditioned sampling exponent

Reproducibility

Data Urls

  • HotpotQA (public)
  • 2WikiMultihopQA (public)
  • MuSiQue (public)
  • Bamboogle (public)
  • 2018 Wikipedia dump (used for retrieval)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Prompt-based critic causes extra inference overhead and consumes part of the token budget.
  • Evaluations focus on web search as a single external tool with uniform cost; real deployments have heterogeneous, asymmetric tool costs.
  • Scope limited to multi-hop QA; extension to long-horizon interactive tasks requires more sophisticated temporal credit assignment.

When Not To Use

  • If external tools are extremely cheap and tool-call cost is negligible versus model latency.
  • For irreversible, long-horizon control tasks without adapting the value function for delayed rewards.
  • If you cannot afford the extra token overhead from the prompt-based critic and lack a lightweight value model.

Failure Modes

  • Critic inference cost can offset savings when tasks are extremely cheap or trivial.
  • Over-pruning: aggressive budget-driven exploitation may discard rare but correct exploratory paths if critic is miscalibrated.
  • Instruct models hit a reasoning capacity ceiling; BAVT cannot overcome base-model limits once evidence volume exceeds model synthesis ability.

Core Entities

Models

  • GPT-OSS-20B
  • Qwen3-30B-A3B-Instruct-2507
  • E5 (dense retriever)

Metrics

  • Exact Match (EM)
  • F1

Datasets

  • HotpotQA
  • 2WikiMultihopQA
  • MuSiQue
  • Bamboogle
  • 2018 Wikipedia dump

Benchmarks

  • multi-hop QA (HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle)