Use an agent + MCTS to explore a knowledge base, auto-label data, and cut labeled-data needs for KBQA

January 31, 20258 min

Overview

Production Readiness

0.7

Novelty Score

0.75

Cost Impact Score

0.5

Citation Count

1

Authors

Haoran Luo, Haihong E, Yikai Guo, Qika Lin, Xiaobao Wu, Xinyu Mu, Wenhao Liu, Meina Song, Yifan Zhu, Luu Anh Tuan

Links

Abstract / PDF

Why It Matters For Business

If you must run KBQA with limited labeled data, KBQA-o1 lets you use open LLMs and automated MCTS exploration to generate high-quality training pairs and sharply improve accuracy on complex queries.

Summary TLDR

KBQA-o1 wraps a ReAct-style agent (stepwise Thought-Action-Observation) around a knowledge base and uses Monte Carlo Tree Search (MCTS) guided by a policy and a reward model to search logical-form space. With a small supervised seed and MCTS exploration to auto-annotate unlabeled questions, KBQA-o1 substantially improves low-resource KBQA performance (e.g., GrailQA F1 78.5% with Llama-3.1-8B vs prior low-resource 48.5% with GPT-3.5). The method is plug-and-play with open LLMs and uses eight atomic KB query tools to produce executable logical forms.

Problem Statement

KBQA models either (1) generate logical forms end-to-end and miss KB structure and unseen relations, or (2) reason step-by-step but get stuck in local optima or blow up the search space. Both approaches also need large annotated datasets. The paper asks: can an agentic search (ReAct + MCTS) that interacts with the KB and auto-labels high-quality trajectories reduce annotation needs and improve accuracy under low-resource settings?

Main Contribution

KBQA-o1: a ReAct-style agent that interacts step-by-step with the KB using eight atomic query tools to build executable logical forms.

MCTS-driven heuristic exploration guided by a small supervised policy and a reward model to balance exploration and search cost.

An incremental fine-tuning loop: use MCTS to auto-annotate high-reward unlabeled questions, then fine-tune policy and reward models to improve low-resource KBQA.

Extensive low-resource experiments (GrailQA, WebQSP, GraphQ) showing large gains over prior low-resource methods and ablation studies that quantify each component.

Key Findings

KBQA-o1 dramatically improves low-resource GrailQA accuracy versus prior methods.

NumbersGrailQA F1 78.5% (Llama-3.1-8B) vs 48.5% prior low-resource best (GPT-3.5-turbo)

Each core module matters: removing incremental fine-tuning or the agent/K B feedback sharply reduces performance.

NumbersFull GrailQA F1 78.5% → w/o incremental fine-tuning F1 63.9% (≈14.6 pp drop)

Auto-annotation quality depends on reward threshold; tuning it trades quantity for quality.

NumbersGrailQA best F1 found with reward threshold γ* = -100 (experiment sweep -200..100)

Primary error types show where to invest effort to improve the system.

NumbersCorrect path not discovered 54.7%; executable path not discovered 29.8%; correct path not selected 15.5%

Results

GrailQA F1 (low-resource)

Value78.5% ±1.0 (Llama-3.1-8B, 40-shot)

Baseline48.5% (prior low-resource best, GPT-3.5-turbo)

GrailQA EM (low-resource)

Value77.8% ±0.5 (Llama-3.1-8B, 40-shot)

Baseline≈49.7% prior (implied from prior low-resource results)

WebQSP F1 (low-resource)

Value59.8% ±1.2 (Llama-3.1-8B, 100-shot)

Baseline52.6% (KB-BINDER, GPT-3.5-turbo)

GraphQ F1 (low-resource)

Value48.7% ±0.8 (Llama-3.1-8B, 100-shot)

Baseline31.1% (KB-Coder, GPT-3.5-turbo)

Who Should Care

What To Try In 7 Days

Implement a ReAct-style prompt that alternates Thought/Action/Observation for your KB queries.

Train small policy and reward SFT models on a few dozen labeled examples.

Run MCTS rollouts (few hundred per question) using a high exploration weight to auto-annotate unlabeled questions and filter by reward score (tune γ*).

Agent Features

Memory

  • Short-term prompt history stores state trajectory as Thought-Action-Observation sequence
  • State-to-logical-form conversion keeps cumulative observations

Planning

  • MCTS rollouts to expand and score multi-step plans
  • LoRA

Tool Use

  • Eight atomic KB query tools (Extract entity, Find relation, Merge, Order, Compare, Time constraint,
  • SimCSE for semantic matching of generated actions to KB-executable options

Frameworks

  • SFT
  • SimCSE retrieval for action filtering

Is Agentic

true

Architectures

  • ReAct-style agent (Thought-Action-Observation)
  • MCTS tree search with policy and reward models

Optimization Features

Token Efficiency

  • Small beam sizes and limited rollout counts (6 rollouts during prediction) limit token use

Infra Optimization

  • Experiments run on 8x NVIDIA A40 GPUs (48GB) as reported

Model Optimization

  • Fine-tune policy and reward models with small supervised seed
  • SFT

System Optimization

  • Filter policy-generated candidates via SimCSE to only actions executable on KB
  • Balance policy score and reward model via reward ratio δ (0.5)

Training Optimization

  • SFT

Inference Optimization

  • Accuracy
  • Beam search to propose candidate actions (beam sizes 1–3)

Reproducibility

Data Urls

  • GrailQA (public), WebQSP (public), GraphQ (public), Freebase (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MCTS adds compute and latency; rollout cost scales with rollouts and depth (O(N·k·L)).
  • Scalability to very large KBs or real-time systems is not demonstrated and may need pruning or stricter action filtering.
  • Reward model and threshold require dataset-specific tuning (γ* differs per dataset).
  • Policy/reward SFT and DoRA details may need adaptation for proprietary LLM pipelines.

When Not To Use

  • When you need strict low-latency, real-time single-query responses without heavy search.
  • When you already have abundant high-quality labeled logical-form data and prefer simpler end-to-end models.
  • When compute budget cannot support MCTS rollouts and repeated fine-tuning.

Failure Modes

  • Executable path not discovered (29.8%): agent fails to find any KB-executable path for a question.
  • Correct path not discovered (54.7%): exploration finds paths but not the correct one, often due to retrieval or unseen relations.
  • Correct path not selected (15.5%): correct path found but reward/model ranking picks a different path.

Core Entities

Models

  • Llama-3.1-8B
  • Llama-3.3-70B
  • Qwen2.5-7B
  • Qwen2.5-72B
  • Gemma-2-27B

Metrics

  • F1
  • Exact Match (EM)

Datasets

  • GrailQA
  • WebQSP
  • GraphQ
  • Freebase

Context Entities

Models

  • GPT-3.5-turbo (baseline comparisons)