Overview
Production Readiness
0.7
Novelty Score
0.75
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
If you must run KBQA with limited labeled data, KBQA-o1 lets you use open LLMs and automated MCTS exploration to generate high-quality training pairs and sharply improve accuracy on complex queries.
Summary TLDR
KBQA-o1 wraps a ReAct-style agent (stepwise Thought-Action-Observation) around a knowledge base and uses Monte Carlo Tree Search (MCTS) guided by a policy and a reward model to search logical-form space. With a small supervised seed and MCTS exploration to auto-annotate unlabeled questions, KBQA-o1 substantially improves low-resource KBQA performance (e.g., GrailQA F1 78.5% with Llama-3.1-8B vs prior low-resource 48.5% with GPT-3.5). The method is plug-and-play with open LLMs and uses eight atomic KB query tools to produce executable logical forms.
Problem Statement
KBQA models either (1) generate logical forms end-to-end and miss KB structure and unseen relations, or (2) reason step-by-step but get stuck in local optima or blow up the search space. Both approaches also need large annotated datasets. The paper asks: can an agentic search (ReAct + MCTS) that interacts with the KB and auto-labels high-quality trajectories reduce annotation needs and improve accuracy under low-resource settings?
Main Contribution
KBQA-o1: a ReAct-style agent that interacts step-by-step with the KB using eight atomic query tools to build executable logical forms.
MCTS-driven heuristic exploration guided by a small supervised policy and a reward model to balance exploration and search cost.
An incremental fine-tuning loop: use MCTS to auto-annotate high-reward unlabeled questions, then fine-tune policy and reward models to improve low-resource KBQA.
Extensive low-resource experiments (GrailQA, WebQSP, GraphQ) showing large gains over prior low-resource methods and ablation studies that quantify each component.
Key Findings
KBQA-o1 dramatically improves low-resource GrailQA accuracy versus prior methods.
Each core module matters: removing incremental fine-tuning or the agent/K B feedback sharply reduces performance.
Auto-annotation quality depends on reward threshold; tuning it trades quantity for quality.
Primary error types show where to invest effort to improve the system.
Results
GrailQA F1 (low-resource)
GrailQA EM (low-resource)
WebQSP F1 (low-resource)
GraphQ F1 (low-resource)
Who Should Care
What To Try In 7 Days
Implement a ReAct-style prompt that alternates Thought/Action/Observation for your KB queries.
Train small policy and reward SFT models on a few dozen labeled examples.
Run MCTS rollouts (few hundred per question) using a high exploration weight to auto-annotate unlabeled questions and filter by reward score (tune γ*).
Agent Features
Memory
- Short-term prompt history stores state trajectory as Thought-Action-Observation sequence
- State-to-logical-form conversion keeps cumulative observations
Planning
- MCTS rollouts to expand and score multi-step plans
- LoRA
Tool Use
- Eight atomic KB query tools (Extract entity, Find relation, Merge, Order, Compare, Time constraint,
- SimCSE for semantic matching of generated actions to KB-executable options
Frameworks
- SFT
- SimCSE retrieval for action filtering
Is Agentic
true
Architectures
- ReAct-style agent (Thought-Action-Observation)
- MCTS tree search with policy and reward models
Optimization Features
Token Efficiency
- Small beam sizes and limited rollout counts (6 rollouts during prediction) limit token use
Infra Optimization
- Experiments run on 8x NVIDIA A40 GPUs (48GB) as reported
Model Optimization
- Fine-tune policy and reward models with small supervised seed
- SFT
System Optimization
- Filter policy-generated candidates via SimCSE to only actions executable on KB
- Balance policy score and reward model via reward ratio δ (0.5)
Training Optimization
- SFT
Inference Optimization
- Accuracy
- Beam search to propose candidate actions (beam sizes 1–3)
Reproducibility
Code Urls
Data Urls
- GrailQA (public), WebQSP (public), GraphQ (public), Freebase (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- MCTS adds compute and latency; rollout cost scales with rollouts and depth (O(N·k·L)).
- Scalability to very large KBs or real-time systems is not demonstrated and may need pruning or stricter action filtering.
- Reward model and threshold require dataset-specific tuning (γ* differs per dataset).
- Policy/reward SFT and DoRA details may need adaptation for proprietary LLM pipelines.
When Not To Use
- When you need strict low-latency, real-time single-query responses without heavy search.
- When you already have abundant high-quality labeled logical-form data and prefer simpler end-to-end models.
- When compute budget cannot support MCTS rollouts and repeated fine-tuning.
Failure Modes
- Executable path not discovered (29.8%): agent fails to find any KB-executable path for a question.
- Correct path not discovered (54.7%): exploration finds paths but not the correct one, often due to retrieval or unseen relations.
- Correct path not selected (15.5%): correct path found but reward/model ranking picks a different path.
Core Entities
Models
- Llama-3.1-8B
- Llama-3.3-70B
- Qwen2.5-7B
- Qwen2.5-72B
- Gemma-2-27B
Metrics
- F1
- Exact Match (EM)
Datasets
- GrailQA
- WebQSP
- GraphQ
- Freebase
Context Entities
Models
- GPT-3.5-turbo (baseline comparisons)

