Overview
The paper reports consistent gains across multiple benchmarks and includes an ablation against teacher-LLM distillation. The results are empirical and depend on the retriever and the knowledge graph (KG) used, so expect moderate engineering work to reproduce them.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on large teacher models.
Summary TLDR
LPKG creates supervised planning data by grounding abstract query patterns in a knowledge graph (Wikidata15k), verbalizing those instances with an LLM, and using the resulting plan–question pairs to fine-tune a small 'planning' LLM. The fine-tuned planner outputs structured plans that are parsed and executed by a separate retriever + QA LLM pipeline. On several multi-hop and logical QA benchmarks (HotPotQA, 2WikiMQA, MuSiQue, Bamboogle) and a new KG-derived benchmark (CLQA-Wiki), LPKG improves exact-match and precision/recall compared to baselines and to distillation from a teacher LLM. The approach is practical: 9k KG-sourced training examples, LoRA fine-tuning of 7–8B models, and off-the-s
Problem Statement
Small and medium-sized LLMs struggle to decompose complex questions for retrieval-augmented QA. Manual labeling is costly, and teacher-LLM distillation can be inaccurate. How can we cheaply generate supervised planning data and teach planning to smaller models so that retrieval-augmented pipelines work better?
Main Contribution
LPKG: a pipeline that builds supervised planning data from KG patterns, verbalizes them with an LLM, and fine-tunes a dedicated planning LLM.
CLQA-Wiki: a new KG-derived benchmark (1,200 examples) that covers multi-hop, comparison, intersection, and union logic with multi-answer support.
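The data-construction idea can be sketched in a few lines: ground an abstract 2-hop pattern in a KG, then turn each grounded instance into a question and plan pair. This is a minimal sketch under stated assumptions. The toy triples, the relation names, the template wording, and the plan step format are all hypothetical, not the paper's schema, and LPKG verbalizes instances with an LLM rather than the string template used here.

```python
from collections import defaultdict

# Toy KG as (head, relation, tail) triples; a hypothetical stand-in for Wikidata15k.
TRIPLES = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place of birth", "London"),
    ("Titanic", "director", "James Cameron"),
    ("James Cameron", "place of birth", "Kapuskasing"),
]

def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    index = defaultdict(list)
    for head, rel, tail in triples:
        index[head].append((rel, tail))
    return index

def ground_two_hop(index):
    """Enumerate instances of the abstract pattern e1 -r1-> e2 -r2-> e3."""
    for e1, edges in list(index.items()):
        for r1, e2 in edges:
            for r2, e3 in index.get(e2, []):
                yield {"e1": e1, "r1": r1, "e2": e2, "r2": r2, "answer": e3}

def to_training_pair(inst):
    """Template-based verbalization (assumption: the paper uses an LLM here)."""
    question = f"What is the {inst['r2']} of the {inst['r1']} of {inst['e1']}?"
    plan = [
        f"Sub-question 1: What is the {inst['r1']} of {inst['e1']}? -> #1",
        f"Sub-question 2: What is the {inst['r2']} of #1? -> #2",
        "Final answer: #2",
    ]
    return {"question": question, "plan": plan, "answer": inst["answer"]}

pairs = [to_training_pair(i) for i in ground_two_hop(build_index(TRIPLES))]
```

Because the plan is derived from the KG structure instead of being predicted by a teacher model, its decomposition is correct by construction. Only the surface wording of the question needs an LLM.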
Key Findings
Fine-tuned planner (Llama3-8B) raises Exact Match on HotPotQA from 0.211 (ReAct) to 0.376, a gain of 0.165 (Table 2).
KG-sourced planning data outperforms standard teacher-LLM distillation when training the same model.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Exact Match | 0.376 | ReAct 0.211 | +0.165 | HotPotQA | Table 2: LPKG(Llama3) vs ReAct | Table 2 |
| Exact Match | 0.372 | ReAct 0.216 | +0.156 | 2WikiMQA | Table 2: LPKG(Llama3) vs ReAct | Table 2 |
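As a quick check on the Delta column, the snippet below recomputes the absolute gains from the table values and also derives the relative gain over ReAct. The relative figure is not reported in the table and is added here only for illustration. The dictionary layout and the function name are hypothetical.

```python
# Exact Match values copied from the results table (LPKG with Llama3 vs ReAct).
RESULTS = {
    "HotPotQA": {"lpkg": 0.376, "react": 0.211},
    "2WikiMQA": {"lpkg": 0.372, "react": 0.216},
}

def deltas(lpkg, react):
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = round(lpkg - react, 3)
    relative = round(100 * (lpkg - react) / react, 1)
    return absolute, relative

summary = {name: deltas(**scores) for name, scores in RESULTS.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute:.3f} EM ({relative:.1f}% relative to ReAct)")
```

The absolute gains match the table (+0.165 and +0.156), which correspond to relative improvements of roughly 78% and 72% over the ReAct baseline.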
What To Try In 7 Days
Extract 500–1,000 grounded pattern instances from Wikidata and verbalize them with GPT-4 to create planning pairs.
LoRA-fine-tune a 7–8B model as a planner for 2 epochs and keep your QA model separate.
Plug the planner into your existing retriever (Contriever-MS or similar) and use top-5 docs for sub-question answering and set-list outputs for multi-answer tasks.
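The last step above can be sketched as a small plan executor: each sub-question is answered over the top-k retrieved documents, and answers are kept as sets so multi-answer steps can be intersected or united. This is a sketch under stated assumptions, not the paper's implementation. The plan format (`op`, `q`, `a`, `b`), the word-overlap retriever, and the canned QA function are hypothetical stand-ins for the planner output, Contriever-MS, and the QA LLM.

```python
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, k=5):
    """Toy retriever: rank documents by word overlap with the query.
    Hypothetical stand-in for a dense retriever such as Contriever-MS."""
    q = tokens(query)
    return sorted(corpus, key=lambda doc: -len(q & tokens(doc)))[:k]

def execute_plan(plan, corpus, answer_fn, k=5):
    """Run steps in order; '#i' in a sub-question is replaced by step i's answers."""
    results = {}
    for i, step in enumerate(plan, start=1):
        if step["op"] == "ask":
            question = re.sub(
                r"#(\d+)",
                lambda m: ", ".join(sorted(results[int(m.group(1))])),
                step["q"],
            )
            results[i] = set(answer_fn(question, retrieve(question, corpus, k)))
        elif step["op"] == "intersect":
            results[i] = results[step["a"]] & results[step["b"]]
        elif step["op"] == "union":
            results[i] = results[step["a"]] | results[step["b"]]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return results[len(plan)]

CORPUS = [
    "Inception was directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "Titanic was directed by James Cameron.",
]
CANNED = {
    "Who directed Inception?": ["Christopher Nolan"],
    "Where was Christopher Nolan born?": ["London"],
}

def canned_answer(question, docs):
    # A real QA LLM would read `docs`; this stub ignores them.
    return CANNED.get(question, [])

plan = [
    {"op": "ask", "q": "Who directed Inception?"},
    {"op": "ask", "q": "Where was #1 born?"},
]
final = execute_plan(plan, CORPUS, canned_answer)
```

Keeping the planner, retriever, and QA model behind separate functions mirrors the advice to keep the QA model separate: any one of them can be swapped without retraining the others.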
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Risks & Boundaries
Limitations
Training mixed all pattern types uniformly; impact of pattern distribution not explored.
Framework depends on KG coverage and quality; KGs limit question types to defined patterns.
When Not To Use
When your domain cannot be mapped to KG patterns or lacks a suitable knowledge graph.
If you cannot run or tune a retriever (retrieval quality strongly affects results).
Failure Modes
Planner misclassifies question type or emits wrong sub-questions (paper: 13/40 errors).
Retriever fails to return relevant documents (paper: 17/40 errors).

