Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on a large teacher model to produce the plans.
Summary TLDR
LPKG creates supervised planning data by grounding abstract query patterns in a knowledge graph (Wikidata15k), verbalizing those instances with an LLM, and using the resulting plan–question pairs to fine-tune a small 'planning' LLM. The fine-tuned planner outputs structured plans that are parsed and executed by a separate retriever + QA LLM pipeline. On several multi-hop and logical QA benchmarks (HotPotQA, 2WikiMQA, MuSiQue, Bamboogle) and a new KG-derived benchmark (CLQA-Wiki), LPKG improves exact-match and precision/recall compared to baselines and to distillation from a teacher LLM. The approach is practical: 9k KG-sourced training examples, LoRA fine-tuning of 7–8B models, and off-the-s
Problem Statement
Small and medium-sized LLMs struggle to decompose complex questions for retrieval-augmented QA. Manual labeling is costly, and teacher-LLM distillation can produce inaccurate plans. How can supervised planning data be generated cheaply so that smaller models learn to plan and retrieval-augmented pipelines work better?
Main Contribution
LPKG: a pipeline that builds supervised planning data from KG patterns, verbalizes them with an LLM, and fine-tunes a dedicated planning LLM.
CLQA-Wiki: a new KG-derived benchmark (1,200 examples) that covers multi-hop, comparison, intersection, and union logic with multi-answer support.
Empirical evidence that KG-sourced planning data improves planning accuracy and downstream QA compared with multiple baselines and with standard teacher-LLM distillation.
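To make the data-construction idea concrete, here is a minimal sketch of grounding one abstract pattern (a 2-hop chain) in a toy knowledge graph and emitting a question/plan pair. The triples, the `[ANS1]` placeholder, and the template-based question are assumptions for illustration only; in LPKG the verbalization step is done by an LLM, not by a string template, and the real plan format is not shown in this summary.

```python
# Toy knowledge graph as (head, relation, tail) triples; all entries are illustrative.
TRIPLES = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place of birth", "London"),
    ("Interstellar", "director", "Christopher Nolan"),
]

def ground_two_hop(triples):
    """Yield every (e1, r1, e2, r2, e3) chain that matches the abstract 2-hop pattern."""
    for h1, r1, t1 in triples:
        for h2, r2, t2 in triples:
            if t1 == h2:
                yield (h1, r1, t1, r2, t2)

def to_training_pair(chain):
    """Turn a grounded chain into a skeleton question plus its plan label.

    The template stands in for LLM verbalization; [ANS1] is a hypothetical
    placeholder for the answer to the first sub-question.
    """
    e1, r1, _e2, r2, e3 = chain
    return {
        "question": f"What is the {r2} of the {r1} of {e1}?",
        "plan": [
            f"What is the {r1} of {e1}?",
            f"What is the {r2} of [ANS1]?",
        ],
        "answer": e3,
    }

pairs = [to_training_pair(chain) for chain in ground_two_hop(TRIPLES)]
```

Because the plan is derived from the graph structure rather than generated freely, the sub-question order and the final answer are correct by construction, which is the property that distinguishes this data from teacher-distilled plans.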
Key Findings
Fine-tuned planner (Llama3-8B) boosts Exact Match on HotPotQA versus ReAct
KG-sourced planning data outperforms standard teacher-LLM distillation when training the same model
Small models gain the most from KG fine-tuning
LPKG improves precision and recall on the KG-derived CLQA-Wiki benchmark
Retrieval remains the dominant failure source in end-to-end runs
Results
Exact Match
Precision
Recall
Who Should Care
What To Try In 7 Days
Extract 500–1,000 grounded pattern instances from Wikidata and verbalize them with GPT-4 to create planning pairs.
LoRA-fine-tune a 7–8B model as a planner for 2 epochs and keep your QA model separate.
Plug the planner into your existing retriever (Contriever-MS or similar), answer each sub-question from its top-5 retrieved documents, and return answers as a set or list for multi-answer tasks.
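The last step can be prototyped before any model is wired in. The sketch below answers each sub-question from its own top-5 documents. The word-overlap retriever and `first_doc_qa` are hypothetical stand-ins (not Contriever-MS and not a real QA LLM), chosen only so the loop runs on its own.

```python
def retrieve(query, corpus, k=5):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def answer_sub_questions(sub_questions, corpus, qa_fn, k=5):
    """Answer each sub-question from its own top-k retrieved documents."""
    return [qa_fn(sq, retrieve(sq, corpus, k)) for sq in sub_questions]

def first_doc_qa(question, docs):
    """Hypothetical stand-in for the QA LLM: returns the best-ranked document."""
    return docs[0] if docs else ""

corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Bananas are yellow",
    "The Seine flows through Paris",
    "Rome is the capital of Italy",
    "Madrid is the capital of Spain",
    "Cats are mammals",
]
answers = answer_sub_questions(["capital of France", "capital of Italy"], corpus, first_doc_qa)
```

Swapping `retrieve` for a dense retriever and `first_doc_qa` for an LLM call leaves the loop unchanged, which makes it easy to measure how much each component contributes.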
Agent Features
Memory
- retrieval memory (Wikipedia documents as external context)
Planning
- supervised planning via KG-derived plan examples
- template-based plan structure with placeholders for multi-hop answers
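As a rough illustration of placeholders in multi-hop plans, the sketch below substitutes earlier answers into later sub-questions before they are asked. The `[ANSn]` marker syntax and the `qa_fn` callback are assumptions made for this example; the summary does not specify the placeholder format LPKG actually uses.

```python
import re

def fill_placeholders(template, answers):
    """Replace each [ANSn] marker with the answer to step n (1-based)."""
    return re.sub(r"\[ANS(\d+)\]", lambda m: answers[int(m.group(1)) - 1], template)

def run_chain(steps, qa_fn):
    """Answer steps in order, feeding earlier answers into later sub-questions."""
    answers = []
    for step in steps:
        answers.append(qa_fn(fill_placeholders(step, answers)))
    return answers

# Hypothetical lookup table standing in for retrieval plus a QA LLM.
lookup = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}
chain_answers = run_chain(["Who directed Inception?", "Where was [ANS1] born?"], lookup.get)
```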
Tool Use
- external retriever (Contriever-MS)
- external QA LLM calls for sub-question answering
- set operations (Intersection/Union) executed programmatically
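Executing the set operations in code rather than asking an LLM to merge answers might look like the following sketch. The function names and the sorted-list return type are illustrative choices, not the paper's API.

```python
def intersection(*answer_sets):
    """Entities present in every branch (for 'both X and Y' questions)."""
    sets = [set(s) for s in answer_sets]
    return sorted(set.intersection(*sets)) if sets else []

def union(*answer_sets):
    """Entities present in at least one branch (for 'X or Y' questions)."""
    return sorted(set().union(*answer_sets))

nolan_films = ["Inception", "Interstellar", "Dunkirk"]
dicaprio_films = ["Inception", "Titanic"]
both = intersection(nolan_films, dicaprio_films)
either = union(nolan_films, dicaprio_films)
```

Keeping this step deterministic means a wrong final answer can be traced to a wrong branch answer rather than to the merge itself.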
Frameworks
- LPKG
Is Agentic
true
Architectures
- decoupled planning LLM + QA LLM
- code-formatted plan output for parsing and execution
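One way to see why code-formatted plans are easy to consume: if the planner emits assignment-style lines, Python's `ast` module can turn them into structured steps without executing anything. The plan text and the `Search` operation name below are hypothetical; only Intersection and Union are named in this summary, and the exact syntax LPKG emits may differ.

```python
import ast

# Hypothetical planner output in an assignment-per-step format.
PLAN = '''
a1 = Search("Which films did Christopher Nolan direct?")
a2 = Search("Which films star Leonardo DiCaprio?")
final = Intersection(a1, a2)
'''

def parse_plan(source):
    """Parse plan lines into (target, operation, args) tuples without running them.

    Only `name = Op(...)` statements with string-literal or variable arguments
    are accepted in this sketch; anything else raises ValueError.
    """
    steps = []
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            raise ValueError("unexpected plan statement")
        call = node.value
        args = []
        for arg in call.args:
            if isinstance(arg, ast.Constant):
                args.append(arg.value)      # literal sub-question text
            elif isinstance(arg, ast.Name):
                args.append(arg.id)         # reference to an earlier step
            else:
                raise ValueError("unsupported argument")
        steps.append((node.targets[0].id, call.func.id, args))
    return steps

steps = parse_plan(PLAN)
```

Parsing instead of calling `eval` keeps untrusted model output from running as code and surfaces malformed plans as explicit errors.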
Collaboration
- separate models specialized for planning and QA to reduce task interference
Optimization Features
Token Efficiency
- planning outputs are concise structured code to reduce parsing ambiguity
Infra Optimization
- fine-tuning a 7–8B model takes ~3 hours on 4× 80 GB A100 GPUs
Model Optimization
- LoRA
System Optimization
- decoupling planning from QA reduces the need to run a single large model for both tasks
Training Optimization
- LoRA
- mixed pattern training (uniform mix across types)
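A uniform mix across pattern types can be sketched as drawing the same number of examples per type and shuffling. The pattern names, counts, and seed below are placeholders, not the paper's actual configuration.

```python
import random

def uniform_mix(examples_by_type, per_type, seed=0):
    """Draw the same number of examples from each pattern type, then shuffle.

    Raises ValueError (from random.sample) if a type has fewer than
    `per_type` examples.
    """
    rng = random.Random(seed)
    mixed = []
    for _pattern, examples in sorted(examples_by_type.items()):
        mixed.extend(rng.sample(examples, per_type))
    rng.shuffle(mixed)
    return mixed

# Hypothetical pools of (pattern_type, example_id) pairs.
pools = {
    "2-hop": [("2-hop", i) for i in range(10)],
    "intersection": [("intersection", i) for i in range(8)],
    "union": [("union", i) for i in range(6)],
}
train_set = uniform_mix(pools, per_type=3)
```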
Inference Optimization
- code-formatted plans allow deterministic parsing and stepwise execution
Reproducibility
Code Urls
Data Urls
- Wikidata15k (subset of Wikidata) — used for grounding patterns
- CLQA-Wiki (constructed test set; referenced in paper and likely in repo)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training mixed all pattern types uniformly; impact of pattern distribution not explored.
- Framework depends on KG coverage and quality; KGs limit question types to defined patterns.
- End-to-end performance is sensitive to retriever quality (retrieval errors were the largest failure mode).
When Not To Use
- When your domain cannot be mapped to KG patterns or lacks a suitable knowledge graph.
- If you cannot run or tune a retriever (retrieval quality strongly affects results).
- When real-time or low-latency constraints prevent running two-stage planner+QA pipelines.
Failure Modes
- Planner misclassifies question type or emits wrong sub-questions (paper: 13/40 errors).
- Retriever fails to return relevant documents (paper: 17/40 errors).
- QA LLM returns incorrect answers from retrieved context (paper: 10/40 errors).
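Converting the counts above into shares shows where effort pays off: of the 40 analyzed errors, retrieval accounts for 42.5%, planning for 32.5%, and QA for 25%. A small sketch of that arithmetic (the dictionary keys are just labels chosen here):

```python
# Error counts as reported in the list above (40 analyzed errors in total).
errors = {"planner": 13, "retriever": 17, "qa": 10}

total = sum(errors.values())
shares = {stage: count / total for stage, count in errors.items()}
largest = max(shares, key=shares.get)
```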
Core Entities
Models
- CodeQwen1.5-7B-Chat
- Llama3-8B-Instruct
- gpt-3.5-turbo-1106 (GPT-3.5 for QA and baselines)
Metrics
- Exact Match (EM)
- Precision
- Recall
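For reference, these metrics can be computed along the following lines. This is a simplified sketch: the normalization only lowercases and collapses whitespace, whereas common EM implementations also strip punctuation and articles, and the paper's exact scoring script is not described in this summary.

```python
def exact_match(prediction, gold):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    def norm(s):
        return " ".join(s.lower().split())
    return float(norm(prediction) == norm(gold))

def precision_recall(predicted, gold):
    """Set-based precision and recall for multi-answer questions."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

em = exact_match("  London ", "london")
p, r = precision_recall(["a", "b", "c"], ["b", "c", "d", "e"])
```

Precision and recall matter for benchmarks with multi-answer support such as CLQA-Wiki, where a prediction can be partially right; EM suits single-answer benchmarks.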
Datasets
- Wikidata15k (KG subset)
- CLQA-Wiki (new, 1,200 examples)
- HotPotQA
- 2WikiMultiHopQA
- MuSiQue
- Bamboogle
Benchmarks
- CLQA-Wiki
- HotPotQA
- 2WikiMQA
- MuSiQue
- Bamboogle

