Fine-tune a small planning LLM on KG‑derived plans to improve retrieval-augmented QA

June 20, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper demonstrates consistent gains across multiple benchmarks and includes an ablation against teacher-LLM distillation. The results are empirical and tied to the specific retriever and KG used, so expect moderate engineering work to reproduce them.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on large teacher models.

Summary TLDR

LPKG creates supervised planning data by grounding abstract query patterns in a knowledge graph (Wikidata15k), verbalizing those instances with an LLM, and using the resulting plan–question pairs to fine-tune a small 'planning' LLM. The fine-tuned planner outputs structured plans that are parsed and executed by a separate retriever + QA LLM pipeline. On several multi-hop and logical QA benchmarks (HotPotQA, 2WikiMQA, MuSiQue, Bamboogle) and a new KG-derived benchmark (CLQA-Wiki), LPKG improves exact-match and precision/recall compared to baselines and to distillation from a teacher LLM. The approach is practical: 9k KG-sourced training examples, LoRA fine-tuning of 7–8B models, and off-the-s

Problem Statement

Small to medium LLMs struggle to decompose complex questions for retrieval-augmented QA. Manual labeling is costly and teacher‑LLM distillation can be inaccurate. How can we cheaply generate supervised planning data and teach planning to smaller models so retrieval-augmented pipelines work better?

Main Contribution

LPKG: a pipeline that builds supervised planning data from KG patterns, verbalizes them with an LLM, and fine-tunes a dedicated planning LLM.

CLQA-Wiki: a new KG-derived benchmark (1,200 examples) that covers multi-hop, comparison, intersection, and union logic with multi-answer support.
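The data-construction step of LPKG can be sketched as follows. This is a minimal illustration, not the authors' code: the toy triples stand in for Wikidata15k, the question is produced by a fixed template instead of an LLM verbalizer, and the step labels (`Sub_Question_1`, `[Ans_1]`) are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Toy triples standing in for a KG such as Wikidata15k (illustrative only).
TRIPLES = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place of birth", "London"),
    ("London", "country", "United Kingdom"),
]

def ground_two_hop(triples):
    """Yield every (e1, r1, e2, r2, e3) chain that matches the abstract 2-hop pattern."""
    by_head = defaultdict(list)
    for head, rel, tail in triples:
        by_head[head].append((rel, tail))
    for head, r1, mid in triples:
        for r2, tail in by_head[mid]:
            yield (head, r1, mid, r2, tail)

def to_training_pair(instance):
    """Turn one grounded instance into a (question, plan, answer) record.

    The paper verbalizes instances with an LLM; a template is used here
    only to keep the sketch self-contained.
    """
    e1, r1, _mid, r2, answer = instance
    question = f"What is the {r2} of the {r1} of {e1}?"
    plan = [
        f"Sub_Question_1: What is the {r1} of {e1}?",
        # [Ans_1] is a placeholder filled with the first answer at execution time.
        f"Sub_Question_2: What is the {r2} of [Ans_1]?",
    ]
    return {"question": question, "plan": plan, "answer": answer}

pairs = [to_training_pair(inst) for inst in ground_two_hop(TRIPLES)]
```

Because the plan is derived from the graph structure, the decomposition is correct by construction, which is the property the paper contrasts with teacher-LLM distillation.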

Key Findings

Fine-tuned planner (Llama3-8B) boosts Exact Match on HotPotQA versus ReAct

Numbers: HotPotQA EM: 0.376 vs ReAct 0.211 (+0.165)

Practical Use: Fine-tune a small planning LLM on KG-derived plans to get large accuracy gains on multi-hop QA when using RAG.

Evidence Ref: Table 2

KG-sourced planning data outperforms normal distillation when training the same model

Numbers: Bamboogle EM: LPKG(CodeQwen) = 0.28 vs DLPKG = 0.216 (+0.064)

Practical Use: Prefer constructing plan labels from KGs over teacher-LLM distillation for equal-size training budgets.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Exact Match | 0.376 | ReAct 0.211 | +0.165 | HotPotQA | Table 2: LPKG(Llama3) vs ReAct | Table 2
Exact Match | 0.372 | ReAct 0.216 | +0.156 | 2WikiMQA | Table 2: LPKG(Llama3) vs ReAct | Table 2
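For readers who want to recompute these numbers on their own runs, the following is a minimal sketch of an Exact Match scorer. It assumes SQuAD-style answer normalization (lowercasing, dropping punctuation and English articles); the paper's evaluation script may normalize differently, so treat the function names and rules here as assumptions.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def mean_em(predictions, golds):
    """Average EM over a dataset; golds is a list of gold-answer lists."""
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(predictions)

# The deltas in the table are plain differences between system and baseline EM.
hotpot_delta = round(0.376 - 0.211, 3)
wiki_delta = round(0.372 - 0.216, 3)
```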

What To Try In 7 Days

Extract 500–1,000 grounded pattern instances from Wikidata and verbalize them with GPT-4 to create planning pairs.

LoRA-fine-tune a 7–8B model as a planner for 2 epochs and keep your QA model separate.

Plug the planner into your existing retriever (Contriever-MS or similar) and use top-5 docs for sub-question answering and set-list outputs for multi-answer tasks.
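The retrieval step in the last item can be prototyped before a real retriever is wired in. The sketch below is an assumption-laden stand-in: token overlap replaces Contriever-MS scoring, and `top_k` and `build_qa_prompt` are hypothetical helper names. Only the top-5 cutoff and the list-style answer format come from the recommendations above.

```python
import re

def tokenize(text):
    """Lowercased word set; a crude stand-in for a dense retriever's encoder."""
    return set(re.findall(r"\w+", text.lower()))

def top_k(query, corpus, k=5):
    """Rank documents by token overlap with the query and keep the best k."""
    query_tokens = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,  # sorted() stays stable, so ties keep corpus order
    )
    return ranked[:k]

def build_qa_prompt(sub_question, docs):
    """Format retrieved documents and one sub-question for the QA model."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {sub_question}\n"
        "Answer with a list of entities:"
    )
```

Swapping `top_k` for a call to the production retriever leaves the prompt-building code unchanged, which keeps the pilot cheap to iterate on.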

Agent Features

Memory
retrieval memory (Wikipedia documents as external context)
Planning
supervised planning via KG-derived plan examples; template-based plan structure with placeholders for multi-hop answers
Tool Use
external retriever (Contriever-MS); external QA LLM calls for sub-question answering; set operations (Intersection/Union) executed programmatically
Frameworks
LPKG
Is Agentic

Yes

Architectures
decoupled planning LLM + QA LLM; code-formatted plan output for parsing and execution
Collaboration
separate models specialized for planning and QA to reduce task interference
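The interplay of placeholders, QA calls, and programmatic set operations listed above can be sketched as a small executor. This is an illustrative assumption, not the paper's implementation: the step schema (`name`, `op`, `question`, `args`) and the `answer_fn` callable, which stands in for retrieval plus the QA LLM, are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\[(\w+)\]")

def dedupe(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

def execute_plan(steps, answer_fn):
    """Run plan steps in order; answer_fn(question) must return a list of answers."""
    results = {}
    for step in steps:
        name, op = step["name"], step["op"]
        if op == "ask":
            match = PLACEHOLDER.search(step["question"])
            if match:
                # Multi-hop: ask once per earlier answer, then merge the results.
                answers = []
                for prev in results[match.group(1)]:
                    answers += answer_fn(step["question"].replace(match.group(0), prev))
            else:
                answers = answer_fn(step["question"])
            results[name] = dedupe(answers)
        elif op == "intersection":
            left, right = (results[arg] for arg in step["args"])
            results[name] = [item for item in left if item in right]
        elif op == "union":
            left, right = (results[arg] for arg in step["args"])
            results[name] = dedupe(left + right)
        else:
            raise ValueError(f"unknown op: {op}")
    return results
```

Keeping Intersection and Union in ordinary code means the QA model only ever answers single-hop questions, and the set logic cannot be hallucinated.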

Optimization Features

Token Efficiency
planning outputs are concise structured code to reduce parsing ambiguity
Infra Optimization
fine-tuning done on 4× 80 GB A100 GPUs in ~3 hours for 7–8B models
Model Optimization
LoRA
System Optimization
decoupling planning/QA reduces need to run a single large model for both tasks
Training Optimization
LoRA; mixed pattern training (uniform mix across types)
Inference Optimization
code-formatted plans allow deterministic parsing and stepwise execution
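To illustrate why a code-formatted plan parses deterministically, the sketch below reads a plan with Python's `ast` module without ever executing it. The plan syntax shown (`ask`, `finish`, `Ans_1`) is a hypothetical format invented for this example; the exact format the LPKG planner emits is defined in the paper and repository.

```python
import ast

# Hypothetical planner output; the real LPKG plan format may differ.
PLAN_TEXT = '''
Ans_1 = ask("Who directed Inception?")
Ans_2 = ask("Where was [Ans_1] born?")
Final = finish(Ans_2)
'''

def parse_plan(plan_text):
    """Parse `target = op(args...)` lines into step dicts without running them."""
    steps = []
    for node in ast.parse(plan_text).body:
        is_simple_call = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
        )
        if not is_simple_call:
            raise ValueError(f"unsupported plan statement on line {node.lineno}")
        args = []
        for arg in node.value.args:
            if isinstance(arg, ast.Constant):
                args.append(arg.value)        # literal question string
            elif isinstance(arg, ast.Name):
                args.append({"ref": arg.id})  # reference to an earlier step
            else:
                raise ValueError(f"unsupported argument on line {node.lineno}")
        steps.append(
            {"target": node.targets[0].id, "op": node.value.func.id, "args": args}
        )
    return steps

steps = parse_plan(PLAN_TEXT)
```

Anything that is not a simple assignment of a call raises an error, so a malformed or unexpected planner output fails loudly instead of being silently misread.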

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Wikidata15k (subset of Wikidata): used for grounding patterns

CLQA-Wiki (constructed test set; referenced in paper and likely in repo)

Risks & Boundaries

Limitations

Training mixed all pattern types uniformly; impact of pattern distribution not explored.

Framework depends on KG coverage and quality; KGs limit question types to defined patterns.
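The uniform mixing mentioned in the first limitation can be sketched as below. The pattern names follow the question types listed for CLQA-Wiki; the function name, the per-pattern count, and the seeding are assumptions made for illustration, not details from the paper.

```python
import random

def uniform_mix(instances_by_pattern, per_pattern, seed=0):
    """Sample the same number of instances from every pattern type, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = []
    for pattern, instances in sorted(instances_by_pattern.items()):
        if len(instances) < per_pattern:
            raise ValueError(f"not enough instances for pattern {pattern!r}")
        mixed.extend((pattern, inst) for inst in rng.sample(instances, per_pattern))
    rng.shuffle(mixed)
    return mixed

# Hypothetical pools of grounded instances, keyed by pattern type.
pools = {
    "2-hop": [f"2hop-{i}" for i in range(10)],
    "comparison": [f"cmp-{i}" for i in range(10)],
    "intersection": [f"int-{i}" for i in range(10)],
    "union": [f"uni-{i}" for i in range(10)],
}
training_mix = uniform_mix(pools, per_pattern=3)
```

Replacing the single `per_pattern` value with per-type counts would be one way to probe the pattern-distribution question the paper leaves open.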

When Not To Use

When your domain cannot be mapped to KG patterns or lacks a suitable knowledge graph.

If you cannot run or tune a retriever (retrieval quality strongly affects results).

Failure Modes

Planner misclassifies question type or emits wrong sub-questions (paper: 13/40 errors).

Retriever fails to return relevant documents (paper: 17/40 errors).

Core Entities

Models

CodeQwen1.5-7B-Chat
Llama3-8B-Instruct
gpt-3.5-turbo-1106 (GPT-3.5 for QA and baselines)

Metrics

Exact Match (EM)
Precision
Recall

Datasets

Wikidata15k (KG subset)
CLQA-Wiki (new, 1,200 examples)
HotPotQA
2WikiMultiHopQA
MuSiQue
Bamboogle

Benchmarks

CLQA-Wiki
HotPotQA
2WikiMQA
MuSiQue
Bamboogle