Fine-tune a small planning LLM on KG‑derived plans to improve retrieval-augmented QA

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent gains across multiple benchmarks and includes an ablation against teacher-LLM distillation. The results are empirical and tied to the specific retriever and KG used, so expect moderate engineering work to reproduce them.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on large teacher models.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

LPKG creates supervised planning data by grounding abstract query patterns in a knowledge graph (Wikidata15k), verbalizing those instances with an LLM, and using the resulting plan–question pairs to fine-tune a small 'planning' LLM. The fine-tuned planner outputs structured plans that are parsed and executed by a separate retriever + QA LLM pipeline. On several multi-hop and logical QA benchmarks (HotPotQA, 2WikiMQA, MuSiQue, Bamboogle) and a new KG-derived benchmark (CLQA-Wiki), LPKG improves exact-match and precision/recall compared to baselines and to distillation from a teacher LLM. The approach is practical: 9k KG-sourced training examples, LoRA fine-tuning of 7–8B models, and off-the-s

Problem Statement

Small to medium LLMs struggle to decompose complex questions for retrieval-augmented QA. Manual labeling is costly and teacher‑LLM distillation can be inaccurate. How can we cheaply generate supervised planning data and teach planning to smaller models so retrieval-augmented pipelines work better?

Main Contribution

LPKG: a pipeline that builds supervised planning data from KG patterns, verbalizes them with an LLM, and fine-tunes a dedicated planning LLM.

CLQA-Wiki: a new KG-derived benchmark (1,200 examples) that covers multi-hop, comparison, intersection, and union logic with multi-answer support.
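To make the first step of the LPKG pipeline concrete, here is a minimal sketch of grounding an abstract 2-hop pattern in a knowledge graph and turning each grounded instance into template sub-questions. The toy triples, the `[ANS_1]` placeholder syntax, and the function names are illustrative assumptions, not the paper's actual data or format. The LLM verbalization step that rewrites the templates into a natural complex question is not shown.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]

# Toy triples standing in for a Wikidata15k-style KG (illustrative only).
TRIPLES: List[Triple] = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place of birth", "London"),
    ("Interstellar", "director", "Christopher Nolan"),
]


def ground_two_hop(triples: List[Triple]) -> Iterator[Tuple[str, str, str, str, str]]:
    """Yield every (e1, r1, e2, r2, e3) chain matching the abstract 2-hop pattern."""
    by_head: Dict[str, List[Tuple[str, str]]] = {}
    for head, rel, tail in triples:
        by_head.setdefault(head, []).append((rel, tail))
    for head, r1, mid in triples:
        for r2, tail in by_head.get(mid, []):
            yield (head, r1, mid, r2, tail)


def to_plan(instance: Tuple[str, str, str, str, str]) -> dict:
    """Turn one grounded instance into sub-questions; [ANS_1] is a hypothetical placeholder."""
    e1, r1, _mid, r2, answer = instance
    return {
        "sub_questions": [
            f"What is the {r1} of {e1}?",
            f"What is the {r2} of [ANS_1]?",
        ],
        "answer": answer,
    }


instances = list(ground_two_hop(TRIPLES))
plans = [to_plan(inst) for inst in instances]
```

Because the answer comes directly from the KG chain, each plan label is correct by construction, which is the property that distinguishes this data source from teacher-LLM distillation.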

Key Findings

Fine-tuned planner (Llama3-8B) boosts Exact Match on HotPotQA versus ReAct

Numbers: HotPotQA EM: 0.376 vs ReAct 0.211 (+0.165)

Practical Use: Fine-tune a small planning LLM on KG-derived plans to get large accuracy gains on multi-hop QA when using RAG.

Evidence Ref: Table 2

KG-sourced planning data outperforms normal distillation when training the same model

Numbers: Bamboogle EM: LPKG(CodeQwen)=0.28 vs DLPKG=0.216 (+0.064)

Practical Use: Prefer constructing plan labels from KGs over teacher-LLM distillation for equal-size training budgets.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Exact Match	0.376	ReAct 0.211	+0.165	HotPotQA	Table 2: LPKG(Llama3) vs ReAct	Table 2
Exact Match	0.372	ReAct 0.216	+0.156	2WikiMQA	Table 2: LPKG(Llama3) vs ReAct	Table 2

What To Try In 7 Days

Extract 500–1,000 grounded pattern instances from Wikidata and verbalize them with GPT-4 to create planning pairs.

LoRA-fine-tune a 7–8B model as a planner for 2 epochs and keep your QA model separate.

Plug the planner into your existing retriever (Contriever-MS or similar) and use top-5 docs for sub-question answering and set-list outputs for multi-answer tasks.

Agent Features

Memory

retrieval memory (Wikipedia documents as external context)

Planning

supervised planning via KG-derived plan examples

template-based plan structure with placeholders for multi-hop answers

Tool Use

external retriever (Contriever-MS)

external QA LLM calls for sub-question answering

set operations (Intersection/Union) executed programmatically
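The programmatic set operations can be sketched as plain list functions applied to the answers of two sub-questions. This is an assumed minimal version: the case- and whitespace-insensitive matching and the function names are illustrative choices, not taken from the LPKG code.

```python
from typing import Iterable, List


def _key(answer: str) -> str:
    """Comparison key: lowercase with collapsed whitespace (an assumed normalization)."""
    return " ".join(answer.lower().split())


def intersection(a: Iterable[str], b: Iterable[str]) -> List[str]:
    """Answers from `a` that also appear in `b`, in first-seen order, without duplicates."""
    keys_b = {_key(x) for x in b}
    seen, out = set(), []
    for x in a:
        k = _key(x)
        if k in keys_b and k not in seen:
            seen.add(k)
            out.append(x)
    return out


def union(a: Iterable[str], b: Iterable[str]) -> List[str]:
    """All distinct answers from `a` followed by `b`, in first-seen order."""
    seen, out = set(), []
    for x in list(a) + list(b):
        k = _key(x)
        if k not in seen:
            seen.add(k)
            out.append(x)
    return out


both = intersection(["Paris", "London"], ["london", "Rome"])
either = union(["Paris", "London"], ["london", "Rome"])
```

Running these operations in code rather than asking the QA LLM to combine lists keeps multi-answer outputs deterministic.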

Frameworks

LPKG

Is Agentic

Yes

Architectures

decoupled planning LLM + QA LLM

code-formatted plan output for parsing and execution

Collaboration

separate models specialized for planning and QA to reduce task interference

Optimization Features

Token Efficiency

planning outputs are concise structured code to reduce parsing ambiguity

Infra Optimization

fine-tuning done on 4×80G A100s in ~3 hours for 7–8B models

Model Optimization

LoRA
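As a rough illustration of why LoRA keeps fine-tuning cheap: a rank-r adapter on a weight matrix of shape d_out × d_in adds r × (d_out + d_in) trainable parameters. The sketch below computes that count for one matrix. The 4096 × 4096 shape and rank 16 are illustrative assumptions, not hyperparameters reported in this summary.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative values only (assumed, not from the paper).
D_OUT, D_IN, RANK = 4096, 4096, 16

full = D_OUT * D_IN                          # parameters in the frozen weight matrix
added = lora_param_count(D_OUT, D_IN, RANK)  # parameters actually trained
fraction = added / full

print(f"adapter params: {added:,} ({fraction:.2%} of the full matrix)")
```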

System Optimization

decoupling planning/QA reduces need to run a single large model for both tasks

Training Optimization

LoRA

mixed pattern training (uniform mix across types)

Inference Optimization

code-formatted plans allow deterministic parsing and stepwise execution
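A minimal sketch of what deterministic parsing and stepwise execution can look like is shown below. The plan syntax (`name = answer("...")` lines with `{name}` placeholders), the `execute_plan` helper, and the dictionary-backed QA stub are all hypothetical; the real LPKG plan format and executor may differ. In a real pipeline, `answer_fn` would retrieve documents and call the QA LLM.

```python
import re
from typing import Callable, Dict

# Hypothetical plan grammar: either a QA step or a plain variable alias.
STEP = re.compile(r'^(\w+)\s*=\s*answer\("(.*)"\)$')
ALIAS = re.compile(r'^(\w+)\s*=\s*(\w+)$')


def execute_plan(plan: str, answer_fn: Callable[[str], str]) -> Dict[str, str]:
    """Run plan lines in order, filling {placeholders} from earlier answers."""
    env: Dict[str, str] = {}
    for raw in plan.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        step = STEP.match(line)
        if step:
            name, template = step.groups()
            question = template.format(**env)  # substitute earlier answers
            env[name] = answer_fn(question)
            continue
        alias = ALIAS.match(line)
        if alias:
            name, source = alias.groups()
            env[name] = env[source]
            continue
        raise ValueError(f"unparseable plan line: {line!r}")
    return env


# Stub QA function standing in for retriever + QA LLM (illustrative only).
FAKE_QA = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

PLAN = '''
ans_1 = answer("Who directed Inception?")
ans_2 = answer("Where was {ans_1} born?")
final = ans_2
'''

result = execute_plan(PLAN, FAKE_QA.__getitem__)
```

Rejecting any line that does not match the grammar, rather than guessing, is what makes planner formatting errors visible instead of silently corrupting the final answer.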

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zjukg/LPKG

Data URLs

Wikidata15k (subset of Wikidata), used for grounding patterns

CLQA-Wiki (constructed test set; referenced in paper and likely in repo)

Risks & Boundaries

Limitations

Training mixed all pattern types uniformly; impact of pattern distribution not explored.

Framework depends on KG coverage and quality; KGs limit question types to defined patterns.

When Not To Use

When your domain cannot be mapped to KG patterns or lacks a suitable knowledge graph.

If you cannot run or tune a retriever (retrieval quality strongly affects results).

Failure Modes

Planner misclassifies question type or emits wrong sub-questions (paper: 13/40 errors).

Retriever fails to return relevant documents (paper: 17/40 errors).

Core Entities

Models

CodeQwen1.5-7B-Chat

Llama3-8B-Instruct

gpt-3.5-turbo-1106 (GPT-3.5 for QA and baselines)

Metrics

Exact Match (EM)

Precision

Recall
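For reference, these metrics can be sketched as follows. The normalization follows the common SQuAD-style convention (lowercase, strip punctuation and articles, collapse whitespace); whether the paper's evaluation uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import Iterable, Tuple

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def set_precision_recall(preds: Iterable[str], golds: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall over normalized answer sets, for multi-answer questions."""
    p = {normalize(x) for x in preds}
    g = {normalize(x) for x in golds}
    hits = len(p & g)
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    return precision, recall


em = exact_match("The Eiffel Tower.", "eiffel tower")
precision, recall = set_precision_recall(
    ["Paris", "Rome"], ["paris", "Berlin", "Madrid", "Oslo"]
)
```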

Datasets

Wikidata15k (KG subset)

CLQA-Wiki (new, 1,200 examples)

HotPotQA

2WikiMultiHopQA

MuSiQue

Bamboogle

Benchmarks

CLQA-Wiki

HotPotQA

2WikiMQA

MuSiQue

Bamboogle

