Fine-tune a small planning LLM on KG‑derived plans to improve retrieval-augmented QA

June 20, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen

Links

Abstract / PDF

Why It Matters For Business

You can make cheaper, smaller LLMs better at multi-step, retrieval-based QA by generating plan labels from an existing knowledge graph and fine-tuning a compact planner; this improves answer accuracy without relying on large teacher models.

Summary TLDR

LPKG creates supervised planning data by grounding abstract query patterns in a knowledge graph (Wikidata15k), verbalizing those instances with an LLM, and using the resulting plan–question pairs to fine-tune a small 'planning' LLM. The fine-tuned planner outputs structured plans that are parsed and executed by a separate retriever + QA LLM pipeline. On several multi-hop and logical QA benchmarks (HotPotQA, 2WikiMQA, MuSiQue, Bamboogle) and a new KG-derived benchmark (CLQA-Wiki), LPKG improves exact-match and precision/recall compared to baselines and to distillation from a teacher LLM. The approach is practical: 9k KG-sourced training examples, LoRA fine-tuning of 7–8B models, and off-the-s

Problem Statement

Small to medium LLMs struggle to decompose complex questions for retrieval-augmented QA. Manual labeling is costly and teacher‑LLM distillation can be inaccurate. How can we cheaply generate supervised planning data and teach planning to smaller models so retrieval-augmented pipelines work better?

Main Contribution

LPKG: a pipeline that builds supervised planning data from KG patterns, verbalizes them with an LLM, and fine-tunes a dedicated planning LLM.

CLQA-Wiki: a new KG-derived benchmark (1,200 examples) that covers multi-hop, comparison, intersection, and union logic with multi-answer support.

Empirical evidence that KG-sourced planning data improves planning accuracy and downstream QA compared with multiple baselines and with standard teacher-LLM distillation.
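
The data-construction step in the first contribution can be sketched roughly as follows. This is a minimal illustration, assuming a toy triple store and a two-hop pattern. The triples, the `2p` label, the `[ANS_1]` placeholder syntax, and the `ground_two_hop` helper are all hypothetical and are not taken from the paper or from Wikidata15k. In the pipeline described above, an LLM would verbalize the grounded instance into a natural question; here a fixed template stands in for that step.

```python
from collections import defaultdict

# Toy KG as (head, relation, tail) triples; illustrative only, not Wikidata15k.
TRIPLES = [
    ("Inception", "director", "Christopher Nolan"),
    ("Christopher Nolan", "place_of_birth", "London"),
    ("Interstellar", "director", "Christopher Nolan"),
]

def build_index(triples):
    """Map (head, relation) -> sorted list of tails."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return {key: sorted(tails) for key, tails in index.items()}

def ground_two_hop(index, head, r1, r2):
    """Ground the abstract pattern head -r1-> ?x -r2-> ?y into plan records."""
    instances = []
    for mid in index.get((head, r1), []):
        for tail in index.get((mid, r2), []):
            instances.append({
                "pattern": "2p",  # hypothetical label for a two-hop pattern
                "sub_questions": [
                    f"What is the {r1.replace('_', ' ')} of {head}?",
                    f"What is the {r2.replace('_', ' ')} of [ANS_1]?",
                ],
                "intermediate": mid,
                "answer": tail,
            })
    return instances

index = build_index(TRIPLES)
examples = ground_two_hop(index, "Inception", "director", "place_of_birth")
```

Because every sub-question and answer comes straight from KG triples, the plan labels are correct by construction, which is the property the summary contrasts with teacher-LLM distillation.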

Key Findings

Fine-tuned planner (Llama3-8B) boosts Exact Match on HotPotQA versus ReAct

Numbers: HotPotQA EM: 0.376 vs ReAct 0.211 (+0.165)

KG-sourced planning data outperforms normal distillation when training the same model

Numbers: Bamboogle EM: LPKG(CodeQwen) = 0.28 vs DLPKG = 0.216 (+0.064)

Small models gain the most from KG fine-tuning

Numbers: CodeQwen HotPotQA EM: raw 0.11 → fine-tuned 0.338 (+0.228)

LPKG improves precision and recall on the KG-derived CLQA-Wiki benchmark

Numbers: CLQA-Wiki P/R: 0.1112/0.1344 vs ICLPKG(GPT-3.5) 0.0907/0.1014 (+0.0205/+0.0330)

Retrieval remains the dominant failure source in end-to-end runs

Numbers: Error categories (40 samples): retrieval 17, planning 13, QA LLM 10
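
To make the error breakdown easier to compare, the counts can be turned into shares of the 40 sampled failures. The snippet below only re-expresses the reported counts; the dictionary keys are naming choices made here.

```python
# Error counts as reported for the 40 sampled failures.
errors = {"retrieval": 17, "planning": 13, "qa_llm": 10}

total = sum(errors.values())
shares = {name: count / total for name, count in errors.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# retrieval: 42.5%
# planning: 32.5%
# qa_llm: 25.0%
```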

Results

Exact Match

Value: 0.376

Baseline: ReAct 0.211

Exact Match

Value: 0.372

Baseline: ReAct 0.216

Exact Match

Value: 0.28

Baseline: DLPKG(CodeQwen) 0.216

Precision

Value: 0.1112

Baseline: ICLPKG(GPT-3.5) 0.0907

Recall

Value: 0.1344

Baseline: ICLPKG(GPT-3.5) 0.1014

Who Should Care

What To Try In 7 Days

Extract 500–1,000 grounded pattern instances from Wikidata and verbalize them with GPT-4 to create planning pairs.

LoRA-fine-tune a 7–8B model as a planner for 2 epochs and keep your QA model separate.

Plug the planner into your existing retriever (Contriever-MS or similar), answer each sub-question from the top-5 retrieved documents, and use set-list outputs for multi-answer tasks.
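
The retrieval step in the last item can be sketched as below. This is a toy sketch under stated assumptions: the token-overlap scorer is a stand-in for a dense retriever such as Contriever, not its actual API, and `answer_fn` is a hypothetical callable representing the QA LLM call. The documents and function names are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(query, docs, k=5):
    """Rank docs by token overlap with the query (toy stand-in for a dense retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokenize(d)), reverse=True)
    return ranked[:k]

def answer_sub_question(question, docs, answer_fn, k=5):
    """Retrieve k docs, then pass question and context to a QA callable (hypothetical)."""
    context = top_k(question, docs, k=k)
    return answer_fn(question, context)

DOCS = [
    "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "The Eiffel Tower is in Paris.",
]

def stub_answer_fn(question, context):
    # Placeholder for a QA LLM call: simply returns the top-ranked passage.
    return context[0]

result = answer_sub_question("Who directed Inception?", DOCS, stub_answer_fn, k=2)
```

Keeping the retriever and the QA call behind plain function boundaries makes it straightforward to swap either one without retraining the planner.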

Agent Features

Memory

  • retrieval memory (Wikipedia documents as external context)

Planning

  • supervised planning via KG-derived plan examples
  • template-based plan structure with placeholders for multi-hop answers
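
A placeholder-based plan of the kind listed above might be resolved at execution time roughly like this. The `[ANS_i]` syntax and the `fill_placeholders` helper are assumptions made for illustration; the paper's actual plan format may differ.

```python
import re

# Hypothetical placeholder syntax: [ANS_1], [ANS_2], ...
PLACEHOLDER = re.compile(r"\[ANS_(\d+)\]")

def fill_placeholders(template, answers):
    """Replace each [ANS_i] with the answer to sub-question i (1-indexed)."""
    def substitute(match):
        idx = int(match.group(1))
        if idx not in answers:
            raise KeyError(f"no answer yet for sub-question {idx}")
        return answers[idx]
    return PLACEHOLDER.sub(substitute, template)

answers = {1: "Christopher Nolan"}
filled = fill_placeholders("Where was [ANS_1] born?", answers)
# filled == "Where was Christopher Nolan born?"
```

Raising on a missing index surfaces planner errors early, for example a plan that references a sub-question it never defined.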

Tool Use

  • external retriever (Contriever-MS)
  • external QA LLM calls for sub-question answering
  • set operations (Intersection/Union) executed programmatically
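
The programmatic set operations mentioned above could look like the following sketch. The function names and the normalization rule (lowercasing and whitespace collapsing) are assumptions made here, not the paper's implementation.

```python
def normalize(answer):
    """Case-insensitive, whitespace-insensitive comparison key."""
    return " ".join(answer.lower().split())

def intersection(*answer_lists):
    """Answers present in every list, in the order of the first list."""
    if not answer_lists:
        return []
    common = {normalize(a) for a in answer_lists[0]}
    for answers in answer_lists[1:]:
        common &= {normalize(a) for a in answers}
    seen, out = set(), []
    for a in answer_lists[0]:
        key = normalize(a)
        if key in common and key not in seen:
            seen.add(key)
            out.append(a)
    return out

def union(*answer_lists):
    """All distinct answers, in first-seen order."""
    seen, out = set(), []
    for answers in answer_lists:
        for a in answers:
            key = normalize(a)
            if key not in seen:
                seen.add(key)
                out.append(a)
    return out

both = intersection(["Paris", "London"], ["london", "Berlin"])
either = union(["Paris", "London"], ["london", "Berlin"])
```

Running these steps in code rather than asking the QA LLM to combine lists keeps the logical part of a plan deterministic.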

Frameworks

  • LPKG

Is Agentic

true

Architectures

  • decoupled planning LLM + QA LLM
  • code-formatted plan output for parsing and execution

Collaboration

  • separate models specialized for planning and QA to reduce task interference

Optimization Features

Token Efficiency

  • planning outputs are concise structured code to reduce parsing ambiguity

Infra Optimization

  • fine-tuning done on 4× 80 GB A100 GPUs in ~3 hours for 7–8B models

Model Optimization

  • LoRA
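
As a rough illustration of why LoRA keeps fine-tuning cheap: for a frozen weight matrix of shape d_out × d_in, LoRA trains two low-rank factors of shapes d_out × r and r × d_in, so it adds r · (d_out + d_in) trainable parameters. The dimensions and rank below are illustrative values chosen here, not the paper's configuration.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)

def full_params(d_out, d_in):
    """Parameters in the frozen dense weight itself."""
    return d_out * d_in

# Illustrative dimensions only; not taken from the paper.
d = 4096
r = 8

added = lora_params(d, d, r)   # 65536
frozen = full_params(d, d)     # 16777216
ratio = added / frozen         # 0.00390625, i.e. about 0.4% of the dense matrix
```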

System Optimization

  • decoupling planning/QA reduces need to run a single large model for both tasks

Training Optimization

  • LoRA
  • mixed pattern training (uniform mix across types)

Inference Optimization

  • code-formatted plans allow deterministic parsing and stepwise execution
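
One way to get deterministic parsing from a code-formatted plan is to parse it with Python's `ast` module and interpret only a whitelisted subset, rather than calling `eval` on model output. The plan syntax, the tool names `Answer` and `FillIn`, and the lookup table below are hypothetical; this is a sketch of the idea, not the paper's executor.

```python
import ast

def run_plan(plan_src, tools):
    """Execute a plan made only of `name = "string"` and `name = Tool(arg, ...)` lines."""
    env = {}
    for stmt in ast.parse(plan_src).body:
        if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("only simple assignments are allowed")
        target, value = stmt.targets[0].id, stmt.value
        if isinstance(value, ast.Constant) and isinstance(value.value, str):
            env[target] = value.value
        elif (isinstance(value, ast.Call) and isinstance(value.func, ast.Name)
              and value.func.id in tools and not value.keywords):
            args = []
            for arg in value.args:
                if isinstance(arg, ast.Name) and arg.id in env:
                    args.append(env[arg.id])      # reference to an earlier step
                elif isinstance(arg, ast.Constant):
                    args.append(arg.value)        # literal argument
                else:
                    raise ValueError("arguments must be earlier variables or literals")
            env[target] = tools[value.func.id](*args)
        else:
            raise ValueError(f"unsupported statement for {target!r}")
    return env

# Hypothetical plan text, as a planner might emit it.
PLAN = '''
q1 = "Who directed Inception?"
a1 = Answer(q1)
q2 = FillIn("Where was {} born?", a1)
a2 = Answer(q2)
'''

# Stub tools: a lookup table stands in for retrieval plus the QA LLM.
LOOKUP = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}

def answer(question):
    return LOOKUP[question]

def fill_in(template, value):
    return template.format(value)

tools = {"Answer": answer, "FillIn": fill_in}
env = run_plan(PLAN, tools)
```

Anything outside the whitelist, such as an import or an unknown function call, raises `ValueError` instead of executing, which is the main reason to walk the syntax tree instead of evaluating the plan directly.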

Reproducibility

Data Urls

  • Wikidata15k (subset of Wikidata) — used for grounding patterns
  • CLQA-Wiki (constructed test set; referenced in paper and likely in repo)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training mixed all pattern types uniformly; impact of pattern distribution not explored.
  • Framework depends on KG coverage and quality; KGs limit question types to defined patterns.
  • End‑to‑end performance is sensitive to retriever quality (retrieval errors were the largest failure mode).

When Not To Use

  • When your domain cannot be mapped to KG patterns or lacks a suitable knowledge graph.
  • If you cannot run or tune a retriever (retrieval quality strongly affects results).
  • When real-time or low-latency constraints prevent running two-stage planner+QA pipelines.

Failure Modes

  • Planner misclassifies question type or emits wrong sub-questions (paper: 13/40 errors).
  • Retriever fails to return relevant documents (paper: 17/40 errors).
  • QA LLM returns incorrect answers from retrieved context (paper: 10/40 errors).

Core Entities

Models

  • CodeQwen1.5-7B-Chat
  • Llama3-8B-Instruct
  • gpt-3.5-turbo-1106 (GPT-3.5 for QA and baselines)

Metrics

  • Exact Match (EM)
  • Precision
  • Recall
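
For reference, these metrics are commonly computed along the following lines. The normalization here follows the widely used SQuAD-style convention (lowercasing, removing punctuation and English articles); the paper's exact scoring script may differ, so treat this as a sketch.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def set_precision_recall(predicted, gold):
    """Precision and recall over sets of normalized answers (multi-answer questions)."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

em = exact_match("The Eiffel Tower.", "eiffel tower")
p, r = set_precision_recall(["Paris", "Berlin"], ["paris", "Rome", "Madrid"])
```

Exact Match suits single-answer benchmarks such as HotPotQA, while set-based precision and recall fit multi-answer benchmarks such as CLQA-Wiki.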

Datasets

  • Wikidata15k (KG subset)
  • CLQA-Wiki (new, 1,200 examples)
  • HotPotQA
  • 2WikiMultiHopQA
  • MuSiQue
  • Bamboogle

Benchmarks

  • CLQA-Wiki
  • HotPotQA
  • 2WikiMQA
  • MuSiQue
  • Bamboogle