Use an LLM to break sentences, pull subgraphs, and reason over knowledge graphs

October 17, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.3

Citation Count

3

Authors

Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi

Links

Abstract / PDF

Why It Matters For Business

KG-GPT lets you add structured KG reasoning to LLM pipelines with little labeled data. Use it to prototype fact verification or KGQA systems quickly before investing in custom supervised retrievers.

Summary TLDR

KG-GPT is a three-step framework that uses large language models (LLMs) to do knowledge-graph tasks without full supervision. It first splits a claim or question into sub-sentences, then uses the LLM to pick candidate relations and retrieve a sub-graph, and finally asks the LLM to infer the answer from the linearized triples. In few-shot tests on FACTKG (fact verification) and MetaQA (KGQA), KG-GPT is competitive with many supervised baselines: it scores ~72.7% accuracy on FACTKG and ~96/94/94% Hits@1 on MetaQA 1/2/3-hop. The method is robust across hops but depends strongly on prompt examples and on correct sentence segmentation.

Problem Statement

LLMs excel at free text but are underused for structured reasoning on knowledge graphs. There is no general, few-shot framework that (1) maps natural sentences to KG relations, (2) retrieves an evidence subgraph, and (3) reasons over that subgraph using an auto-regressive LLM.

Main Contribution

KG-GPT: a general three-stage pipeline (Sentence Segmentation, Graph Retrieval, Inference) that uses an LLM end-to-end on KG tasks.

A concrete relation-candidate retrieval method that uses DBpedia and TypeDBpedia to form a set of candidate relations for each sub-sentence.

Empirical evaluation showing competitive few-shot performance on FACTKG (fact verification) and MetaQA (KGQA), plus error analysis and ablations on shots and top-K relations.
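The three stages listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_llm` is a hypothetical rule-based stand-in for a real LLM client so the example runs offline, `TOY_KG` is made-up data, the prompt strings are placeholders, and the topic entity is assumed to be supplied with the claim.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
LLM = Callable[[str], str]

# Toy KG (hypothetical data): head entity -> list of (relation, tail).
TOY_KG: Dict[str, List[Tuple[str, str]]] = {
    "Inception": [("directed_by", "Christopher Nolan"), ("release_year", "2010")],
    "Christopher Nolan": [("birth_place", "London")],
}


def segment(claim: str, call_llm: LLM) -> List[str]:
    """Stage 1: ask the LLM for one sub-sentence per line."""
    reply = call_llm(f"SEGMENT: {claim}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def retrieve(sub_sentence: str, entity: str, call_llm: LLM, top_k: int = 2) -> List[Triple]:
    """Stage 2: the LLM picks up to top_k candidate relations; matching triples are pulled."""
    candidates = sorted({rel for rel, _ in TOY_KG.get(entity, [])})
    reply = call_llm(f"RELATIONS: {sub_sentence} | {', '.join(candidates)}")
    # Keep only relations that really exist for this entity.
    chosen = [r.strip() for r in reply.split(",") if r.strip() in candidates][:top_k]
    return [(entity, rel, tail) for rel, tail in TOY_KG.get(entity, []) if rel in chosen]


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into a text string the LLM can read."""
    return " ".join(f"[{h}, {r}, {t}]" for h, r, t in triples)


def infer(claim: str, triples: List[Triple], call_llm: LLM) -> str:
    """Stage 3: ask the LLM for a verdict given the linearized evidence."""
    return call_llm(f"INFER: {claim} || {linearize(triples)}").strip()


def stub_llm(prompt: str) -> str:
    """Hypothetical rule-based stand-in for an LLM (assumption, for offline use)."""
    if prompt.startswith("SEGMENT:"):
        return "Inception was directed by Christopher Nolan."
    if prompt.startswith("RELATIONS:"):
        return "directed_by"
    expected = "[Inception, directed_by, Christopher Nolan]"
    return "SUPPORTED" if expected in prompt else "REFUTED"


claim = "Inception was directed by Christopher Nolan."
sub_sentences = segment(claim, stub_llm)
evidence = [t for s in sub_sentences for t in retrieve(s, "Inception", stub_llm)]
verdict = infer(claim, evidence, stub_llm)
print(verdict)
```

In a real setup the stub would be replaced by few-shot prompts sent to an actual model, and the candidate relations would come from the KG schema rather than a hard-coded dictionary.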

Key Findings

KG-GPT reaches 72.68% accuracy on FACTKG using evidence retrieval and few-shot prompts.

Numbers: Accuracy 72.68% (KG-GPT) vs 77.65% (GEAR)

KG-GPT outperforms claim-only models and few-shot ChatGPT on FACTKG.

Numbers: Absolute gains of +7.48 percentage points vs BERT and +4.20 percentage points vs 12-shot ChatGPT
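The gains quoted above are differences in accuracy, so they are percentage points rather than relative improvements. A small check, using only the FACTKG accuracy figures reported in this summary (the dictionary keys and helper name are illustrative):

```python
# FACTKG accuracy (percent) as reported in this summary.
factkg_accuracy = {
    "KG-GPT": 72.68,
    "BERT": 65.20,
    "ChatGPT (12-shot)": 68.48,
    "GEAR": 77.65,
}


def gap_in_points(system: str, baseline: str) -> float:
    """Absolute accuracy difference in percentage points."""
    return round(factkg_accuracy[system] - factkg_accuracy[baseline], 2)


gains = {
    name: gap_in_points("KG-GPT", name)
    for name in factkg_accuracy
    if name != "KG-GPT"
}
print(gains)
```

The negative value for GEAR reflects the remaining gap to the fully supervised model.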

High Hits@1 on MetaQA across hops shows robust multi-hop reasoning.

Numbers: Hits@1 = 96.3% (1-hop), 94.4% (2-hop), 94.0% (3-hop)
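For readers unfamiliar with the metric, Hits@1 is commonly computed as the share of questions whose top-ranked prediction is among the gold answers. The sketch below follows that common definition; the function name and the toy data are assumptions, and the paper's exact evaluation script may differ.

```python
from typing import List, Set


def hits_at_1(ranked_predictions: List[List[str]], gold_answers: List[Set[str]]) -> float:
    """Fraction of questions whose first prediction is a gold answer."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align one-to-one")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if preds and preds[0] in gold  # an empty prediction list counts as a miss
    )
    return hits / len(gold_answers)


# Toy example (made-up data): two of three questions are answered correctly.
predictions = [["A", "B"], ["C"], []]
gold = [{"A"}, {"C", "D"}, {"E"}]
score = hits_at_1(predictions, gold)
print(f"Hits@1 = {score:.3f}")
```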

Most errors in multi-hop cases come from sentence segmentation.

Numbers: Sentence Segmentation errors account for 100 of 100 cases in the MetaQA 3-hop error sample

Top-K relation size has little effect on FACTKG but affects MetaQA.

Numbers: FACTKG accuracy stable for k=3/5/10: 72.12% / 72.68% / 72.4%

Results

Accuracy (FACTKG)

Value: 72.68%

Baseline: GEAR 77.65%

Accuracy (FACTKG, other baselines)

Value: BERT 65.20%, BlueBERT 59.93%, Flan-T5 62.70%, ChatGPT 68.48%

MetaQA Hits@1 (1-hop / 2-hop / 3-hop)

Value: 96.3% / 94.4% / 94.0%

Baseline: Selected supervised baselines vary (e.g., KV-Mem, GraftNet, UniKGQA)

Top-K sensitivity (FACTKG k=3/5/10)

Value: 72.12% / 72.68% / 72.4%

Who Should Care

What To Try In 7 Days

Run KG-GPT with your domain KG on a small QA or claim set to see value without labeling.

Build segmentation prompts and test error rates per stage to find the bottleneck.

Compare top-K settings and measure how many triples you retrieve per query.
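The last two steps above amount to simple bookkeeping. A minimal sketch, assuming you hand-label each failed query with the first stage that went wrong and record how many triples were retrieved per query at each top-K setting; all data and field names below are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

# Hypothetical hand-labelled failures: first stage that went wrong per query.
error_log = [
    {"query_id": 1, "stage": "segmentation"},
    {"query_id": 2, "stage": "segmentation"},
    {"query_id": 3, "stage": "retrieval"},
    {"query_id": 4, "stage": "inference"},
]


def stage_error_shares(log: List[dict]) -> Dict[str, float]:
    """Share of failures attributed to each pipeline stage."""
    counts = Counter(entry["stage"] for entry in log)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()}


# Hypothetical retrieval sizes: top-K setting -> triples retrieved per query.
triples_per_query = {3: [4, 6, 5], 5: [7, 9, 8], 10: [15, 12, 18]}


def average_triples(sizes: Dict[int, List[int]]) -> Dict[int, float]:
    """Mean number of retrieved triples for each top-K setting."""
    return {k: mean(v) for k, v in sizes.items()}


shares = stage_error_shares(error_log)
averages = average_triples(triples_per_query)
bottleneck = max(shares, key=shares.get)
print(shares, averages, bottleneck)
```

Tracking the average evidence size alongside accuracy makes it easier to see whether a larger K buys better answers or only longer prompts.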

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Strong dependence on in-context learning; performance changes with number and quality of examples (Sec 4.4.1, Limitations).
  • Few-shot setup lags behind fully supervised KG-specific models (72.68% vs 77.65% on FACTKG).
  • Sentence segmentation is a major failure point for diverse, multi-hop queries (Table 3).

When Not To Use

  • When you have abundant labeled KG training data and can train a specialized supervised retriever like GEAR.
  • When low-latency or low-cost inference is required, since LLM calls are expensive.
  • When sentence structure is highly noisy and segmentation cannot be stabilized with prompts.

Failure Modes

  • Incorrect sentence segmentation leads to wrong relation candidates and downstream errors (especially frequent in multi-hop queries).
  • Missing or noisy KG entries reduce evidence retrieval quality and make inference fail.
  • Over-reliance on a fixed set of in-context examples causes brittle generalization to new styles.

Core Entities

Models

  • ChatGPT
  • Flan-T5
  • BERT
  • BlueBERT
  • GEAR
  • KV-Mem
  • GraftNet
  • EmbedKGQA
  • NSM
  • UniKGQA
  • KG-GPT

Metrics

  • Accuracy
  • Hits@1

Datasets

  • FACTKG
  • MetaQA
  • DBpedia
  • TypeDBpedia

Benchmarks

  • FACTKG
  • MetaQA