Use an LLM to break sentences, pull subgraphs, and reason over knowledge graphs

Overview

Decision Snapshot: Needs Validation

Uses LLM prompts to chain segmentation, relation selection, and inference; works well in few-shot tests but depends on example quality and segmentation accuracy.

Citations: 3

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 50%

Authors

Jiho Kim, Yeonsu Kwon, Yohan Jo, Edward Choi

Links

Abstract / PDF / Code

Why It Matters For Business

KG-GPT lets you add structured KG reasoning to LLM pipelines with little labeled data. Use it to prototype fact verification or KGQA systems quickly before investing in custom supervised retrievers.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

KG-GPT is a three-step framework that uses large language models (LLMs) to do knowledge-graph tasks without full supervision. It first splits a claim or question into sub-sentences, then uses the LLM to pick candidate relations and retrieve a sub-graph, and finally asks the LLM to infer the answer from the linearized triples. In few-shot tests on FACTKG (fact verification) and MetaQA (KGQA), KG-GPT is competitive with many supervised baselines: it scores ~72.7% accuracy on FACTKG and ~96/94/94% Hits@1 on MetaQA 1/2/3-hop. The method is robust across hops but depends strongly on prompt examples and on correct sentence segmentation.
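The three steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the prompt wording, the dictionary layout of `kg`, the pre-identified `entities` list, and every function name here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
# Hypothetical KG layout: head entity -> list of (relation, tail) pairs.
KG = Dict[str, List[Tuple[str, str]]]
LLM = Callable[[str], str]  # any function mapping a prompt to a completion


def segment(llm: LLM, sentence: str) -> List[str]:
    """Step 1: ask the LLM to split the input, one sub-sentence per line."""
    reply = llm(f"Split into sub-sentences, one per line:\n{sentence}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def retrieve(llm: LLM, kg: KG, sub: str, entities: List[str], top_k: int) -> List[Triple]:
    """Step 2: let the LLM pick up to top_k relations, then collect matching triples."""
    mentioned = [e for e in entities if e in sub]
    candidates = sorted({rel for e in mentioned for rel, _ in kg.get(e, [])})
    reply = llm(
        f"Sentence: {sub}\n"
        f"Candidate relations: {', '.join(candidates)}\n"
        f"Pick up to {top_k} relations, comma-separated:"
    )
    # Keep only relations that really are candidates, so a stray LLM token is ignored.
    chosen = [r.strip() for r in reply.split(",") if r.strip() in candidates][:top_k]
    return [(e, rel, tail) for e in mentioned for rel, tail in kg.get(e, []) if rel in chosen]


def infer(llm: LLM, sentence: str, triples: List[Triple]) -> str:
    """Step 3: linearize the triples and ask the LLM for the final answer."""
    evidence = "\n".join(f"[{h}, {r}, {t}]" for h, r, t in triples)
    return llm(f"Evidence:\n{evidence}\nClaim or question: {sentence}\nAnswer:").strip()


def kg_gpt_sketch(llm: LLM, kg: KG, sentence: str, entities: List[str], top_k: int = 2) -> str:
    """Chain segmentation, retrieval, and inference; duplicates are dropped in order."""
    triples: List[Triple] = []
    for sub in segment(llm, sentence):
        for triple in retrieve(llm, kg, sub, entities, top_k):
            if triple not in triples:
                triples.append(triple)
    return infer(llm, sentence, triples)
```

Because `llm` is just a callable, each stage can be tested with a stub before any real model is wired in, which also makes it easier to see which stage a given error comes from.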

Problem Statement

LLMs excel at free text but are underused for structured reasoning on knowledge graphs. There is no general, few-shot framework that (1) maps natural sentences to KG relations, (2) retrieves an evidence subgraph, and (3) reasons over that subgraph using an auto-regressive LLM.

Main Contribution

KG-GPT: a general three-stage pipeline (Sentence Segmentation, Graph Retrieval, Inference) that uses an LLM end-to-end on KG tasks.

A concrete relation-candidate retrieval method that uses DBpedia and TypeDBpedia to form per-sub-sentence relation sets.
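One way to picture the per-sub-sentence relation sets is as a union over the relations attached to each entity or type a sub-sentence mentions. The sketch below assumes two pre-built lookup tables (one standing in for DBpedia, one for TypeDBpedia) and a mapping from sub-sentence to its mentions; those structures and the function name are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Set


def relation_candidates(
    sub_sentences: Dict[str, List[str]],      # sub-sentence -> mentioned entities/types
    entity_relations: Dict[str, Set[str]],    # stand-in for an entity -> relations lookup
    type_relations: Dict[str, Set[str]],      # stand-in for a type -> relations lookup
) -> Dict[str, Set[str]]:
    """Union the relations of every entity or type mentioned in each sub-sentence."""
    result: Dict[str, Set[str]] = {}
    for sub, mentions in sub_sentences.items():
        rels: Set[str] = set()
        for mention in mentions:
            rels |= entity_relations.get(mention, set())
            rels |= type_relations.get(mention, set())
        result[sub] = rels
    return result
```

The resulting sets are what the LLM would choose from in the retrieval step, so their size directly affects prompt length.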

Key Findings

KG-GPT reaches 72.68% accuracy on FACTKG using evidence retrieval and few-shot prompts.

Numbers: Accuracy 72.68% (KG-GPT) vs 77.65% (GEAR)

Practical Use: If you lack labeled KG training data, KG-GPT gives a practical few-shot verifier; expect a gap of about 5 percentage points vs specialized supervised KG models such as GEAR.

Evidence Ref: Table 1; Sec 4.1

KG-GPT outperforms claim-only models and few-shot ChatGPT on FACTKG.

Numbers: Absolute gains of +7.48 pp vs BERT and +4.20 pp vs 12-shot ChatGPT

Practical Use: Adding explicit graph retrieval and evidence to prompts can beat vanilla LLM classification on claims, so build a retrieval stage when verifying facts.

Evidence Ref: Table 1; Sec 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	72.68%	GEAR 77.65%	-4.97 pp	FACTKG	KG-GPT with evidence retrieval; few-shot setting	Table 1; Sec 4.1
Accuracy	BERT 65.20%, BlueBERT 59.93%, Flan-T5 62.70%, ChatGPT 68.48%	—	—	FACTKG	Claim-only models reported for comparison	Table 1; Sec 4.1
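The deltas quoted in this summary are plain differences in percentage points. As a quick check, the snippet below recomputes them from the accuracies in the table; the variable and function names are chosen for this example only.

```python
# FACTKG accuracies (%) as listed in the table above.
KG_GPT = 72.68
BASELINES = {
    "GEAR": 77.65,
    "BERT": 65.20,
    "BlueBERT": 59.93,
    "Flan-T5": 62.70,
    "ChatGPT": 68.48,
}


def delta_pp(score: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(score - baseline, 2)


deltas = {name: delta_pp(KG_GPT, value) for name, value in BASELINES.items()}
for name, delta in deltas.items():
    print(f"KG-GPT vs {name}: {delta:+.2f} pp")
```

This reproduces the -4.97 pp gap to GEAR and the +7.48 pp and +4.20 pp gains over BERT and ChatGPT reported above.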

What To Try In 7 Days

Run KG-GPT with your domain KG on a small QA or claim set to see value without labeling.

Build segmentation prompts and test error rates per stage to find the bottleneck.

Compare top-K settings and measure how many triples you retrieve per query.
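The last two experiments can be tracked with a couple of small helpers. This is a rough sketch under assumptions: `retrieve_fn` stands for whatever retrieval function your own pipeline exposes, and the stage labels are annotations you would assign by hand when reviewing failures.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def triples_per_query(
    retrieve_fn: Callable[[str, int], List[Triple]],  # hypothetical: (query, top_k) -> triples
    queries: Iterable[str],
    top_ks: Iterable[int],
) -> Dict[int, float]:
    """Average number of retrieved triples per query for each top-K setting."""
    queries = list(queries)
    return {k: mean(len(retrieve_fn(q, k)) for q in queries) for k in top_ks}


def first_failure_rates(labels: List[str]) -> Dict[str, float]:
    """Share of examples per label, where each label names the first failing
    stage (for example "segmentation", "retrieval", "inference") or "ok"."""
    counts = Counter(labels)
    return {stage: n / len(labels) for stage, n in counts.items()}
```

A growing triple count with flat accuracy suggests that a larger K is only adding prompt cost, while a high share for one stage label points to where prompt work is most likely to pay off.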

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jiho283/

Risks & Boundaries

Limitations

Strong dependence on in-context learning; performance changes with number and quality of examples (Sec 4.4.1, Limitations).

Few-shot setup lags behind fully supervised KG-specific models (72.68% vs 77.65% on FACTKG).

When Not To Use

When you have abundant labeled KG training data and can train a specialized supervised retriever like GEAR.

When low-latency or low-cost inference is required, since the pipeline makes several LLM calls per query.

Failure Modes

Incorrect sentence segmentation leads to wrong relation candidates and downstream errors (these errors are more frequent on multi-hop inputs).

Missing or noisy KG entries reduce evidence retrieval quality and make inference fail.

Core Entities

Models

ChatGPT, Flan-T5, BERT, BlueBERT, GEAR, KV-Mem, GraftNet, EmbedKGQA, NSM, UniKGQA, KG-GPT

Metrics

Accuracy, Hits@1
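For reference, both metrics are simple ratios. The sketch below uses the standard definitions (accuracy as the share of correct labels, Hits@1 as the share of questions whose top-ranked answer is in the gold set); the function names and input shapes are assumptions for this example, not the paper's evaluation script.

```python
from typing import List, Sequence, Set


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def hits_at_1(ranked_predictions: Sequence[List[str]], gold_answers: Sequence[Set[str]]) -> float:
    """Fraction of questions whose top-ranked prediction is one of the gold answers."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_answers)
        if ranked and ranked[0] in gold
    )
    return hits / len(gold_answers)
```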

Datasets

FACTKG, MetaQA, DBpedia, TypeDBpedia

Benchmarks

FACTKG, MetaQA

