KnowGPT: use RL to pick concise KG facts and a bandit to pick prompt formats for closed‑box LLMs

Overview

Decision Snapshot: Ready For Pilot

The method is straightforward to reproduce: RL for concise KG path extraction plus a bandit to pick formats. Experiments on three public datasets back the claims, but real deployment depends on KG quality and API budgets.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Qinggang Zhang, Junnan Dong, Hao Chen, Daochen Zha, Zailiang Yu, Xiao Huang

Links

Abstract / PDF / Data

Why It Matters For Business

KnowGPT upgrades closed‑box LLM accuracy using existing KGs while trimming prompt size and API costs. It lets teams improve domain QA without fine‑tuning large models or owning model weights.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

KnowGPT is a practical pipeline to inject structured facts from knowledge graphs (KGs) into closed‑box LLMs via prompts. It uses a reinforcement‑learning agent to extract short, relevant KG paths and a contextual multi‑armed bandit to pick how to present those facts (triples, sentences, or graph descriptions). Built on GPT‑3.5 and tested on CommonsenseQA, OpenBookQA and MedQA, it gives large accuracy gains (e.g., ~92.4% on OpenBookQA test; leaderboard 92.6%) while cutting average prompt size and API cost versus other KG‑prompting methods. Main limits: noisy or incomplete KGs and remaining API cost.
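To make the three presentation formats concrete, here is a minimal sketch that renders the same KG facts as triples, sentences, or a graph description. The templates, the function names, and the `build_prompt` wrapper are assumptions made for illustration; the summary above does not specify the exact wording KnowGPT uses.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def as_triples(facts: List[Triple]) -> str:
    """One '(head, relation, tail)' line per fact."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in facts)


def as_sentences(facts: List[Triple]) -> str:
    """Naive verbalisation: underscores in the relation become spaces."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in facts)


def as_graph_description(facts: List[Triple]) -> str:
    """List the nodes first, then the labelled edges."""
    nodes = sorted({n for h, _r, t in facts for n in (h, t)})
    edges = "; ".join(f"{h} -[{r}]-> {t}" for h, r, t in facts)
    return f"Graph with nodes {', '.join(nodes)}. Edges: {edges}."


# Hypothetical registry of formats (these keys are not from the paper).
FORMATS: Dict[str, Callable[[List[Triple]], str]] = {
    "triples": as_triples,
    "sentences": as_sentences,
    "graph": as_graph_description,
}


def build_prompt(question: str, facts: List[Triple], fmt: str) -> str:
    """Prepend the rendered facts to the question (assumed layout)."""
    background = FORMATS[fmt](facts)
    return f"Background knowledge:\n{background}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt("Can a bird fly?", [("bird", "capable_of", "fly")], "sentences")` would place the line `bird capable of fly.` ahead of the question. Real verbalisation templates would need per-relation wording to read naturally.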

Problem Statement

Given a question, a large KG and only API access to an LLM, create a short, factual prompt that improves QA accuracy. Challenges: KGs are huge, API calls cost money and tokens, and hand‑crafted hard prompts are brittle across questions and KG structures.

Main Contribution

Define KG‑based prompting for black‑box LLMs that builds prompts from subgraphs.

P_RL: a deep reinforcement learning policy that extracts concise, context‑relevant KG paths as reasoning background.
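As a rough illustration of building prompts from subgraphs, the sketch below collects the k-hop neighbourhood around a set of linked question entities. It is a plain breadth-first expansion and not the paper's RL policy; `build_index`, `k_hop_subgraph`, and the in-memory triple list are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_index(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Map every entity to the triples in which it appears as head or tail."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for h, r, t in triples:
        index[h].append((h, r, t))
        index[t].append((h, r, t))
    return index


def k_hop_subgraph(
    index: Dict[str, List[Triple]], seeds: Iterable[str], hops: int = 2
) -> List[Triple]:
    """Collect all triples touched while expanding `hops` steps from the seeds."""
    frontier: Set[str] = set(seeds)
    visited: Set[str] = set(frontier)
    seen: Set[Triple] = set()
    collected: List[Triple] = []
    for _hop in range(hops):
        next_frontier: Set[str] = set()
        for node in frontier:
            for triple in index.get(node, []):
                if triple not in seen:
                    seen.add(triple)
                    collected.append(triple)
                h, _rel, t = triple
                for neighbor in (h, t):
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return collected
```

On a large KG this neighbourhood grows quickly with each hop, which is the motivation for replacing it with a learned policy that keeps only a few short paths.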

Key Findings

KnowGPT raises QA accuracy substantially over baseline LLMs on three datasets

Numbers: Avg +23.7% vs GPT‑3.5; Avg +2.9% vs GPT‑4

Practical Use: If you can call a closed LLM by API, adding concise KG prompts with KnowGPT can materially improve QA accuracy versus zero‑shot LLM use.

Evidence Ref: Main Results, Table 1

OpenBookQA leaderboard performance reaches human‑level

Numbers: 92.6% accuracy on OpenBookQA leaderboard (human 91.7%)

Practical Use: For science‑fact style QA, KnowGPT can push a black‑box LLM to near human leaderboard scores without model fine‑tuning.

Evidence Ref: Abstract, Section 4.2 and Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.924	—	—	OpenBookQA test	Table 1 main results	Table 1
Accuracy	0.926	Human 0.917	+0.009	OpenBookQA leaderboard	Section 4.2.1 and Table 2	Table 2

What To Try In 7 Days

Run simple entity linking and 2‑hop subgraph extraction (P_sub) on your domain KG, then feed the result as 'sentence' prompts to your LLM API to check for gains.

Implement an offline RL path sampler on a small KG region to extract short paths, and compare accuracy and token use against the full subgraph.

Train a lightweight multi‑armed bandit (MAB) selector, for example UCB, to pick the prompt format per question, and measure the API cost and accuracy trade‑offs.
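For the format selector in the last step, a minimal UCB1 sketch might look like the following. It is a non-contextual simplification (the paper describes a contextual bandit), and the class name, arm labels, and reward definition are assumptions for illustration only.

```python
import math
from typing import Dict, List


class UCB1FormatSelector:
    """Non-contextual UCB1 over prompt formats (a simplified stand-in)."""

    def __init__(self, arms: List[str], c: float = 2.0) -> None:
        self.arms = list(arms)
        self.c = c  # exploration constant; 2.0 gives the classic UCB1 bonus
        self.counts: Dict[str, int] = {a: 0 for a in self.arms}
        self.means: Dict[str, float] = {a: 0.0 for a in self.arms}

    def select(self) -> str:
        # Play every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def score(arm: str) -> float:
            bonus = math.sqrt(self.c * math.log(total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.arms, key=score)

    def update(self, arm: str, reward: float) -> None:
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

A typical loop calls `select()`, builds the prompt in the chosen format, queries the LLM, and then calls `update()` with a reward such as 1.0 for a correct answer and 0.0 otherwise. Subtracting a small token-based penalty from the reward is one hypothetical way to fold API cost into the same signal.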

Optimization Features

Token Efficiency

Reduces average prompt length to 348 tokens on MedQA, versus longer prompts from baseline methods.

System Optimization

LoRA

Training Optimization

Inference Optimization

Concise path extraction to reduce prompt tokens

MAB to avoid expensive prompt trials at runtime

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

CommonsenseQA, OpenBookQA, MedQA-USMLE, ConceptNet, USMLE KG mentioned in paper

Risks & Boundaries

Limitations

Real‑world KGs contain noisy or incorrect triples that can mislead the LLM (Section 5).

RL retrieval fails when KG is sparse or entities have few neighbors; P_sub fallback is needed (C.4).
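The P_sub fallback noted above can be wired in with a small guard. The sketch below assumes the RL extractor returns a (possibly empty) list of paths and that a precomputed subgraph is available; the function name, the returned tag, and the `max_triples` cap are hypothetical choices, not details from the paper.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)
Path = List[Triple]


def choose_background(
    rl_paths: Sequence[Path],
    subgraph: Sequence[Triple],
    max_triples: int = 10,  # assumed cap to keep the prompt short
) -> Tuple[str, List[Triple]]:
    """Prefer RL-extracted paths; fall back to the subgraph when none exist."""
    facts: List[Triple] = []
    for path in rl_paths:
        for triple in path:
            if triple not in facts:  # drop duplicates, keep first-seen order
                facts.append(triple)
    if facts:
        return "P_RL", facts[:max_triples]
    return "P_sub", list(subgraph)[:max_triples]
```

Logging which branch was taken per question makes it easy to measure how often the KG is too sparse for path extraction.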

When Not To Use

When no reliable KG exists for the domain.

When latency or zero external API usage is mandatory.

Failure Modes

Noisy KG facts cause confident but wrong LLM outputs.

RL policy cannot find reachable paths in sparse graphs and yields poor prompts.

Core Entities

Models

GPT-3, GPT-3.5 (gpt-3.5-turbo), GPT-4, text-davinci-002, BERT-base, RoBERTa-large, SapBERT

Metrics

Accuracy

Datasets

CommonsenseQA, OpenBookQA, MedQA-USMLE

Benchmarks

OpenBookQA leaderboard

Context Entities

Models

AristoRoBERTa, ChatGLM, LLaMA, Baichuan variants

Metrics

tokens, API cost

Datasets

IH split of CSQA
