KnowGPT: use RL to pick concise KG facts and a bandit to pick prompt formats for closed‑box LLMs

December 11, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is straightforward to reproduce: RL for concise KG path extraction plus a bandit to pick formats. Experiments on three public datasets back the claims, but real deployment depends on KG quality and API budgets.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Qinggang Zhang, Junnan Dong, Hao Chen, Daochen Zha, Zailiang Yu, Xiao Huang

Links

Abstract / PDF / Data

Why It Matters For Business

KnowGPT upgrades closed‑box LLM accuracy using existing KGs while trimming prompt size and API costs. It lets teams improve domain QA without fine‑tuning large models or owning model weights.

Who Should Care

Summary TLDR

KnowGPT is a practical pipeline to inject structured facts from knowledge graphs (KGs) into closed‑box LLMs via prompts. It uses a reinforcement‑learning agent to extract short, relevant KG paths and a contextual multi‑armed bandit to pick how to present those facts (triples, sentences, or graph descriptions). Built on GPT‑3.5 and tested on CommonsenseQA, OpenBookQA and MedQA, it gives large accuracy gains (e.g., ~92.4% on OpenBookQA test; leaderboard 92.6%) while cutting average prompt size and API cost versus other KG‑prompting methods. Main limits: noisy or incomplete KGs and remaining API cost.
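The three presentation formats named above (triples, sentences, graph descriptions) can be illustrated with a minimal sketch. The templates, function names, and the `build_prompt` wrapper below are assumptions made for illustration; the paper's exact prompt wording is not given in this summary.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def as_triples(path: List[Triple]) -> str:
    """One '(head, relation, tail)' line per fact."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in path)


def as_sentences(path: List[Triple]) -> str:
    """Naive verbalization: underscores in relation names become spaces."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in path)


def as_graph_description(path: List[Triple]) -> str:
    """List the entities first, then the labeled edges between them."""
    nodes: List[str] = []
    for h, _, t in path:
        for n in (h, t):
            if n not in nodes:
                nodes.append(n)
    edges = "; ".join(f"{h} -[{r}]-> {t}" for h, r, t in path)
    return f"Entities: {', '.join(nodes)}. Connections: {edges}."


# Hypothetical registry: one "arm" per prompt format.
FORMATS: Dict[str, Callable[[List[Triple]], str]] = {
    "triples": as_triples,
    "sentences": as_sentences,
    "graph": as_graph_description,
}


def build_prompt(question: str, path: List[Triple], fmt: str) -> str:
    """Prepend the rendered KG facts to the question (assumed layout)."""
    facts = FORMATS[fmt](path)
    return f"Background knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt(question, path, "sentences")` on the same path with each of the three keys makes it easy to compare prompt length and answer quality per format before adding any learned selector.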

Problem Statement

Given a question, a large KG and only API access to an LLM, create a short, factual prompt that improves QA accuracy. Challenges: KGs are huge, API calls cost money and tokens, and hand‑crafted hard prompts are brittle across questions and KG structures.

Main Contribution

Define KG‑based prompting for black‑box LLMs that builds prompts from subgraphs.

P_RL: a deep reinforcement learning policy that extracts concise, context‑relevant KG paths as reasoning background.
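To make the idea of learning short paths concrete, here is a minimal sketch that uses tabular Q-learning on a toy graph. This is a deliberately simplified stand-in, not the paper's deep RL policy: the graph contents, the reward values (+1.0 for reaching the target, -0.1 per extra hop to favor concise paths), and all function names are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (relation, tail)

# Toy KG as adjacency lists; contents are invented for illustration.
KG: Dict[str, List[Edge]] = {
    "fire": [("causes", "heat"), ("related_to", "smoke")],
    "smoke": [("related_to", "air")],
    "air": [("related_to", "heat")],
    "heat": [("causes", "melting")],
}


def train_q(kg, source, target, episodes=500, max_hops=4,
            alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Epsilon-greedy tabular Q-learning over (node, edge) pairs."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        node = source
        for _ in range(max_hops):
            edges = kg.get(node, [])
            if not edges:
                break  # dead end
            if rng.random() < epsilon:
                edge = rng.choice(edges)                       # explore
            else:
                edge = max(edges, key=lambda e: q[(node, e)])  # exploit
            nxt = edge[1]
            done = nxt == target
            reward = 1.0 if done else -0.1  # step penalty rewards short paths
            future = 0.0 if done else max(
                (q[(nxt, e)] for e in kg.get(nxt, [])), default=0.0)
            q[(node, edge)] += alpha * (reward + gamma * future - q[(node, edge)])
            if done:
                break
            node = nxt
    return q


def extract_path(kg, q, source, target, max_hops=4):
    """Follow the highest-valued edge at each step; [] if target is not reached."""
    path, node = [], source
    for _ in range(max_hops):
        edges = kg.get(node, [])
        if not edges:
            return []
        rel, nxt = max(edges, key=lambda e: q[(node, e)])
        path.append((node, rel, nxt))
        if nxt == target:
            return path
        node = nxt
    return []
```

On this toy graph the learned values prefer the two-hop route from `fire` to `melting` over the four-hop detour through `smoke` and `air`, because every extra hop costs reward. A real system would replace the table with a learned policy and add a context-relevance term to the reward.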

Key Findings

KnowGPT raises QA accuracy substantially over baseline LLMs on three datasets

Numbers: Avg +23.7% vs GPT‑3.5; Avg +2.9% vs GPT‑4

Practical Use: If you can call a closed LLM by API, adding concise KG prompts with KnowGPT can materially improve QA accuracy versus zero‑shot LLM use.

Evidence Ref: Main Results, Table 1

OpenBookQA leaderboard performance reaches human‑level

Numbers: 92.6% accuracy on OpenBookQA leaderboard (human 91.7%)

Practical Use: For science‑fact style QA, KnowGPT can push a black‑box LLM to human‑level leaderboard scores without model fine‑tuning.

Evidence Ref: Abstract, Section 4.2 and Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.924 | | | OpenBookQA test | Table 1 main results | Table 1 |
| Accuracy | 0.926 | Human 0.917 | +0.009 | OpenBookQA leaderboard | Section 4.2.1 and Table 2 | Table 2 |

What To Try In 7 Days

Run simple entity linking and 2‑hop subgraph extraction (P_sub) on your domain KG, then feed the result as 'sentence' prompts to your LLM API to check for gains.

Implement an offline RL path sampler on a small KG region to extract short paths, and compare accuracy and token use against the full subgraph.

Train a lightweight multi‑armed bandit (MAB) selector, for example UCB, to pick the prompt format per question, and measure the API cost and accuracy tradeoffs.
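For the format-selection step, a minimal UCB1 sketch is shown below. It is a plain, non-contextual bandit, so it is simpler than the contextual bandit the paper describes; the class name, the arm names, and the choice of reward (for example 1.0 for a correct answer, 0.0 otherwise) are assumptions.

```python
import math
from typing import Dict, List


class UCB1FormatSelector:
    """UCB1 over a fixed set of prompt formats (one arm per format)."""

    def __init__(self, arms: List[str], c: float = 2.0):
        self.arms = list(arms)
        self.c = c  # exploration constant; 2.0 gives the classic UCB1 bonus
        self.counts: Dict[str, int] = {a: 0 for a in self.arms}
        self.means: Dict[str, float] = {a: 0.0 for a in self.arms}

    def select(self) -> str:
        # Play every arm once before using the confidence bound.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())

        def score(a: str) -> float:
            bonus = math.sqrt(self.c * math.log(total) / self.counts[a])
            return self.means[a] + bonus

        return max(self.arms, key=score)

    def update(self, arm: str, reward: float) -> None:
        # Incremental running mean of the observed reward for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a pilot, each question would call `select()`, build the prompt in that format, score the LLM's answer, and pass the score to `update()`. Subtracting a small token-cost term from the reward is one way to fold API spend into the same signal.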

Optimization Features

Token Efficiency: reduces avg tokens to 348 on MedQA vs larger baselines
System Optimization: LoRA
Training Optimization: RL
Inference Optimization: concise path extraction to reduce prompt tokens; MAB to avoid expensive prompt trials at runtime

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

CommonsenseQA, OpenBookQA, MedQA-USMLE, ConceptNet, USMLE KG (mentioned in paper)

Risks & Boundaries

Limitations

Real‑world KGs contain noisy or incorrect triples that can mislead the LLM (Section 5).

RL retrieval fails when KG is sparse or entities have few neighbors; P_sub fallback is needed (C.4).
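The fallback described in the last limitation can be sketched as follows: try the path extractor first, and if it returns nothing, fall back to a breadth-first 2-hop subgraph around the linked entities. The `path_finder` callable, the function names, and the toy graph in the usage note are hypothetical; the summary does not specify how the paper implements P_sub.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]           # (head, relation, tail)
Graph = Dict[str, List[Tuple[str, str]]]  # head -> [(relation, tail)]


def k_hop_subgraph(kg: Graph, seeds: List[str], k: int = 2) -> List[Triple]:
    """Collect every triple whose head lies within k-1 hops of a seed (BFS)."""
    unique_seeds = list(dict.fromkeys(seeds))  # dedupe, keep order
    seen = set(unique_seeds)
    frontier = deque((s, 0) for s in unique_seeds)
    triples: List[Triple] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # hop budget used up; do not expand further
        for rel, tail in kg.get(node, []):
            triples.append((node, rel, tail))
            if tail not in seen:
                seen.add(tail)
                frontier.append((tail, depth + 1))
    return triples


def facts_for_prompt(
    kg: Graph,
    seeds: List[str],
    path_finder: Callable[[List[str]], List[Triple]],
    k: int = 2,
) -> Tuple[List[Triple], str]:
    """Prefer the concise path; fall back to the k-hop subgraph if none is found."""
    path = path_finder(seeds)
    if path:
        return path, "path"
    return k_hop_subgraph(kg, seeds, k), "subgraph"
```

For example, with a toy graph `{"aspirin": [("treats", "headache"), ("is_a", "nsaid")], "nsaid": [("may_cause", "ulcer")]}` and a `path_finder` that returns an empty list, `facts_for_prompt` returns the three triples around `aspirin` tagged as `"subgraph"`. Logging the tag makes it possible to measure how often the fallback fires on a sparse KG.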

When Not To Use

When no reliable KG exists for the domain.

When low latency or zero external API usage is mandatory.

Failure Modes

Noisy KG facts cause confident but wrong LLM outputs.

RL policy cannot find reachable paths in sparse graphs and yields poor prompts.

Core Entities

Models

GPT-3, GPT-3.5 (gpt-3.5-turbo), GPT-4, text-davinci-002, BERT-Base, RoBERTa-large, SapBERT

Metrics

Accuracy

Datasets

CommonsenseQA, OpenBookQA, MedQA-USMLE

Benchmarks

OpenBookQA leaderboard

Context Entities

Models

AristoRoBERTa, ChatGLM, LLaMA, Baichuan variants

Metrics

tokens, API cost

Datasets

IH split of CSQA