Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
KnowGPT upgrades closed‑box LLM accuracy using existing KGs while trimming prompt size and API costs. It lets teams improve domain QA without fine‑tuning large models or owning model weights.
Summary TLDR
KnowGPT is a practical pipeline to inject structured facts from knowledge graphs (KGs) into closed‑box LLMs via prompts. It uses a reinforcement‑learning agent to extract short, relevant KG paths and a contextual multi‑armed bandit to pick how to present those facts (triples, sentences, or graph descriptions). Built on GPT‑3.5 and tested on CommonsenseQA, OpenBookQA and MedQA, it gives large accuracy gains (e.g., ~92.4% on OpenBookQA test; leaderboard 92.6%) while cutting average prompt size and API cost versus other KG‑prompting methods. Main limits: noisy or incomplete KGs and remaining API cost.
Problem Statement
Given a question, a large KG and only API access to an LLM, create a short, factual prompt that improves QA accuracy. Challenges: KGs are huge, API calls cost money and tokens, and hand‑crafted hard prompts are brittle across questions and KG structures.
Main Contribution
Define KG‑based prompting for black‑box LLMs that builds prompts from subgraphs.
P_RL: a deep reinforcement learning policy that extracts concise, context‑relevant KG paths as reasoning background.
A Multi‑Armed Bandit (MAB) selector that learns which extraction method and prompt format (triples/sentences/graph description) works per question.
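The three prompt formats named above (triples, sentences, graph description) can be illustrated with a minimal sketch. The function names, the prompt wording, and the exact rendering of each format are assumptions made for illustration; they are not the templates used by KnowGPT.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def as_triples(path: List[Triple]) -> str:
    """Render each fact as a raw (head, relation, tail) tuple, one per line."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in path)


def as_sentences(path: List[Triple]) -> str:
    """Render each fact as a short sentence (hypothetical wording)."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in path)


def as_graph_description(path: List[Triple]) -> str:
    """Group outgoing edges by head entity (hypothetical description style)."""
    by_head: Dict[str, List[str]] = {}
    for h, r, t in path:
        by_head.setdefault(h, []).append(f"{r.replace('_', ' ')} {t}")
    return " ".join(f"{h}: {'; '.join(edges)}." for h, edges in by_head.items())


# Hypothetical registry mapping a format name to its renderer.
FORMATTERS: Dict[str, Callable[[List[Triple]], str]] = {
    "triples": as_triples,
    "sentences": as_sentences,
    "graph": as_graph_description,
}


def build_prompt(question: str, path: List[Triple], fmt: str) -> str:
    """Prepend the rendered KG facts to the question (assumed prompt layout)."""
    facts = FORMATTERS[fmt](path)
    return f"Background knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"
```

Calling `build_prompt(question, path, "sentences")` returns a string that could be sent to any chat or completion API; the same path rendered as `"triples"` is usually shorter, which is one reason the choice of format affects token cost.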
Key Findings
KnowGPT raises QA accuracy substantially over baseline LLMs on three datasets
OpenBookQA leaderboard performance reaches human‑level
RL extraction (P_RL) outperforms a 2‑hop heuristic subgraph (P_sub)
KnowGPT reduces tokens and API cost versus other KG‑prompting methods
Results
Accuracy
API cost per evaluation (MedQA)
Who Should Care
What To Try In 7 Days
Run simple entity linking and 2‑hop subgraph extraction (P_sub) on your domain KG, then feed the extracted facts as 'sentence' prompts to your LLM API and check for accuracy gains.
Implement an offline RL path sampler on a small KG region to extract short paths, and compare accuracy and token use against the full subgraph.
Train a light MAB or UCB selector to pick the prompt format per question, and measure the API cost and accuracy tradeoffs.
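The first step above, 2‑hop subgraph extraction, can be sketched as a breadth-first walk from the linked entities. This is a minimal illustration, not the P_sub implementation: it assumes the KG is a list of (head, relation, tail) triples, follows outgoing edges only, and assumes entity linking has already produced the seed entities. The names `build_adjacency` and `k_hop_triples` are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def build_adjacency(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head entity for fast neighbor lookup."""
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((r, t))
    return adj


def k_hop_triples(triples: Iterable[Triple], seeds: Iterable[str], k: int = 2) -> List[Triple]:
    """Collect every triple reachable within k hops of the seed entities.

    Assumption: edges are followed in the head -> tail direction only.
    """
    adj = build_adjacency(triples)
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    out: List[Triple] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for r, t in adj.get(node, []):
            out.append((node, r, t))
            if t not in seen:
                seen.add(t)
                frontier.append((t, depth + 1))
    return out
```

On a dense KG the 2‑hop neighborhood can grow quickly, so counting the tokens of the rendered subgraph before sending it gives a useful baseline against which shorter extracted paths can be compared.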
Optimization Features
Token Efficiency
- reduces avg tokens to 348 on MedQA vs larger baselines
System Optimization
- LoRA
Training Optimization
- RL
Inference Optimization
- concise path extraction to reduce prompt tokens
- MAB to avoid expensive prompt trials at runtime
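To make the bandit idea concrete, here is a minimal sketch of a UCB1 selector over prompt formats. It is a simplification: KnowGPT is described as using a contextual bandit that conditions on the question, whereas this sketch ignores context and keeps only a running mean reward per format. The class name, arm names, and reward function are assumptions for illustration.

```python
import math
from typing import Callable, Dict, Iterable


class UCB1Selector:
    """Non-contextual UCB1 bandit over a fixed set of arms (prompt formats)."""

    def __init__(self, arms: Iterable[str]) -> None:
        self.arms = list(arms)
        self.counts: Dict[str, int] = {a: 0 for a in self.arms}
        self.means: Dict[str, float] = {a: 0.0 for a in self.arms}
        self.total = 0

    def select(self) -> str:
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        # Pick the arm with the highest mean reward plus exploration bonus.
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[a]),
        )

    def update(self, arm: str, reward: float) -> None:
        # Incremental update of the running mean reward for this arm.
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_rounds(
    selector: UCB1Selector, reward_fn: Callable[[str], float], rounds: int
) -> Dict[str, int]:
    """Run select/update for a number of rounds and return pull counts."""
    for _ in range(rounds):
        arm = selector.select()
        selector.update(arm, reward_fn(arm))
    return selector.counts
```

In practice `reward_fn` would wrap one LLM call on a training question and return 1.0 for a correct answer and 0.0 otherwise. The selector is trained offline on such rewards so that at inference time each question needs a single prompt rather than one trial per format.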
Reproducibility
Data Urls
- CommonsenseQA
- OpenBookQA
- MedQA-USMLE
- ConceptNet
- USMLE KG mentioned in paper
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Real‑world KGs contain noisy or incorrect triples that can mislead the LLM (Section 5).
- RL retrieval fails when the KG is sparse or entities have few neighbors; a P_sub fallback is needed (C.4).
- The method still requires API calls at non‑zero cost; savings are relative, not free (Table 9).
When Not To Use
- When no reliable KG exists for the domain.
- When low latency or zero external API usage is mandatory.
- When KG noise cannot be filtered or cleaned.
Failure Modes
- Noisy KG facts cause confident but wrong LLM outputs.
- The RL policy cannot find reachable paths in sparse graphs and yields poor prompts.
- The MAB can select a suboptimal template early, hurting some question types.
Core Entities
Models
- GPT-3
- GPT-3.5 (gpt-3.5-turbo)
- GPT-4
- text-davinci-002
- BERT-Base
- RoBERTa-large
- SapBERT
Metrics
- Accuracy
Datasets
- CommonsenseQA
- OpenBookQA
- MedQA-USMLE
Benchmarks
- OpenBookQA leaderboard
Context Entities
Models
- AristoRoBERTa
- ChatGLM
- LLaMA, Baichuan variants
Metrics
- tokens
- API cost
Datasets
- IH split of CSQA

