KnowGPT: use RL to pick concise KG facts and a bandit to pick prompt formats for closed‑box LLMs

December 11, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Qinggang Zhang, Junnan Dong, Hao Chen, Daochen Zha, Zailiang Yu, Xiao Huang

Why It Matters For Business

KnowGPT improves the accuracy of closed‑box LLMs using existing KGs while trimming prompt size and API cost. It lets teams improve domain QA without fine‑tuning large models or owning model weights.

Summary TLDR

KnowGPT is a practical pipeline to inject structured facts from knowledge graphs (KGs) into closed‑box LLMs via prompts. It uses a reinforcement‑learning agent to extract short, relevant KG paths and a contextual multi‑armed bandit to pick how to present those facts (triples, sentences, or graph descriptions). Built on GPT‑3.5 and tested on CommonsenseQA, OpenBookQA and MedQA, it gives large accuracy gains (e.g., ~92.4% on OpenBookQA test; leaderboard 92.6%) while cutting average prompt size and API cost versus other KG‑prompting methods. Main limits: noisy or incomplete KGs and remaining API cost.

Problem Statement

Given a question, a large KG, and only API access to an LLM, create a short, factual prompt that improves QA accuracy. Challenges: KGs are huge, API calls are billed by token, and hand‑crafted hard prompts are brittle across questions and KG structures.

Main Contribution

Define KG‑based prompting for black‑box LLMs that builds prompts from subgraphs.

P_RL: a deep reinforcement learning policy that extracts concise, context‑relevant KG paths as reasoning background.

A Multi‑Armed Bandit (MAB) selector that learns which extraction method and prompt format (triples/sentences/graph description) works per question.
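
The three prompt formats named above can be illustrated with a minimal sketch. The helper names (`as_triples`, `as_sentences`, `as_graph_description`, `build_prompt`) and the template wording are assumptions made for illustration; they are not the paper's actual templates.

```python
# Illustrative only: function names and template wording are hypothetical,
# not taken from the KnowGPT paper.
Triple = tuple[str, str, str]  # (head, relation, tail)


def as_triples(path: list[Triple]) -> str:
    """Render each fact as a raw (head, relation, tail) triple."""
    return "\n".join(f"({h}, {r}, {t})" for h, r, t in path)


def as_sentences(path: list[Triple]) -> str:
    """Render each fact as a short pseudo-sentence."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in path)


def as_graph_description(path: list[Triple]) -> str:
    """Describe the path as a small graph: node list plus labelled edges."""
    nodes = sorted({e for h, _, t in path for e in (h, t)})
    edges = "; ".join(f"{h} -[{r}]-> {t}" for h, r, t in path)
    return f"Graph with nodes {', '.join(nodes)}. Edges: {edges}."


FORMATS = {
    "triples": as_triples,
    "sentences": as_sentences,
    "graph": as_graph_description,
}


def build_prompt(question: str, path: list[Triple], fmt: str) -> str:
    """Prepend the formatted KG facts to the question."""
    facts = FORMATS[fmt](path)
    return f"Background knowledge:\n{facts}\n\nQuestion: {question}\nAnswer:"
```

A selector such as a bandit would then choose the `fmt` key per question; the same facts produce prompts of different length and style depending on that choice.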

Key Findings

KnowGPT raises QA accuracy substantially over baseline LLMs on three datasets

Numbers: Avg +23.7% vs GPT‑3.5; Avg +2.9% vs GPT‑4

OpenBookQA leaderboard performance reaches human‑level

Numbers: 92.6% accuracy on OpenBookQA leaderboard (human 91.7%)

RL extraction (P_RL) outperforms a 2‑hop heuristic subgraph (P_sub)

Numbers: CSQA IH test: P_sub 73.9% → P_RL 80.0% (+6.1 pp); OBQA: 86.5% → 88.9% (+2.4 pp)

KnowGPT reduces tokens and API cost versus other KG‑prompting methods

Numbers: on MedQA, KnowGPT averages 348 tokens at a cost of $6.64 vs CoK at 1129 tokens and $21.54
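
As a quick check on the MedQA figures quoted above, the relative savings against CoK can be computed directly. The numbers are the ones reported in this summary; the helper name `relative_saving` is introduced here for illustration.

```python
# Figures as quoted in this summary (MedQA, KnowGPT vs CoK).
knowgpt = {"tokens": 348, "cost_usd": 6.64}
cok = {"tokens": 1129, "cost_usd": 21.54}


def relative_saving(ours: float, theirs: float) -> float:
    """Fraction saved relative to the baseline (0.0 means no saving)."""
    return 1.0 - ours / theirs


token_saving = relative_saving(knowgpt["tokens"], cok["tokens"])
cost_saving = relative_saving(knowgpt["cost_usd"], cok["cost_usd"])

print(f"token saving vs CoK: {token_saving:.1%}")
print(f"cost saving vs CoK:  {cost_saving:.1%}")
```

Both savings come out at roughly 69%, which is consistent with API cost scaling approximately linearly with prompt tokens.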

Results

Accuracy (OpenBookQA test)

Value: 0.924

Accuracy (OpenBookQA leaderboard)

Value: 0.926

Baseline: Human 0.917

Accuracy

Value: 0.818

Baseline: GPT-3.5 0.710

Accuracy

Value: 0.781

Baseline: GPT-3.5 0.487

API cost per evaluation (MedQA)

Value: $6.64

Baseline: CoK $21.54; GPT‑4 $29.77

Who Should Care

What To Try In 7 Days

Run simple entity linking and 2‑hop subgraph extraction (P_sub) on your domain KG, then feed the extracted facts as 'sentence' prompts to your LLM API and check for accuracy gains.
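
A minimal sketch of this first experiment, assuming the KG is available as a list of (head, relation, tail) triples. The word-match linker and the function names (`link_entities`, `k_hop_subgraph`) are hypothetical simplifications, not the paper's P_sub implementation.

```python
import re
from collections import defaultdict

Triple = tuple[str, str, str]  # (head, relation, tail)


def link_entities(question: str, entities: set[str]) -> list[str]:
    """Naive linker (assumption): keep entities that appear as whole words."""
    words = set(re.findall(r"\w+", question.lower()))
    return sorted(e for e in entities if e.lower() in words)


def k_hop_subgraph(triples: list[Triple], seeds: list[str], k: int = 2) -> list[Triple]:
    """Collect every triple reachable within k hops of the seed entities."""
    index = defaultdict(list)
    for triple in triples:
        head, _, tail = triple
        index[head].append(triple)
        index[tail].append(triple)  # treat edges as undirected for reachability

    frontier, seen = set(seeds), set(seeds)
    kept, kept_set = [], set()
    for _ in range(k):
        next_frontier = set()
        for node in sorted(frontier):  # sorted for a deterministic order
            for triple in index.get(node, []):
                if triple not in kept_set:
                    kept_set.add(triple)
                    kept.append(triple)
                for neighbour in (triple[0], triple[2]):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        next_frontier.add(neighbour)
        frontier = next_frontier
    return kept
```

On a real KG the 2‑hop neighbourhood can be very large, which is the motivation for the shorter RL-extracted paths; this sketch is only meant as the baseline to compare against.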

Implement an offline RL path sampler on a small KG region to extract short paths, then compare accuracy and token use against the full subgraph.
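
One way to prototype this on a toy graph is tabular Q-learning, used here as a simple stand-in for the paper's deep RL policy. The reward shaping (+1 for reaching the target entity, -0.1 per extra hop), the hyperparameters, and the function names are all assumptions chosen for illustration.

```python
import random


def train_path_policy(adj, source, target, episodes=500, max_steps=6,
                      alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning over (node, next_node) pairs on a small graph."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, nbrs in adj.items() for a in nbrs}
    for _ in range(episodes):
        state = source
        for _ in range(max_steps):
            actions = adj.get(state, [])
            if not actions:
                break  # dead end: no outgoing edges
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            # Assumed reward: +1 on reaching the target, small penalty per hop.
            reward = 1.0 if action == target else -0.1
            future = 0.0 if action == target else max(
                (q[(action, a)] for a in adj.get(action, [])), default=0.0)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = action
            if state == target:
                break
    return q


def greedy_path(q, adj, source, target, max_steps=6):
    """Follow the highest-valued edge from source until target or a dead end."""
    path, state = [source], source
    for _ in range(max_steps):
        actions = adj.get(state, [])
        if not actions or state == target:
            break
        state = max(actions, key=lambda a: q[(state, a)])
        path.append(state)
    return path


# Toy graph: two routes from the question entity to the answer entity.
toy_adj = {"q": ["a", "b"], "a": ["c"], "c": ["ans"], "b": ["ans"], "ans": []}
toy_q = train_path_policy(toy_adj, "q", "ans")
print(greedy_path(toy_q, toy_adj, "q", "ans"))
```

Because each extra hop is penalised, the learned policy should prefer the two-hop route through `b` over the three-hop route through `a` and `c`, which is the "short path" behaviour the experiment is meant to test.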

Train a lightweight MAB selector (e.g., UCB) to pick the prompt format per question, and measure the trade-off between API cost and accuracy.
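
A minimal sketch of such a selector using plain UCB1. Note that this version ignores question context, whereas the paper describes a contextual bandit; the class name, the simulation helper, and the success rates are made up for illustration and are not results from the paper.

```python
import math
import random


class UCB1Selector:
    """Non-contextual UCB1 over a fixed set of prompt formats."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.total = 0

    def select(self):
        # Try every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def score(arm):
            bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        # Incremental running mean of observed rewards for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(rates, rounds=2000, seed=0):
    """Simulate answer correctness with hypothetical per-format success rates."""
    rng = random.Random(seed)
    selector = UCB1Selector(rates)
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if rng.random() < rates[arm] else 0.0  # 1.0 = correct answer
        selector.update(arm, reward)
    return selector


# Made-up success rates, for demonstration only.
demo = simulate({"triples": 0.5, "sentences": 0.9, "graph": 0.3})
print(demo.counts)
```

In a real run the reward would come from whether the LLM answered correctly with the chosen format, and each trial costs one API call, so the exploration budget is itself part of the cost being measured.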

Optimization Features

Token Efficiency

  • averages 348 prompt tokens on MedQA, vs 1129 for CoK

System Optimization

  • LoRA

Training Optimization

  • RL

Inference Optimization

  • concise path extraction to reduce prompt tokens
  • MAB to avoid expensive prompt trials at runtime

Reproducibility

Data Urls

  • CommonsenseQA
  • OpenBookQA
  • MedQA-USMLE
  • ConceptNet
  • USMLE KG mentioned in paper

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Real‑world KGs contain noisy or incorrect triples that can mislead the LLM (Section 5).
  • RL retrieval fails when KG is sparse or entities have few neighbors; P_sub fallback is needed (C.4).
  • The method still requires API calls at non‑zero cost; the savings are relative to other methods, not a path to zero cost (Table 9).

When Not To Use

  • When no reliable KG exists for the domain.
  • When latency or zero external API usage is mandatory.
  • When KG noise cannot be filtered or cleaned.

Failure Modes

  • Noisy KG facts cause confident but wrong LLM outputs.
  • RL policy cannot find reachable paths in sparse graphs and yields poor prompts.
  • The MAB can settle on a suboptimal template early, hurting some question types.

Core Entities

Models

  • GPT-3
  • GPT-3.5 (gpt-3.5-turbo)
  • GPT-4
  • text-davinci-002
  • BERT-base
  • RoBERTa-large
  • SapBERT

Metrics

  • Accuracy

Datasets

  • CommonsenseQA
  • OpenBookQA
  • MedQA-USMLE

Benchmarks

  • OpenBookQA leaderboard

Context Entities

Models

  • AristoRoBERTa
  • ChatGLM
  • LLaMA, Baichuan variants

Metrics

  • tokens
  • API cost

Datasets

  • IH split of CSQA