KG-Rank: combine a medical knowledge graph with triplet ranking to make long-form medical answers more factual

March 9, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

5

Authors

Rui Yang, Haoran Liu, Edison Marrese-Taylor, Qingcheng Zeng, Yu He Ke, Wanxin Li, Lechao Cheng, Qingyu Chen, James Caverlee, Yutaka Matsuo, Irene Li

Links

Abstract / PDF

Why It Matters For Business

KG-Rank reduces factual errors in long-form domain answers by selecting relevant KG facts before generation. This makes prototypes for clinical documentation, help centers, or domain QA more reliable, but outputs still require clinician review and careful deployment.

Summary TLDR

KG-Rank augments large language models with a medical knowledge graph (UMLS), three triple-ranking strategies (similarity, answer-expansion, MMR), and a re-ranker (MedCPT) to select the most relevant KG triples before generating long answers. It raises ROUGE-L on four medical QA datasets (for example, ExpertQA-Bio: 23.00→27.20, +18.3%) and also transfers to open domains (for example, ExpertQA-Law: 26.33→29.93). The pipeline reduces noise by filtering and re-ordering one-hop KG triples, but it still needs clinician validation, and the ranking steps add compute overhead.

Problem Statement

LLMs can generate fluent but factually inconsistent long answers in medicine. Simply appending raw KG retrieval results adds noise and redundancy. What is needed is a practical way to inject KG facts into LLMs for long-form medical QA while keeping the context small and relevant.

Main Contribution

KG-Rank: a pipeline that extracts one-hop triples from a medical KG (UMLS), ranks and re-ranks them, and feeds top triples to an LLM for long-answer QA.
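The one-hop extraction step can be sketched on a toy graph. This is a minimal illustration, not the authors' code: the `Triple` type, the `TOY_KG` facts, and the helper names `build_index` and `one_hop` are all hypothetical, and a real system would first map question terms to UMLS concepts before looking up neighbors.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Toy stand-in for a medical KG; these facts are illustrative, not drawn from UMLS.
TOY_KG = [
    Triple("metformin", "may_treat", "type 2 diabetes"),
    Triple("metformin", "has_adverse_effect", "lactic acidosis"),
    Triple("type 2 diabetes", "has_finding", "hyperglycemia"),
    Triple("aspirin", "may_treat", "fever"),
]


def build_index(triples):
    """Map each entity to every triple in which it appears as head or tail."""
    index = defaultdict(list)
    for t in triples:
        index[t.head].append(t)
        if t.tail != t.head:
            index[t.tail].append(t)
    return index


def one_hop(entities, index):
    """Collect the de-duplicated one-hop neighborhood of the given entities."""
    seen, result = set(), []
    for entity in entities:
        for t in index.get(entity, []):
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result


index = build_index(TOY_KG)
candidates = one_hop(["metformin"], index)  # the two triples that mention metformin
```

The candidate list produced this way is what the ranking and re-ranking stages would then filter.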

Three triplet ranking strategies (similarity, answer-expansion, MMR) plus a domain-specific re-ranker (MedCPT) to remove irrelevant or redundant KG facts.
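Two of these strategies, similarity ranking and MMR (maximal marginal relevance), can be sketched in a few lines. The sketch below is an assumption-laden illustration: it uses a toy bag-of-words `embed` function in place of a real domain encoder, treats each triple as a plain string, and the weight `lam` is a hypothetical default rather than a value from the paper. Answer-expansion and the MedCPT re-ranker are not shown.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_similarity(question, triples):
    """Order triples by similarity to the question, most similar first."""
    q = embed(question)
    return sorted(triples, key=lambda t: cosine(q, embed(t)), reverse=True)


def mmr(question, triples, k, lam=0.7):
    """Greedy MMR: reward relevance to the question, penalize redundancy
    with triples that were already selected."""
    q = embed(question)
    vecs = {t: embed(t) for t in triples}
    selected, remaining = [], list(dict.fromkeys(triples))
    while remaining and len(selected) < k:
        def score(t):
            redundancy = max(
                (cosine(vecs[t], vecs[s]) for s in selected), default=0.0
            )
            return lam * cosine(q, vecs[t]) - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


question = "what does metformin treat"
triples = [
    "metformin may treat type 2 diabetes",
    "metformin may treat diabetes mellitus type 2",
    "metformin has adverse effect lactic acidosis",
]
by_similarity = rank_by_similarity(question, triples)
diverse = mmr(question, triples, k=2)
```

In this toy example, plain similarity ranks the two near-duplicate "may treat" triples first, while MMR keeps one of them and picks the adverse-effect triple second, which is the redundancy reduction the strategy is meant to provide.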

Empirical validation on four medical QA datasets and four open-domain ExpertQA subsets, showing consistent metric gains and better factuality as rated by an LLM judge (GPT-4).

Key Findings

KG-Rank raised ROUGE-L on ExpertQA-Bio from 23.00 to 27.20.

Numbers: ROUGE-L 23.00 → 27.20 (+18.3%)

KG-Rank improved open-domain ExpertQA-Law ROUGE-L from 26.33 to 29.93.

Numbers: ROUGE-L 26.33 → 29.93 (+13.7%)

A medical re-ranker (MedCPT) consistently beat a general-purpose re-ranker (Cohere) at re-ranking triples.

Numbers: ExpertQA-Med ROUGE-L, Cohere 27.59 → MedCPT 28.08 (+1.8%)

GPT-4, used as a judge, preferred KG-Rank outputs over zero-shot outputs in the majority of comparisons.

Numbers: MedQA judgments: Zero-shot 8, Tie 211, KG-Rank 468 (KG-Rank majority)
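The percentages quoted in these findings are relative gains over the baseline score, and the preference counts can be read as shares of all judgments. A short check of that arithmetic, using only the numbers reported above (the helper name `relative_gain` is introduced here for illustration):

```python
def relative_gain(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100


bio = relative_gain(23.00, 27.20)     # ExpertQA-Bio, zero-shot vs. KG-Rank
law = relative_gain(26.33, 29.93)     # ExpertQA-Law, zero-shot vs. KG-Rank
rerank = relative_gain(27.59, 28.08)  # ExpertQA-Med, Cohere vs. MedCPT

# GPT-4 preference counts, converted to shares of all judgments.
counts = {"zero_shot": 8, "tie": 211, "kg_rank": 468}
total = sum(counts.values())
shares = {name: n / total * 100 for name, n in counts.items()}

print(f"Bio {bio:.1f}%, Law {law:.1f}%, re-ranker {rerank:.1f}%")
print(f"KG-Rank preferred in {shares['kg_rank']:.1f}% of {total} judgments")
```

Rounded to one decimal place, the three gains match the +18.3%, +13.7%, and +1.8% figures above, and KG-Rank is preferred in roughly two thirds of the 687 judgments.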

Results

ROUGE-L (ExpertQA-Bio)

Value: 27.20

Baseline: 23.00

ROUGE-L (ExpertQA-Med)

Value: 28.08

Baseline: 25.45

ROUGE-L

Value: 16.19

Baseline: 14.41

ROUGE-L

Value: 19.44

Baseline: 18.89

Who Should Care

What To Try In 7 Days

Add one-hop KG retrieval (UMLS or a domain KB) and pass only the top-k triples to your LLM prompt.

Implement a cheap similarity re-rank and test MedCPT or a domain re-ranker to prioritize factual triples.

Run a small A/B with clinician or expert review on 100 real queries to measure factual gains and verify safety.
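For the first step, passing only the top-k triples to the model can be as simple as formatting them into the prompt ahead of the question. The sketch below is one possible layout, not the prompt used in the paper: the function name `build_prompt`, the instruction wording, and the default `k` are all assumptions.

```python
def build_prompt(question, ranked_triples, k=5):
    """Format the k highest-ranked (head, relation, tail) triples as context
    ahead of the question."""
    facts = "\n".join(
        f"- {head} {relation.replace('_', ' ')} {tail}"
        for head, relation, tail in ranked_triples[:k]
    )
    return (
        "Use the following knowledge graph facts if they are relevant.\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Triples are assumed to arrive already sorted by a ranker, best first.
ranked = [
    ("metformin", "may_treat", "type 2 diabetes"),
    ("metformin", "has_adverse_effect", "lactic acidosis"),
]
prompt = build_prompt("What does metformin treat?", ranked, k=1)
print(prompt)
```

Keeping `k` as an explicit parameter makes it easy to compare different context sizes in the A/B test.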

Agent Features

Tool Use

  • KG retrieval
  • cross-encoder re-ranking
  • LLM generation

Optimization Features

Token Efficiency

  • input only top-ranked triples to save context tokens

Infra Optimization

  • GPU cluster (4x A100 in experiments); ranking adds compute overhead

System Optimization

  • use domain re-ranker (MedCPT) to reduce irrelevant context

Inference Optimization

  • reduce number of KG triplets input to LLM
  • use re-ranker to limit context size
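One way to combine re-ranking with a hard context limit is to score each (question, triple) pair and keep the best triples until a token budget is used up. The sketch below assumes a pluggable `score_pair` callable; the toy `overlap_score` stands in for a learned cross-encoder such as MedCPT, and the whitespace token count and `max_tokens` default are crude placeholders.

```python
def overlap_score(question, triple_text):
    """Toy scorer: fraction of question words that appear in the triple.
    A real system would call a cross-encoder re-ranker here instead."""
    q_words = set(question.lower().split())
    t_words = set(triple_text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def rerank_within_budget(question, triples, score_pair=overlap_score, max_tokens=12):
    """Sort triples by score, then keep them in rank order until the next
    triple would exceed the token budget."""
    ranked = sorted(triples, key=lambda t: score_pair(question, t), reverse=True)
    kept, used = [], 0
    for t in ranked:
        cost = len(t.split())  # crude whitespace token count
        if used + cost > max_tokens:
            break
        kept.append(t)
        used += cost
    return kept


triples = [
    "aspirin may treat fever",
    "metformin has adverse effect lactic acidosis",
    "metformin may treat type 2 diabetes",
]
kept = rerank_within_budget("metformin adverse effect", triples)
```

Stopping at the first triple that does not fit keeps the selection strictly in rank order; skipping it and trying lower-ranked, shorter triples would use the budget more fully at the cost of admitting less relevant facts.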

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No physician-blinded evaluation reported; authors plan clinician validation later.
  • Ranking adds extra compute and latency; authors note need for efficiency improvements.
  • Performance varies by dataset; LiveQA gains are smaller and less consistent.

When Not To Use

  • For clinical decision-making without clinician oversight.
  • Where ultra-low latency is required and extra ranking overhead is unacceptable.
  • If your domain lacks a reasonably complete knowledge graph.

Failure Modes

  • Retrieving many irrelevant triples if entity mapping is noisy, which can still mislead the LLM.
  • Ranking strategies vary in effectiveness by dataset; no single ranker is always best.
  • KG coverage gaps cause missing evidence for rare or novel clinical scenarios.

Core Entities

Models

  • GPT-4
  • LLaMa2-13b
  • LLaMa2-7b
  • baize-healthcare
  • MedCPT
  • UmlsBERT

Metrics

  • ROUGE-L
  • BERTScore
  • MoverScore
  • BLEURT
  • Accuracy
  • GPT-4 preference counts

Datasets

  • LiveQA
  • ExpertQA-Med
  • ExpertQA-Bio
  • MedicationQA
  • Mintaka
  • ExpertQA (Law, Business, Music, History subsets)

Benchmarks

  • ROUGE-L
  • BERTScore
  • MoverScore
  • BLEURT
  • GPT-4 factuality score