Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
5
Why It Matters For Business
KG-Rank reduces factual errors in long-form domain answers by selecting relevant KG facts before generation, which makes prototypes for clinical documentation, help centers, or domain QA more reliable. Outputs still require clinician review and careful deployment.
Summary TLDR
KG-Rank augments large language models with a medical knowledge graph (UMLS) and three ranking steps (similarity, answer-expansion, MMR) plus a re-ranker (MedCPT) to select the most relevant KG triples before generating long answers. On four medical QA datasets it raises ROUGE-L (for example, ExpertQA-Bio ROUGE-L 23.00→27.20, +18.3%). It also transfers to open domains (e.g., ExpertQA-Law ROUGE-L 26.33→29.93). The pipeline reduces noise by filtering and re-ordering one-hop KG triples, but it still needs clinician validation, and the ranking steps add compute overhead.
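The relative gain quoted above follows directly from the two ROUGE-L values. A minimal sketch of the arithmetic is shown below; the helper name `relative_gain` is hypothetical, and the ExpertQA-Law percentage is computed here from the quoted scores rather than taken from the source.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# ROUGE-L values quoted in the summary (baseline -> KG-Rank).
bio_gain = relative_gain(23.00, 27.20)
law_gain = relative_gain(26.33, 29.93)

print(f"ExpertQA-Bio: +{bio_gain:.1f}%")  # prints "ExpertQA-Bio: +18.3%"
print(f"ExpertQA-Law: +{law_gain:.1f}%")  # prints "ExpertQA-Law: +13.7%"
```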
Problem Statement
LLMs can generate fluent but factually inconsistent long answers in medicine. Simply appending raw KG retrieval results adds noise and redundancy. We need a practical way to inject KG facts into LLMs for long-form medical QA while keeping the context small and relevant.
Main Contribution
KG-Rank: a pipeline that extracts one-hop triples from a medical KG (UMLS), ranks and re-ranks them, and feeds top triples to an LLM for long-answer QA.
Three triple ranking strategies (similarity, answer-expansion, MMR) plus a domain-specific re-ranker (MedCPT) to remove irrelevant or redundant KG facts.
Empirical validation on four medical QA datasets and four open-domain ExpertQA subsets showing consistent metric gains and better factuality as rated by LLM judges.
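Of the three ranking strategies, MMR (maximal marginal relevance) is the one that explicitly penalizes redundancy. The sketch below shows the standard greedy MMR rule on toy vectors. It is not the authors' implementation: the function names, the `lam` trade-off parameter, and the hand-made 3-d vectors are all assumptions for illustration, and a real pipeline would use embeddings from an encoder.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec: Sequence[float],
               triple_vecs: List[Sequence[float]],
               k: int,
               lam: float = 0.5) -> List[int]:
    """Greedy MMR: return indices of up to k triples.

    Each step picks the candidate maximizing
        lam * sim(query, triple) - (1 - lam) * max sim(triple, already selected).
    """
    relevance = [cosine(query_vec, v) for v in triple_vecs]
    candidates = list(range(len(triple_vecs)))
    selected: List[int] = []

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine(triple_vecs[i], triple_vecs[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance[i] - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors (hypothetical): triples 0 and 1 are near-duplicates,
# triple 2 is slightly less relevant but covers a different aspect.
query = [1.0, 1.0, 0.0]
vecs = [
    [1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.3],
]

diverse = mmr_select(query, vecs, k=2, lam=0.5)         # balances relevance and novelty
relevance_only = mmr_select(query, vecs, k=2, lam=1.0)  # ignores redundancy
print(diverse, relevance_only)
```

With `lam=0.5` the near-duplicate triple is skipped in favor of the one that adds new information; with `lam=1.0` the selection falls back to plain similarity ranking.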
Key Findings
KG-Rank raised ROUGE-L on ExpertQA-Bio from 23.00 to 27.20.
KG-Rank improved open-domain ExpertQA-Law ROUGE-L from 26.33 to 29.93.
A medical re-ranker (MedCPT) consistently beat a general re-ranker (Cohere) in reranking triples.
GPT-4 preferred KG-Rank outputs over zero-shot outputs in the majority of comparisons.
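For readers unfamiliar with the headline metric, ROUGE-L scores a generated answer by the longest common subsequence (LCS) it shares with a reference answer. The sketch below computes an LCS-based F1 on lowercase whitespace tokens. It is a simplified illustration: the evaluation in the source may use a library implementation with different tokenization, stemming, or weighting, and the example sentences are made up.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based F1 between a reference and a candidate answer."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example: LCS is "metformin blood glucose" (3 tokens),
# so precision = 3/4, recall = 3/8, F1 = 0.5.
reference = "metformin lowers blood glucose in type 2 diabetes"
candidate = "metformin reduces blood glucose"
print(rouge_l_f1(reference, candidate))  # prints 0.5
```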
Results
ROUGE-L
Who Should Care
What To Try In 7 Days
Add a one-hop KG retrieval (UMLS or domain KB) and limit to top-k triples before prompting your LLM.
Implement a cheap similarity re-rank and test MedCPT or a domain re-ranker to prioritize factual triples.
Run a small A/B with clinician or expert review on 100 real queries to measure factual gains and verify safety.
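The first two steps above can be prototyped without any model dependencies. The sketch below ranks toy triples against a question with bag-of-words cosine similarity, keeps the top k, and builds a prompt. Everything here is an assumption for illustration: the triples are hand-written rather than retrieved from UMLS, the lexical similarity stands in for an embedding model or a re-ranker such as MedCPT, and the function names and prompt template are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def tokens(text: str) -> Counter:
    """Lowercased bag of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing tokens
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k_triples(question: str, triples: List[Triple], k: int) -> List[Triple]:
    """Return the k triples most lexically similar to the question."""
    q = tokens(question)
    ranked = sorted(triples, key=lambda t: cosine(q, tokens(" ".join(t))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, triples: List[Triple]) -> str:
    """Place the selected facts ahead of the question (hypothetical template)."""
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"


# Hand-written toy triples standing in for one-hop KG retrieval results.
question = "What are the side effects of metformin?"
triples = [
    ("insulin", "is a", "hormone"),
    ("metformin", "treats", "type 2 diabetes"),
    ("aspirin", "has side effect", "stomach bleeding"),
    ("metformin", "has side effect", "nausea"),
]

top = top_k_triples(question, triples, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping `cosine` for a stronger scorer leaves the rest of the flow unchanged, which makes it easy to compare a cheap baseline against a domain re-ranker in the A/B step.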
Agent Features
Tool Use
- KG retrieval
- cross-encoder re-ranking
- LLM generation
Optimization Features
Token Efficiency
- input only top-ranked triples to save context tokens
Infra Optimization
- GPU cluster (4x A100 in experiments); ranking adds compute overhead
System Optimization
- use domain re-ranker (MedCPT) to reduce irrelevant context
Inference Optimization
- reduce the number of KG triples input to the LLM
- use re-ranker to limit context size
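One simple way to cap context size is to walk the ranked triples in order and stop once a token budget is reached. The sketch below assumes the triples are already sorted best-first and approximates token counts by whitespace splitting. A real deployment would count tokens with the target model's tokenizer; the function name and budget value are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def fit_to_budget(ranked_triples: List[Triple], max_tokens: int) -> Tuple[List[Triple], int]:
    """Keep the longest best-first prefix of triples that fits the token budget.

    Token cost is approximated as the number of whitespace-separated words.
    Returns the kept triples and the approximate tokens used.
    """
    kept: List[Triple] = []
    used = 0
    for triple in ranked_triples:
        cost = len(" ".join(triple).split())
        if used + cost > max_tokens:
            break  # stop at the first triple that does not fit, preserving rank order
        kept.append(triple)
        used += cost
    return kept, used


# Toy ranked list (best first), hand-written for illustration.
ranked = [
    ("metformin", "has side effect", "nausea"),   # 5 words
    ("metformin", "treats", "type 2 diabetes"),   # 5 words
    ("metformin", "is a", "biguanide"),           # 4 words
]

kept, used = fit_to_budget(ranked, max_tokens=10)
print(len(kept), used)  # prints "2 10"
```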
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- No physician-blinded evaluation reported; authors plan clinician validation later.
- Ranking adds extra compute and latency; authors note need for efficiency improvements.
- Performance varies by dataset; LiveQA gains are smaller and less consistent.
When Not To Use
- For clinical decision-making without clinician oversight.
- Where ultra-low latency is required and extra ranking overhead is unacceptable.
- If your domain lacks a reasonably complete knowledge graph.
Failure Modes
- Noisy entity mapping can retrieve many irrelevant triples, which can still mislead the LLM.
- Ranking strategies vary in effectiveness by dataset; no single ranker is always best.
- KG coverage gaps cause missing evidence for rare or novel clinical scenarios.
Core Entities
Models
- GPT-4
- LLaMa2-13b
- LLaMa2-7b
- baize-healthcare
- MedCPT
- UmlsBERT
Metrics
- ROUGE-L
- BERTScore
- MoverScore
- BLEURT
- Accuracy
- GPT-4 preference counts
Datasets
- LiveQA
- ExpertQA-Med
- ExpertQA-Bio
- MedicationQA
- Mintaka
- ExpertQA (Law, Business, Music, History subsets)
Benchmarks
- ROUGE-L
- BERTScore
- MoverScore
- BLEURT
- GPT-4 factuality score

