Use UMLS definitions and relations to make LLM answers more factual and complete for medical questions

Overview

Decision Snapshot: Needs Validation

UMLS prompt injection is a practical, low-cost way to add domain facts. Evidence is limited to 104 automatic examples and 20 physician reviews, so expect more validation before deployment.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Rui Yang, Edison Marrese-Taylor, Yuhe Ke, Lechao Cheng, Qingyu Chen, Irene Li

Links

Abstract / PDF / Data

Why It Matters For Business

Injecting curated UMLS content into prompts can raise factuality and completeness without costly model fine-tuning; it is a lower-cost way to make LLM answers safer for medical use, though user readability may require UX work.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

The paper adds structured medical knowledge from the Unified Medical Language System (UMLS) to LLM prompts to improve medical question answering. The authors test LLaMa2-13b-chat and ChatGPT-3.5 on 104 LiveQA questions with automatic metrics (ROUGE, BERTScore) and run a blind physician review on 20 questions. Results: automated scores improve for LLaMa2-13b-chat but not for ChatGPT-3.5. Physicians judged UMLS-augmented ChatGPT-3.5 better on factuality (UMLS better for 40% of questions, tie 30%, worse 30%) and completeness (UMLS better 55%), while unaugmented ChatGPT-3.5 retained a slight edge in readability. Main trade-offs: added domain detail can reduce readability, and irrelevant UMLS relations can add noise.

Problem Statement

LLMs can generate fluent but medically incorrect or biased answers because they lack grounded, structured medical knowledge. Fine-tuning on medical data is costly, and a fine-tuned model goes stale as medical knowledge changes. The paper asks whether injecting a curated medical knowledge base (UMLS) at inference time can make LLM answers more factual, explainable, and useful for medical QA.

Main Contribution

A prompt-augmentation framework that fetches UMLS concept definitions and relations via Concept Unique Identifiers (CUIs) and inserts them into LLM prompts.

A comparison of three terminology extraction methods: direct LLM extraction, indirect LLM extraction, and a biomedical NER model.
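The prompt-augmentation idea can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `TOY_KB` is a hypothetical in-memory stand-in for UMLS with placeholder CUIs and placeholder text, `extract_terms` is a toy substring matcher rather than any of the three extraction methods compared in the paper, and the prompt layout is not the authors' template.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    cui: str          # Concept Unique Identifier (placeholder values below)
    definition: str
    # (relation label, related concept name) pairs
    relations: list = field(default_factory=list)


# Hypothetical stand-in for UMLS; CUIs and text are placeholders, not real UMLS content.
TOY_KB = {
    "migraine": Concept(
        "CUI_A",
        "Placeholder definition of migraine.",
        [("may_be_treated_by", "triptan"), ("has_symptom", "nausea")],
    ),
    "triptan": Concept("CUI_B", "Placeholder definition of triptan."),
}


def extract_terms(question, vocabulary):
    """Toy term extraction: keep vocabulary terms that occur in the question."""
    lowered = question.lower()
    return [term for term in vocabulary if term in lowered]


def build_prompt(question, kb, max_relations=25):
    """Prepend definitions and up to max_relations relations per matched concept."""
    lines = []
    for term in extract_terms(question, kb):
        concept = kb[term]
        lines.append(f"- {term} ({concept.cui}): {concept.definition}")
        for label, target in concept.relations[:max_relations]:
            lines.append(f"  - {term} {label} {target}")
    if not lines:
        return question  # nothing matched, so send the question unchanged
    return "Medical knowledge:\n" + "\n".join(lines) + "\n\nQuestion: " + question


print(build_prompt("What helps a migraine?", TOY_KB))
```

In a real system the lookup would go through a licensed UMLS source, and the extraction step would be an LLM call or a biomedical NER model. The `max_relations` default of 25 mirrors the per-CUI cap mentioned under Limitations.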

Key Findings

UMLS augmentation raised LLaMa2-13b-chat ROUGE-1 from 19.07 to 19.97 on LiveQA.

Numbers: R-1 +0.90 (19.07 → 19.97)

Practical Use: If you run a mid-sized open model, adding UMLS definitions to prompts can measurably improve automated overlap metrics such as ROUGE-1.

Evidence Ref: Table 3

Adding UMLS to ChatGPT-3.5 did not increase automated scores and slightly lowered ROUGE-1.

Numbers: R-1 −0.11 (21.44 → 21.33)

Practical Use: For already strong LLMs, prompt-based UMLS injection may not improve automated metrics; rely on human judgment instead.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ROUGE R-1 (ChatGPT-3.5)	21.44	—	—	LiveQA test (104 q)	Table 3 reports ChatGPT-3.5 R-1 21.44	Table 3
ROUGE R-1 (ChatGPT-3.5 + UMLS Direct Extraction)	21.33	ChatGPT-3.5 21.44	−0.11	LiveQA test (104 q)	Table 3 shows slight drop when adding UMLS to ChatGPT-3.5	Table 3
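For readers who want to sanity-check numbers like these on their own data, the following is a minimal sketch of ROUGE-1 as clipped unigram overlap, plus the delta arithmetic used in the table. It assumes lowercase word tokenization with no stemming; the tokenization and the variant (precision, recall, or F1) behind the reported scores are not stated in this summary, so values from this sketch may not match an official ROUGE package.

```python
import re
from collections import Counter


def rouge1(candidate, reference):
    """Unigram-overlap ROUGE-1 with simple lowercase word tokenization."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def delta(augmented, baseline):
    """Score difference, rounded to two decimals as in the table."""
    return round(augmented - baseline, 2)


print(rouge1("the cat sat", "the cat ran fast"))
print(delta(19.97, 19.07), delta(21.33, 21.44))
```

The two `delta` calls reproduce the reported changes for LLaMa2-13b-chat and ChatGPT-3.5 from the scores quoted in Key Findings.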

What To Try In 7 Days

Build a simple pipeline: extract terms → map to UMLS CUI → fetch definitions → append to prompt.

Compare direct vs indirect term extraction on a small QA sample to see which yields fewer irrelevant relations.

Run a small blind review with clinicians on 20–30 common questions to judge factuality and completeness.
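The blind clinician review suggested above could be organized along these lines. This is a sketch under assumptions: the function names, the "A"/"B"/"tie" vote labels, and the randomization scheme are hypothetical choices for illustration, not the paper's review protocol.

```python
import random
from collections import Counter


def blind_pairs(questions, baseline_answers, augmented_answers, seed=0):
    """Randomize which system is shown as answer 'A' so reviewers cannot infer the source."""
    rng = random.Random(seed)
    pairs = []
    for q, base, aug in zip(questions, baseline_answers, augmented_answers):
        augmented_is_a = rng.random() < 0.5
        a, b = (aug, base) if augmented_is_a else (base, aug)
        pairs.append({"question": q, "A": a, "B": b, "augmented_is_a": augmented_is_a})
    return pairs


def tally(pairs, votes):
    """votes[i] is 'A', 'B', or 'tie' for pairs[i]; shares are from the augmented system's view."""
    counts = Counter()
    for pair, vote in zip(pairs, votes):
        if vote == "tie":
            counts["tie"] += 1
        elif (vote == "A") == pair["augmented_is_a"]:
            counts["better"] += 1
        else:
            counts["worse"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {key: counts[key] / total for key in ("better", "tie", "worse")}
```

Running `tally` once per criterion (factuality, completeness, readability) yields better/tie/worse shares in the same form as the physician results quoted in the summary. Keep the `augmented_is_a` field hidden from reviewers and use it only when scoring.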

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

TREC LiveQA 2017 (public dataset)

Risks & Boundaries

Limitations

Only top-25 relations per CUI are used; many fetched relations may be irrelevant.

Automatic metrics (ROUGE/BERTScore) are imperfect for medical QA and references lack expert revision.

When Not To Use

When you need up-to-the-minute medical updates not present in UMLS.

When privacy rules forbid external KB queries for patient data.

Failure Modes

Incorrect or missing medical term extraction leads to wrong UMLS retrieval and hallucinations.

Retrieving many irrelevant relations dilutes useful context and confuses the LLM.
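One way to probe the irrelevant-relation failure mode is to filter retrieved relations before they reach the prompt. The sketch below ranks relation triples by word overlap with the question and drops those with no overlap. This lexical heuristic is an assumption chosen for simplicity, not a method from the paper; an embedding-based ranker would be a natural alternative. The triples in the usage lines are invented examples.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; underscores in relation labels act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_relations(question, relations, top_k=25):
    """relations: list of (subject, relation_label, object) triples.

    Keeps triples that share at least one token with the question,
    ordered by overlap count (ties keep their original order).
    """
    q = tokens(question)
    scored = []
    for index, triple in enumerate(relations):
        score = len(q & tokens(" ".join(triple)))
        if score > 0:
            scored.append((score, index, triple))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [triple for _, _, triple in scored[:top_k]]


relations = [
    ("aspirin", "interacts_with", "warfarin"),
    ("migraine", "classified_as", "neurological disorder"),
    ("migraine", "may_be_treated_by", "ibuprofen"),
]
print(rank_relations("Can ibuprofen treat a migraine headache?", relations))
```

Exact token matching misses inflections ("treat" versus "treated") and synonyms, so treat the output as a first-pass filter to compare against the unfiltered prompt rather than a replacement for clinical relevance judgments.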

Core Entities

Models

ChatGPT-3.5, LLaMa2-13b-chat

Metrics

ROUGE R-1, ROUGE R-2, ROUGE R-L, BERTScore P, BERTScore R, BERTScore F1, Physician Factuality, Physician Completeness, Physician Readability, Physician Relevance

Datasets

TREC LiveQA 2017 (LiveQA)

Benchmarks

ROUGE, BERTScore

