Use UMLS definitions and relations to make LLM answers more factual and complete for medical questions

October 4, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

UMLS prompt injection is a practical, low-cost way to add domain facts. Evidence is limited to 104 automatically scored examples and 20 physician-reviewed questions, so further validation is needed before deployment.

Citations: 8

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 50%

Novelty: 40%

Authors

Rui Yang, Edison Marrese-Taylor, Yuhe Ke, Lechao Cheng, Qingyu Chen, Irene Li


Why It Matters For Business

Injecting curated UMLS content into prompts can raise factuality and completeness without costly model fine-tuning; it is a lower-cost way to make LLM answers safer for medical use, though user readability may require UX work.

Who Should Care

Summary TLDR

The paper injects structured medical knowledge from the Unified Medical Language System (UMLS) into LLM prompts to improve medical question answering. The authors test LLaMa2-13b-chat and ChatGPT-3.5 on 104 LiveQA questions with automatic metrics (ROUGE, BERTScore) and run a blind physician review on 20 questions. Results: automated scores improve for LLaMa2 but not for ChatGPT; physicians judged UMLS-augmented ChatGPT-3.5 better on factuality (UMLS better for 40% of questions, tie 30%, worse 30%) and completeness (UMLS better for 55%), while unaugmented ChatGPT retained a slight edge in readability. Main trade-offs: added domain detail can reduce readability, and irrelevant UMLS relations can add noise.
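Because only 20 questions were reviewed, the physician percentages are easier to judge as raw counts. A minimal sketch, assuming each blinded pairwise judgment is stored as one of three hypothetical labels (`"umls"`, `"tie"`, `"baseline"`); the labels and the `tally` helper are illustrative and not from the paper. The example list is constructed to reproduce the reported 40/30/30 factuality split.

```python
from collections import Counter

def tally(judgments):
    """Return the percentage share of each label in a list of pairwise judgments."""
    counts = Counter(judgments)
    total = len(judgments)
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical labels: 8 wins, 6 ties, 6 losses out of 20 reviewed questions.
factuality = ["umls"] * 8 + ["tie"] * 6 + ["baseline"] * 6
shares = tally(factuality)
print(shares)
```

With 20 questions, each question shifts a share by 5 percentage points, so the gap between 40% and 30% corresponds to two questions.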

Problem Statement

LLMs can generate fluent but medically incorrect or biased answers because they lack grounded, structured medical knowledge. Fine-tuning on medical data is costly, and the resulting knowledge goes stale. The paper asks whether injecting a curated medical knowledge base (UMLS) at inference can make LLM answers more factual, explainable, and useful for medical QA.

Main Contribution

A prompt-augmentation framework that fetches UMLS concept definitions and relations via Concept Unique Identifiers (CUIs) and inserts them into LLM prompts.

A comparison of three terminology extraction methods: direct LLM extraction, indirect LLM extraction, and a biomedical NER model.
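The prompt-augmentation idea can be sketched as a small data structure plus a rendering function. This is an illustration under stated assumptions, not the authors' implementation: the `UmlsConcept` class, the `build_prompt` function, the prompt wording, and the placeholder CUI and texts are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class UmlsConcept:
    cui: str          # Concept Unique Identifier
    name: str
    definition: str
    # (relation_label, related_concept_name) pairs
    relations: list = field(default_factory=list)

def build_prompt(question, concepts):
    """Prepend UMLS definitions and relations to a question as plain-text context."""
    lines = ["Use the following medical knowledge when answering.", ""]
    for c in concepts:
        lines.append(f"{c.name} ({c.cui}): {c.definition}")
        for label, related in c.relations:
            lines.append(f"  - {c.name} {label} {related}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

# Placeholder record: the CUI and texts below are made up for illustration.
concept = UmlsConcept(
    cui="C0000000",
    name="Example condition",
    definition="A placeholder definition standing in for UMLS text.",
    relations=[("may_be_treated_by", "Example drug")],
)
prompt = build_prompt("What treats the example condition?", [concept])
print(prompt)
```

Keeping the knowledge block separate from the question makes it easy to compare the same question with and without augmentation.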

Key Findings

UMLS augmentation raised LLaMa2-13b-chat ROUGE-1 from 19.07 to 19.97 on LiveQA.

Numbers: R-1 +0.90 (19.07 → 19.97)

Practical Use: If you run a mid-sized open model, adding UMLS definitions in prompts can measurably improve automated summary metrics.

Evidence Ref: Table 3

Adding UMLS to ChatGPT-3.5 did not increase automated scores and slightly lowered ROUGE-1.

Numbers: R-1 −0.11 (21.44 → 21.33)

Practical Use: For already strong LLMs, prompt-based UMLS injection may not improve automated metrics; rely on human judgment instead.

Evidence Ref: Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
ROUGE R-1 (ChatGPT-3.5) | 21.44 | n/a | n/a | LiveQA test (104 q) | Table 3 reports ChatGPT-3.5 R-1 21.44 | Table 3
ROUGE R-1 (ChatGPT-3.5 + UMLS Direct Extraction) | 21.33 | ChatGPT-3.5 21.44 | −0.11 | LiveQA test (104 q) | Table 3 shows slight drop when adding UMLS to ChatGPT-3.5 | Table 3
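The reported deltas are plain differences between augmented and baseline ROUGE-1 scores. A minimal sketch that recomputes them from the four scores quoted above; the `delta` helper is hypothetical, and the relative change is derived here for context and is not a figure reported in the paper.

```python
def delta(baseline, augmented):
    """Return (absolute change, relative change in %), each rounded to two decimals."""
    absolute = round(augmented - baseline, 2)
    relative = round(100 * (augmented - baseline) / baseline, 2)
    return absolute, relative

llama = delta(19.07, 19.97)    # LLaMa2-13b-chat, without vs with UMLS
chatgpt = delta(21.44, 21.33)  # ChatGPT-3.5, without vs with UMLS
print(llama, chatgpt)
```

Both changes are small in relative terms, which is one reason the physician review carries more weight than the automated scores.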

What To Try In 7 Days

Build a simple pipeline: extract terms → map to UMLS CUI → fetch definitions → append to prompt.

Compare direct vs indirect term extraction on a small QA sample to see which yields fewer irrelevant relations.

Run a small blind review with clinicians on 20–30 common questions to judge factuality and completeness.
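The four pipeline steps can be prototyped before touching a real UMLS installation. A minimal sketch, assuming in-memory dictionaries as stand-ins for UMLS lookups and a naive substring matcher in place of LLM- or NER-based extraction; every name, CUI, and definition here is a placeholder, not real UMLS content.

```python
# Toy stand-ins for UMLS lookups; real CUIs and definitions would come from UMLS.
TERM_TO_CUI = {"example condition": "C0000001", "example drug": "C0000002"}
CUI_TO_DEFINITION = {
    "C0000001": "Placeholder definition of the example condition.",
    "C0000002": "Placeholder definition of the example drug.",
}

def extract_terms(question):
    """Naive dictionary matcher standing in for LLM- or NER-based term extraction."""
    text = question.lower()
    return [term for term in TERM_TO_CUI if term in text]

def augment(question):
    """extract terms -> map to CUI -> fetch definitions -> append to prompt."""
    terms = extract_terms(question)
    cuis = [TERM_TO_CUI[t] for t in terms]
    definitions = [CUI_TO_DEFINITION[c] for c in cuis if c in CUI_TO_DEFINITION]
    if not definitions:
        return question  # nothing matched: fall back to the unmodified question
    context = "\n".join(f"- {d}" for d in definitions)
    return f"{question}\n\nRelevant UMLS definitions:\n{context}"

augmented = augment("Is example drug safe for example condition?")
print(augmented)
```

Swapping `extract_terms` for a different extractor while holding the rest fixed is one way to run the direct-versus-indirect comparison on a small sample.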

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

TREC LiveQA 2017 (public dataset)

Risks & Boundaries

Limitations

Only top-25 relations per CUI are used; many fetched relations may be irrelevant.

Automatic metrics (ROUGE/BERTScore) are imperfect for medical QA and references lack expert revision.
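To see why lexical-overlap metrics are an imperfect proxy, consider a simplified ROUGE-1 F1. This sketch assumes lowercase whitespace tokenization with no stemming, so it is not the official ROUGE implementation and will not reproduce the paper's scores; the example sentences are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 on lowercase whitespace tokens (no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "take the tablet with food"
verbatim = rouge1_f1("take the tablet with food", reference)
paraphrase = rouge1_f1("swallow this pill during a meal", reference)
print(verbatim, paraphrase)
```

The paraphrase conveys the same instruction but shares no words with the reference, so it scores zero. An answer can therefore be correct and still score poorly, or score well while being wrong.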

When Not To Use

When you need up-to-the-minute medical updates not present in UMLS.

When privacy rules forbid external KB queries for patient data.

Failure Modes

Incorrect or missing medical term extraction leads to wrong UMLS retrieval and hallucinations.

Retrieving many irrelevant relations dilutes useful context and confuses the LLM.
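One way to limit noise from irrelevant relations is to filter them before they reach the prompt. The heuristic below is an assumption for illustration, not the paper's method: it keeps only relations whose related-concept name shares a word with the question, then applies a cap (the default of 25 mirrors the top-25 limit noted under Limitations). The relation labels and names are placeholders.

```python
def filter_relations(question, relations, max_relations=25):
    """Keep relations whose related-concept name shares a word with the question."""
    question_words = set(question.lower().replace("?", " ").split())
    kept = [
        (label, name)
        for label, name in relations
        if question_words & set(name.lower().split())
    ]
    return kept[:max_relations]

relations = [
    ("may_be_treated_by", "Example drug"),
    ("has_finding_site", "Unrelated structure"),
]
kept = filter_relations("Is example drug effective?", relations)
print(kept)
```

Word overlap is crude: it drops relevant relations that use synonyms, so an embedding-based similarity score may be a better filter in practice.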

Core Entities

Models

ChatGPT-3.5, LLaMa2-13b-chat

Metrics

ROUGE R-1, ROUGE R-2, ROUGE R-L, BERTScore P, BERTScore R, BERTScore F1, Physician Factuality, Physician Completeness, Physician Readability, Physician Relevance

Datasets

TREC LiveQA 2017 (LiveQA)

Benchmarks

ROUGE, BERTScore