Use UMLS definitions and relations to make LLM answers more factual and complete for medical questions

October 4, 2023 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.4

Cost Impact Score

0.7

Citation Count

8

Authors

Rui Yang, Edison Marrese-Taylor, Yuhe Ke, Lechao Cheng, Qingyu Chen, Irene Li

Why It Matters For Business

Injecting curated UMLS content into prompts can raise factuality and completeness without costly model fine-tuning; it is a lower-cost way to make LLM answers safer for medical use, though user readability may require UX work.

Summary TLDR

The paper injects structured medical knowledge from the Unified Medical Language System (UMLS) into LLM prompts to improve medical question answering. The authors test LLaMa2-13b-chat and ChatGPT-3.5 on 104 LiveQA questions using automatic metrics (ROUGE, BERTScore) and a blind physician review of 20 questions. Automated scores improve for LLaMa2 but not for ChatGPT. Physicians judged UMLS-augmented ChatGPT-3.5 better on factuality (better on 40% of questions, tied on 30%, worse on 30%) and on completeness (better on 55%), while unaugmented ChatGPT-3.5 kept a slight edge in readability (45% vs 40%). Main trade-offs: the added domain detail can reduce readability, and irrelevant UMLS relations can add noise.

Problem Statement

LLMs can generate fluent but medically incorrect or biased answers because they lack grounded, structured medical knowledge. Fine-tuning on medical data is costly, and the resulting model goes stale as medical knowledge changes. The paper asks whether injecting a curated medical knowledge base (UMLS) at inference time can make LLM answers more factual, explainable, and useful for medical QA.

Main Contribution

A prompt-augmentation framework that fetches UMLS concept definitions and relations via Concept Unique Identifiers (CUIs) and inserts them into LLM prompts.

A comparison of three terminology extraction methods: direct LLM extraction, indirect LLM extraction, and a biomedical NER model.

Evaluation on TREC LiveQA (104 test questions) with ROUGE/BERTScore and a blind physician review (20 questions) across Factuality, Completeness, Readability, and Relevance.

Empirical finding that UMLS-augmentation helps a smaller model (LLaMa2-13b-chat) on automated metrics and improves physician-judged factuality and completeness for ChatGPT-3.5, with a readability trade-off.
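The prompt-augmentation step in the first contribution can be sketched as follows. This is a minimal illustration, not the authors' code: the `Concept` record, the prompt wording, and the function name are assumptions, and the concept is taken as already fetched from UMLS (no UMLS API call is shown). The default cap of 25 relations per concept follows the limit mentioned under Limitations.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One UMLS concept as already fetched; field names are illustrative."""
    cui: str
    name: str
    definition: str
    # Each relation is (relation label, related concept name).
    relations: list[tuple[str, str]] = field(default_factory=list)


def build_augmented_prompt(question: str, concepts: list[Concept],
                           max_relations: int = 25) -> str:
    """Place definitions and a capped number of relations before the question."""
    lines = ["Use the following medical knowledge if it is relevant.", ""]
    for c in concepts:
        lines.append(f"{c.name} ({c.cui}): {c.definition}")
        for label, target in c.relations[:max_relations]:
            lines.append(f"  - {c.name} {label} {target}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Placeholder data for illustration only; the CUI is not a real identifier
# and the definition is not copied from UMLS.
example = Concept(
    cui="C0000000",
    name="Hypertension",
    definition="Persistently high arterial blood pressure.",
    relations=[("may_be_treated_by", "Antihypertensive Agents")],
)
print(build_augmented_prompt("What is hypertension?", [example]))
```

Keeping the fetch step and the formatting step separate makes it easy to swap in any of the three term extraction methods without touching the prompt template.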

Key Findings

UMLS augmentation raised LLaMa2-13b-chat ROUGE-1 from 19.07 to 19.97 on LiveQA.

Numbers: R-1 +0.90 (19.07 → 19.97)

Adding UMLS to ChatGPT-3.5 did not increase automated scores and slightly lowered ROUGE-1.

Numbers: R-1 −0.11 (21.44 → 21.33)

Physician blind review found UMLS-augmented ChatGPT-3.5 better on factuality for 40% of sampled questions and better on completeness for 55%.

Numbers: Factuality: UMLS better 40% (tie 30%, worse 30%); Completeness: UMLS better 55%

Readability slightly favored base ChatGPT-3.5 over UMLS-augmented ChatGPT-3.5.

Numbers: Readability wins: ChatGPT 45% vs UMLS 40%

Results

ROUGE R-1 (ChatGPT-3.5)

Value: 21.44

ROUGE R-1 (ChatGPT-3.5 + UMLS Direct Extraction)

Value: 21.33

Baseline: ChatGPT-3.5, 21.44

ROUGE R-1 (LLaMa2-13b-chat)

Value: 19.07 → 19.97 (with UMLS Direct Extraction)

Baseline: LLaMa2-13b-chat, 19.07

Physician judgments (Factuality)

Value: UMLS better 40% / tie 30% / worse 30%

Baseline: ChatGPT-3.5

Physician judgments (Completeness)

Value: UMLS better 55%

Baseline: ChatGPT-3.5
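For readers unfamiliar with the R-1 figures above, ROUGE-1 measures unigram overlap between a generated answer and a reference answer. The sketch below is a simplified version under stated assumptions: it lowercases and splits on non-alphanumeric characters, applies no stemming, and returns an F1 value on a 0 to 1 scale. The summary does not say which ROUGE variant or tokenizer the paper used, so this will not reproduce the reported numbers exactly.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat on mat")
print(round(score, 4))
```

Because the metric rewards shared words rather than correct medical content, a more complete answer can score lower than a shorter one that happens to echo the reference, which is consistent with the gap between the automated and physician results.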

Who Should Care

What To Try In 7 Days

Build a simple pipeline: extract terms → map to UMLS CUI → fetch definitions → append to prompt.

Compare direct vs indirect term extraction on a small QA sample to see which yields fewer irrelevant relations.

Run a small blind review with clinicians on 20–30 common questions to judge factuality and completeness.
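The blind clinician review in the last step can be organized with a small script like the one below. It is a sketch under assumptions: the function names and the `'A'`/`'B'`/`'tie'` vote format are hypothetical, and the paper's own review protocol may differ. The idea is to randomize which answer is shown as "A" so reviewers cannot tell which system produced it, then map the votes back to systems.

```python
import random
from collections import Counter


def make_blind_pairs(questions, baseline_answers, augmented_answers, seed=0):
    """Randomly assign each answer pair to slots A and B, remembering the mapping."""
    rng = random.Random(seed)
    pairs = []
    for q, base, aug in zip(questions, baseline_answers, augmented_answers):
        augmented_is_a = rng.random() < 0.5
        a, b = (aug, base) if augmented_is_a else (base, aug)
        pairs.append({"question": q, "A": a, "B": b,
                      "augmented_is_a": augmented_is_a})
    return pairs


def tally(pairs, votes):
    """Convert per-question votes ('A', 'B', or 'tie') into win/tie/loss shares."""
    counts = Counter()
    for pair, vote in zip(pairs, votes):
        if vote == "tie":
            counts["tie"] += 1
        elif (vote == "A") == pair["augmented_is_a"]:
            counts["augmented"] += 1
        else:
            counts["baseline"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"augmented": 0.0, "tie": 0.0, "baseline": 0.0}
    return {k: counts[k] / total for k in ("augmented", "tie", "baseline")}
```

Run `tally` once per criterion (factuality, completeness, readability, relevance) so the shares can be compared with the better/tie/worse breakdown reported above.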

Reproducibility

Data URLs

  • TREC LiveQA 2017 (public dataset)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only top-25 relations per CUI are used; many fetched relations may be irrelevant.
  • Automatic metrics (ROUGE/BERTScore) are imperfect for medical QA and references lack expert revision.
  • Physician evaluation is small (20 questions), limiting generality.
  • Term extraction can miss or overgenerate concepts; wrong extractions produce bad retrievals.

When Not To Use

  • When you need up-to-the-minute medical updates not present in UMLS.
  • When privacy rules forbid external KB queries for patient data.
  • When user-facing readability is the top priority and extra technical detail will confuse users.

Failure Modes

  • Incorrect or missing medical term extraction leads to wrong UMLS retrieval and hallucinations.
  • Retrieving many irrelevant relations dilutes useful context and confuses the LLM.
  • A dense prompt of definitions can reduce readability or cause the model to overlook the important context.
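One possible mitigation for the irrelevant-relation failure mode is to filter relations before they reach the prompt. The sketch below ranks relation strings by how many tokens they share with the question and keeps the top few. This lexical-overlap heuristic is an assumption introduced here for illustration, not a method from the paper, and the function and parameter names are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_relations(question: str, relations: list[str],
                     top_k: int = 5, min_overlap: int = 1) -> list[str]:
    """Keep the top_k relations sharing at least min_overlap tokens with the question."""
    q = _tokens(question)
    scored = [(len(q & _tokens(r)), i, r) for i, r in enumerate(relations)]
    kept = [t for t in scored if t[0] >= min_overlap]
    # Highest overlap first; ties keep their original order.
    kept.sort(key=lambda t: (-t[0], t[1]))
    return [r for _, _, r in kept[:top_k]]


# Illustrative relation strings, not actual UMLS output.
relations = [
    "metformin may_treat type 2 diabetes",
    "metformin has side effects such as nausea",
    "insulin isa hormone",
]
print(filter_relations("What are the side effects of metformin?", relations))
```

A lexical filter will miss synonyms, so an embedding-based similarity score may work better in practice; the point is only that some relevance step sits between retrieval and prompt construction.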

Core Entities

Models

  • ChatGPT-3.5
  • LLaMa2-13b-chat

Metrics

  • ROUGE R-1
  • ROUGE R-2
  • ROUGE R-L
  • BERTScore P
  • BERTScore R
  • BERTScore F1
  • Physician Factuality
  • Physician Completeness
  • Physician Readability
  • Physician Relevance

Datasets

  • TREC LiveQA 2017 (LiveQA)

Benchmarks

  • ROUGE
  • BERTScore