Overview
Production Readiness
0.5
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
8
Why It Matters For Business
Injecting curated UMLS content into prompts can raise factuality and completeness without costly model fine-tuning; it is a lower-cost way to make LLM answers safer for medical use, though user readability may require UX work.
Summary TLDR
The paper adds structured medical knowledge from the Unified Medical Language System (UMLS) into LLM prompts to improve medical question answering. The authors test LLaMa2-13b-chat and ChatGPT-3.5 on 104 LiveQA questions with automatic metrics (ROUGE, BERTScore) and a blind physician review on 20 questions. Results: automated scores improve for LLaMa2 but not for ChatGPT; physicians judged UMLS-augmented ChatGPT-3.5 better on factuality (UMLS better for 40% of questions, tie 30%, worse 30%) and completeness (UMLS better 55%), while base ChatGPT-3.5 retained a slight edge in readability. Main trade-offs: added domain detail can reduce readability, and irrelevant UMLS relations can add noise.
Problem Statement
LLMs can generate fluent but medically incorrect or biased answers because they lack grounded, structured medical knowledge. Fine-tuning on medical data is costly, and the resulting model goes stale as medical knowledge changes. The paper asks whether injecting a curated medical knowledge base (UMLS) at inference time can make LLM answers more factual, explainable, and useful for medical QA.
Main Contribution
A prompt-augmentation framework that fetches UMLS concept definitions and relations via Concept Unique Identifiers (CUIs) and inserts them into LLM prompts.
A comparison of three terminology extraction methods: direct LLM extraction, indirect LLM extraction, and a biomedical NER model.
Evaluation on TREC LiveQA (104 test questions) with ROUGE/BERTScore and a blind physician review (20 questions) across Factuality, Completeness, Readability, and Relevance.
Empirical finding that UMLS augmentation helps a smaller model (LLaMa2-13b-chat) on automated metrics and improves physician-judged factuality and completeness for ChatGPT-3.5, with a readability trade-off.
Key Findings
UMLS augmentation raised LLaMa2-13b-chat ROUGE-1 from 19.07 to 19.97 on LiveQA.
Adding UMLS to ChatGPT-3.5 did not increase automated scores and slightly lowered ROUGE-1.
Physician blind review found UMLS-augmented ChatGPT-3.5 better on factuality for 40% of sampled questions and better on completeness for 55%.
Readability slightly favored base ChatGPT-3.5 over UMLS-augmented ChatGPT-3.5.
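The physician percentages above are shares of pairwise verdicts. As a minimal sketch of that arithmetic, assuming one verdict per question on the 20-question sample, the Python below turns a list of verdicts into percentages. The verdict list is constructed only to match the reported factuality split (40% better, 30% tie, 30% worse); it is not the paper's per-question data, and the function name is hypothetical.

```python
from collections import Counter


def preference_shares(verdicts):
    """Return the percentage of 'better', 'tie', and 'worse' verdicts."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    counts = Counter(verdicts)
    total = len(verdicts)
    return {label: 100 * counts[label] / total
            for label in ("better", "tie", "worse")}


# Illustrative verdicts only: 8 + 6 + 6 = 20 questions, chosen to
# reproduce the reported 40% / 30% / 30% factuality split.
verdicts = ["better"] * 8 + ["tie"] * 6 + ["worse"] * 6
shares = preference_shares(verdicts)
print(shares)
```

With only 20 questions, each verdict moves a share by 5 percentage points, which is worth keeping in mind when reading the 40% and 55% figures.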
Results
ROUGE R-1 (ChatGPT-3.5)
ROUGE R-1 (ChatGPT-3.5 + UMLS Direct Extraction)
ROUGE R-1 (LLaMa2-13b-chat)
Physician judgments (Factuality)
Physician judgments (Completeness)
Who Should Care
What To Try In 7 Days
Build a simple pipeline: extract terms → map to UMLS CUI → fetch definitions → append to prompt.
Compare direct vs indirect term extraction on a small QA sample to see which yields fewer irrelevant relations.
Run a small blind review with clinicians on 20–30 common questions to judge factuality and completeness.
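The first step above (extract terms → map to UMLS CUI → fetch definitions → append to prompt) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the two dictionaries are hypothetical in-memory stand-ins for UMLS lookups, the CUI strings and definitions are placeholders rather than real UMLS content, and simple dictionary matching stands in for the LLM-based or NER-based term extraction the paper compares.

```python
import re

# Hypothetical stand-ins for UMLS lookups. A real pipeline would query a
# licensed UMLS source; these CUIs and definitions are placeholders.
TERM_TO_CUI = {
    "migraine": "CUI_MIGRAINE",
    "ibuprofen": "CUI_IBUPROFEN",
}
CUI_TO_DEFINITION = {
    "CUI_MIGRAINE": "Placeholder definition of migraine.",
    "CUI_IBUPROFEN": "Placeholder definition of ibuprofen.",
}


def extract_terms(question):
    """Naive dictionary match standing in for LLM or NER extraction."""
    tokens = re.findall(r"[a-z]+", question.lower())
    terms = []
    for token in tokens:
        if token in TERM_TO_CUI and token not in terms:
            terms.append(token)
    return terms


def build_prompt(question):
    """Prepend definitions for recognized terms to the question."""
    lines = []
    for term in extract_terms(question):
        cui = TERM_TO_CUI[term]
        definition = CUI_TO_DEFINITION.get(cui)
        if definition:
            lines.append(f"- {term} ({cui}): {definition}")
    if not lines:
        # No recognized terms: fall back to the unaugmented question.
        return question
    context = "\n".join(lines)
    return f"Background knowledge:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("Can I take ibuprofen for a migraine?")
print(prompt)
```

Keeping extraction, lookup, and prompt assembly as separate functions makes it easy to swap in a different extraction method and compare the resulting prompts on the same questions.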
Reproducibility
Data Urls
- TREC LiveQA 2017 (public dataset)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only top-25 relations per CUI are used; many fetched relations may be irrelevant.
- Automatic metrics (ROUGE/BERTScore) are imperfect for medical QA, and the reference answers have not been revised by experts.
- Physician evaluation is small (20 questions), limiting generality.
- Term extraction can miss or overgenerate concepts; wrong extractions produce bad retrievals.
When Not To Use
- When you need up-to-the-minute medical updates not present in UMLS.
- When privacy rules forbid external KB queries for patient data.
- When user-facing readability is the top priority and extra technical detail will confuse users.
Failure Modes
- Incorrect or missing medical term extraction leads to wrong UMLS retrieval and hallucinations.
- Retrieving many irrelevant relations dilutes useful context and confuses the LLM.
- A dense prompt of definitions can reduce readability or make the model ignore the important context.
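One way to probe the irrelevant-relation failure mode is to rank fetched relations by word overlap with the question before adding them to the prompt. The sketch below is an assumption for illustration and is not the paper's selection method: the function names, the relation strings, and the overlap heuristic are all hypothetical, and the default cap of 25 simply mirrors the per-CUI limit noted under Limitations.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_relations(question, relations, k=25):
    """Keep up to k relations that share at least one token with the question."""
    question_tokens = tokenize(question)
    scored = []
    for position, relation in enumerate(relations):
        overlap = len(question_tokens & tokenize(relation))
        if overlap > 0:
            scored.append((overlap, position, relation))
    # Highest overlap first; the original order breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [relation for _, _, relation in scored[:k]]


# Hypothetical relation strings, not real UMLS output.
question = "What are the side effects of ibuprofen?"
relations = [
    "ibuprofen has_side_effect nausea",
    "ibuprofen has_tradename ExampleBrand",
    "aspirin isa analgesic",
]
print(top_relations(question, relations, k=1))
```

Lexical overlap is crude (it misses synonyms and inflections such as "effect" versus "effects"), so it is best treated as a baseline to compare against rather than a fix.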
Core Entities
Models
- ChatGPT-3.5
- LLaMa2-13b-chat
Metrics
- ROUGE R-1
- ROUGE R-2
- ROUGE R-L
- BERTScore P
- BERTScore R
- BERTScore F1
- Physician Factuality
- Physician Completeness
- Physician Readability
- Physician Relevance
Datasets
- TREC LiveQA 2017 (LiveQA)
Benchmarks
- ROUGE
- BERTScore
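For readers unfamiliar with ROUGE R-1, it measures unigram overlap between a generated answer and a reference answer. The Python below is a simplified sketch of that idea using clipped word counts. It omits the stemming and tokenization options of the official ROUGE tooling, so its numbers will not exactly match published scores, and it returns fractions between 0 and 1, whereas the scores quoted in this summary appear to be on a 0 to 100 scale. The function name is hypothetical.

```python
from collections import Counter
import re


def rouge1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram precision, recall, and F1."""
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores)
```

Because the metric rewards shared wording rather than correctness, an answer can gain medical detail and still score lower, which is consistent with the gap between automated and physician results reported here.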

