Domain-specific RAG cuts hallucinated citations in ophthalmology long-form answers

September 20, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

6

Authors

Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D. L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen

Why It Matters For Business

RAG with a focused domain corpus can substantially reduce fabricated citations, improving traceability for consumer health features, but it may require extra tuning to avoid small drops in perceived answer quality.

Summary TLDR

The authors built an ophthalmology-focused Retrieval-Augmented Generation (RAG) pipeline using ~70k domain documents and evaluated it on 100 real consumer health questions. RAG reduced hallucinated references from 45.3% to 18.8% and raised correct references from 20.6% to 54.5%. However, RAG slightly lowered human-rated accuracy (3.52→3.23) and completeness (3.47→3.27) while improving evidence attribution (1.86→2.48). Three issues stood out: the LLM sometimes ignored top retrieved documents, selected irrelevant ones, or failed to synthesize the retrieved evidence.

Problem Statement

LLMs can generate fluent but unsupported or fabricated medical citations. RAG (fetching domain documents at inference) promises to ground answers, but its real-world effects on long-form medical answers, evidence selection, and attribution are under-studied in domain-specific settings like ophthalmology.

Main Contribution

Curated an ophthalmology corpus (~70,000 documents) combining journal abstracts, AAO practice patterns, and EyeWiki.

Built a RAG pipeline using 1024-token snippets, text-embedding-ada-002 embeddings, and dense cosine retrieval to augment GPT-3.5.

Systematically evaluated 100 consumer questions and >500 generated references with 10 healthcare professionals on factuality, selection/ranking, attribution, accuracy, and completeness; released code and materials.
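The chunking and dense cosine retrieval steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the `chunk`, `cosine`, `retrieve`, and `build_prompt` helpers, the toy vectors, and the prompt wording are all hypothetical; whitespace-separated words stand in for model tokens; and real vectors would come from text-embedding-ada-002 rather than hand-written lists.

```python
import math

def chunk(text, max_tokens=1024):
    """Split text into snippets of at most max_tokens whitespace-separated words.
    Assumption: words approximate tokens; a real pipeline would use the model tokenizer."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=10):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def build_prompt(question, snippets):
    """Number the retrieved snippets so the model can cite them by index.
    The instruction wording here is a placeholder, not the prompt used in the study."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(snippets, start=1))
    return f"Use the numbered sources to answer and cite them.\n{context}\n\nQuestion: {question}"

# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
toy_index = {
    "amd_guideline": [0.9, 0.1, 0.0],
    "cataract_abstract": [0.1, 0.9, 0.1],
    "glaucoma_wiki": [0.0, 0.2, 0.9],
}
top = retrieve([1.0, 0.2, 0.0], toy_index, k=2)
prompt = build_prompt("What is AMD?", [doc_id for doc_id, _ in top])
```

In a real setup the index would hold one vector per snippet, and the numbered snippets passed to `build_prompt` would be the snippet text rather than document ids.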

Key Findings

RAG greatly increased the share of correct references in LLM outputs.

Numbers: Correct refs: 20.6% → 54.5% (252 total refs without RAG vs 277 with RAG)

RAG reduced hallucinated references but did not eliminate them.

Numbers: Hallucinated refs: 45.3% → 18.8%

LLMs only used a portion of top documents retrieved by RAG.

Numbers: 62.5% of references were from top-10 retrieved; mean rank 4.89

RAG improved evidence attribution but slightly lowered human-rated answer accuracy; the smaller drop in completeness did not reach statistical significance.

Numbers: Evidence attribution: 1.86→2.48 (P<0.001); Accuracy: 3.52→3.23 (P=0.035); Completeness: 3.47→3.27 (P=0.17)

Using domain-specific resources at scale is feasible and useful for RAG.

Numbers: ~70,000 ophthalmology documents indexed; system returned top-10 snippets without hitting token limits

Results

Total references (non-RAG vs RAG)

Value: 252 vs 277 references

Baseline: LLM without RAG

Reference correctness (fraction of correct refs)

Value: 20.6% → 54.5%

Baseline: LLM without RAG

Hallucinated references

Value: 45.3% → 18.8%

Baseline: LLM without RAG

Selection of top-10 retrieved docs

Value: 62.5% selected from top-10; mean rank 4.89 (SD 2.40)

Baseline: N/A

Human-rated accuracy

Value: 3.52 → 3.23

Baseline: LLM without RAG

Human-rated completeness (1-5)

Value: 3.47 → 3.27

Baseline: LLM without RAG

Evidence attribution (1-5)

Value: 1.86 → 2.48

Baseline: LLM without RAG
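To keep absolute changes (percentage points or score points) separate from relative changes when reading these results, a small calculation may help. The before/after figures are the ones reported in this summary; the `change` helper and the dictionary keys are illustrative names, not part of the study's materials.

```python
# (without RAG, with RAG) pairs as reported in this summary.
results = {
    "correct_refs_pct": (20.6, 54.5),
    "hallucinated_refs_pct": (45.3, 18.8),
    "accuracy": (3.52, 3.23),
    "completeness": (3.47, 3.27),
    "evidence_attribution": (1.86, 2.48),
}

def change(before, after):
    """Return (absolute change, relative change in percent), rounded for display."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)

deltas = {name: change(before, after) for name, (before, after) in results.items()}
for name, (absolute, relative) in deltas.items():
    print(f"{name}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```

On these figures, correct references rise by 33.9 percentage points and hallucinated references fall by 26.5 points, while the accuracy and completeness scores move by less than a third of a point each.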

What To Try In 7 Days

Assemble a small domain corpus (guidelines + key abstracts + wiki pages).

Index snippets and generate embeddings with text-embedding-ada-002.

Run a basic RAG loop (top-10 retrieval + GPT-3.5) on 20 representative questions and log citations used by the LLM versus retrieved ranks.
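The logging step in the last item could look something like the sketch below. It assumes each retrieved snippet and each citation in the answer can be mapped to a shared document id, which is an assumption about your own pipeline; `citation_rank_report` and its field names are hypothetical.

```python
from statistics import mean

def citation_rank_report(retrieved_ids, cited_ids):
    """Compare the documents an answer cites against the retrieved ranking.

    retrieved_ids: document ids in rank order (rank 1 first).
    cited_ids: document ids the model cited in its answer.
    """
    rank_of = {doc_id: rank for rank, doc_id in enumerate(retrieved_ids, start=1)}
    cited_ranks = [rank_of[d] for d in cited_ids if d in rank_of]
    unmatched = [d for d in cited_ids if d not in rank_of]
    return {
        # Fraction of the answer's citations that point at a retrieved document.
        "cited_from_retrieved_share": len(cited_ranks) / len(cited_ids) if cited_ids else 0.0,
        # Average retrieval rank of the citations that did match.
        "mean_cited_rank": mean(cited_ranks) if cited_ranks else None,
        # Citations with no retrieved counterpart; candidates for manual checking.
        "unmatched_citations": unmatched,
    }

# Hypothetical example: five retrieved documents, three citations in the answer.
report = citation_rank_report(
    retrieved_ids=["d1", "d2", "d3", "d4", "d5"],
    cited_ids=["d2", "d4", "x9"],
)
```

Aggregating these reports over the 20 questions gives a rough local counterpart to the selection and rank figures reported above, and the unmatched list points at citations worth verifying by hand.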

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only GPT-3.5 was evaluated; other LLMs may behave differently.
  • Default RAG settings were used; other retrieval models or embeddings might change outcomes.
  • Corpus confined to ophthalmology; general medical or cross-domain corpora may yield different results.

When Not To Use

  • High-stakes clinical decision-making without clinician verification.
  • Tasks requiring guaranteed complete synthesis from all top evidence without further tuning.

Failure Modes

  • LLM ignores top-ranked retrieved documents, leaving hallucinated citations.
  • Irrelevant retrieved documents are incorporated and reduce accuracy/completeness.
  • Residual hallucinated references remain despite retrieval, requiring manual checks.

Core Entities

Models

  • GPT-3.5 (gpt-3.5-turbo-0613)
  • text-embedding-ada-002

Metrics

  • reference factuality percentages
  • Accuracy
  • selection percent from top-10 retrieved
  • average retrieval rank

Datasets

  • Ophthalmology-specific corpus (~70,000 documents: 66,269 PubMed abstracts, 24 AAO pages, 1,494 EyeWi
  • 100 consumer health questions sampled from AAO Ask An Ophthalmologist (20 per major topic)