Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
RAG with a focused domain corpus can substantially reduce fabricated citations, improving traceability for consumer health features, but it may require extra tuning to avoid small drops in perceived answer quality.
Summary TLDR
The authors built an ophthalmology-focused Retrieval-Augmented Generation (RAG) pipeline using ~70k domain documents and evaluated it on 100 real consumer health questions. RAG reduced hallucinated references from 45.3% to 18.8% and raised correct references from 20.6% to 54.5%. However, RAG caused a small drop in manual accuracy (3.52→3.23) and completeness (3.47→3.27) while improving evidence attribution (1.86→2.48). Key issues: LLMs sometimes ignore top retrieved documents, select irrelevant items, or fail to synthesize retrieved evidence.
Problem Statement
LLMs can generate fluent but unsupported or fabricated medical citations. RAG (fetching domain documents at inference) promises to ground answers, but its real-world effects on long-form medical answers, evidence selection, and attribution are under-studied in domain-specific settings like ophthalmology.
Main Contribution
Curated an ophthalmology corpus (~70,000 documents) combining journal abstracts, AAO practice patterns, and EyeWiki.
Built a RAG pipeline using 1024-token snippets, text-embedding-ada-002 embeddings, and dense cosine retrieval to augment GPT-3.5.
Systematically evaluated 100 consumer questions and >500 generated references with 10 healthcare professionals on factuality, selection/ranking, attribution, accuracy, and completeness; released code and materials.
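The retrieval step described above (fixed-size snippets, embeddings, cosine ranking, then prompt augmentation) can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_embed` is a hypothetical bag-of-words stand-in for text-embedding-ada-002, whitespace-split words stand in for real tokens, and the prompt wording in `build_prompt` is an assumption.

```python
import math
from collections import Counter


def chunk_tokens(tokens, size=1024):
    """Split a token list into consecutive snippets of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def toy_embed(text):
    """Hypothetical placeholder embedding (word counts), NOT text-embedding-ada-002."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, snippets, k=10, embed=toy_embed):
    """Return up to k (snippet_index, score) pairs, highest cosine score first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(s)), i) for i, s in enumerate(snippets)]
    # Sort by descending score; break ties by original index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


def build_prompt(question, snippets, ranked):
    """Prepend the ranked snippets, numbered by retrieval rank, to the question."""
    context = "\n\n".join(
        f"[{rank}] {snippets[i]}" for rank, (i, _) in enumerate(ranked, start=1)
    )
    return (
        "Answer using the numbered sources and cite them.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

In a real pipeline the `embed` argument would wrap an embedding model, snippet vectors would be computed once and stored in an index rather than re-embedded per query, and the prompt would be sent to the LLM.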
Key Findings
RAG raised the share of correct references in LLM outputs from 20.6% to 54.5%.
RAG reduced hallucinated references from 45.3% to 18.8% but did not eliminate them.
The LLM used only a portion of the top-ranked documents that retrieval returned.
RAG improved evidence attribution (1.86 to 2.48) but slightly lowered human-rated answer accuracy (3.52 to 3.23) and completeness (3.47 to 3.27).
Using domain-specific resources at scale is feasible and useful for RAG.
Results
Total references (non-RAG vs RAG)
Reference correctness (fraction of correct refs)
Hallucinated references
Selection of top-10 retrieved docs
Human-rated accuracy
Human-rated completeness (1-5)
Evidence attribution (1-5)
Who Should Care
What To Try In 7 Days
Assemble a small domain corpus (guidelines + key abstracts + wiki pages).
Index snippets and generate embeddings with text-embedding-ada-002.
Run a basic RAG loop (top-10 retrieval + GPT-3.5) on 20 representative questions and log which documents the LLM cites alongside each document's retrieval rank.
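The logging part of the last step could look roughly like this. It is a sketch under stated assumptions: the prompt is assumed to instruct the model to cite sources as bracketed document IDs such as `[doc-12]`, and the names `CITATION_PATTERN` and `log_citations` are hypothetical.

```python
import re

# Assumed citation format: bracketed document IDs such as [doc-12].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def log_citations(question_id, answer_text, retrieved_ids):
    """Return one row per distinct cited ID with its retrieval rank.

    `retrieved_ids` is the retrieved list in rank order (best first).
    A rank of None means the cited ID was not among the retrieved documents,
    which flags it for a manual hallucination check.
    """
    rank_of = {}
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        rank_of.setdefault(doc_id, rank)  # keep the best rank if an ID repeats

    rows = []
    seen = set()
    for doc_id in CITATION_PATTERN.findall(answer_text):
        if doc_id in seen:
            continue
        seen.add(doc_id)
        rows.append(
            {
                "question_id": question_id,
                "cited_id": doc_id,
                "retrieved_rank": rank_of.get(doc_id),
            }
        )
    return rows
```

Rows with `retrieved_rank` set to `None` are the ones worth reviewing first, since they point to citations that did not come from the retrieved context.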
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only GPT-3.5 was evaluated; other LLMs may behave differently.
- Default RAG settings were used; other retrieval models or embeddings might change outcomes.
- Corpus confined to ophthalmology; general medical or cross-domain corpora may yield different results.
When Not To Use
- High-stakes clinical decision-making without clinician verification.
- Tasks requiring guaranteed complete synthesis from all top evidence without further tuning.
Failure Modes
- LLM ignores top-ranked retrieved documents, leaving hallucinated citations.
- Irrelevant retrieved documents are incorporated and reduce accuracy/completeness.
- Residual hallucinated references remain despite retrieval, requiring manual checks.
Core Entities
Models
- GPT-3.5 (gpt-3.5-turbo-0613)
- text-embedding-ada-002
Metrics
- reference factuality percentages
- Accuracy
- selection percent from top-10 retrieved
- average retrieval rank
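One plausible way to compute the two retrieval-side metrics for a single question is sketched below. The definitions are assumptions, since this summary does not spell them out: selection percent is taken as the share of the top-k retrieved documents that the LLM cited, and average retrieval rank as the mean rank of those cited documents. The function name `selection_stats` is hypothetical.

```python
def selection_stats(retrieved_ids, cited_ids, k=10):
    """Compute (selection_percent, average_rank) for one question.

    retrieved_ids: document IDs in retrieval order, best first.
    cited_ids: document IDs the LLM cited in its answer.
    Assumed definitions, not taken from the paper:
      selection_percent = 100 * (cited docs within top-k) / (size of top-k)
      average_rank      = mean 1-based rank of the cited top-k docs, or None
    """
    top_k = retrieved_ids[:k]
    cited = set(cited_ids)
    selected_ranks = [
        rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in cited
    ]
    selection_percent = 100.0 * len(selected_ranks) / len(top_k) if top_k else 0.0
    average_rank = (
        sum(selected_ranks) / len(selected_ranks) if selected_ranks else None
    )
    return selection_percent, average_rank
```

Averaging these per-question values across the evaluation set gives corpus-level figures; cited IDs that fall outside the top-k are ignored here and would be counted separately when checking reference factuality.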
Datasets
- Ophthalmology-specific corpus (~70,000 documents: 66,269 PubMed abstracts, 24 AAO pages, 1,494 EyeWi
- 100 consumer health questions sampled from AAO Ask An Ophthalmologist (20 per major topic)

