Domain-specific RAG cuts hallucinated citations in ophthalmology long-form answers

September 20, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The study gives concrete evidence that domain-specific RAG reduces fabricated citations and improves traceability, but the approach requires tuning (retrieval, prompting, and synthesis) to avoid harming answer quality.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D. L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

RAG with a focused domain corpus can substantially reduce fabricated citations, improving traceability for consumer health features, but it may require extra tuning to avoid small drops in perceived answer quality.

Who Should Care

Summary TLDR

The authors built an ophthalmology-focused Retrieval-Augmented Generation (RAG) pipeline using ~70k domain documents and evaluated it on 100 real consumer health questions. RAG reduced hallucinated references from 45.3% to 18.8% and raised correct references from 20.6% to 54.5%. However, RAG caused a small drop in manual accuracy (3.52→3.23) and completeness (3.47→3.27) while improving evidence attribution (1.86→2.48). Key issues: LLMs sometimes ignore top retrieved documents, select irrelevant items, or fail to synthesize retrieved evidence.

Problem Statement

LLMs can generate fluent but unsupported or fabricated medical citations. RAG (fetching domain documents at inference) promises to ground answers, but its real-world effects on long-form medical answers, evidence selection, and attribution are under-studied in domain-specific settings like ophthalmology.

Main Contribution

Curated an ophthalmology corpus (~70,000 documents) combining journal abstracts, AAO practice patterns, and EyeWiki.

Built a RAG pipeline using 1024-token snippets, text-embedding-ada-002 embeddings, and dense cosine retrieval to augment GPT-3.5.
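The chunking and dense cosine retrieval steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: whitespace splitting stands in for a real tokenizer, the three-dimensional vectors stand in for text-embedding-ada-002 embeddings, and the names `chunk_tokens`, `cosine`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def chunk_tokens(text, max_tokens=1024):
    """Split text into consecutive snippets of at most max_tokens tokens.

    Assumption: whitespace-separated words approximate tokens here;
    a real pipeline would count tokens with the model's tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=10):
    """Rank (snippet_id, vector) pairs by cosine similarity, best first."""
    scored = [(sid, cosine(query_vec, vec)) for sid, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): real embeddings would come from the embedding model.
toy_index = [
    ("snippet_a", [0.9, 0.1, 0.0]),
    ("snippet_b", [0.1, 0.9, 0.0]),
    ("snippet_c", [0.0, 0.2, 0.9]),
]
hits = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print([sid for sid, _ in hits])  # ['snippet_a', 'snippet_b']
```

At ~70,000 documents a brute-force scan like this is still feasible, though a vector index would likely be preferable in production.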

Key Findings

RAG greatly increased the share of correct references in LLM outputs.

Numbers: Correct refs: 20.6% → 54.5% (252 vs 277 total refs)

Practical Use: Use a domain corpus and RAG to cut fabricated citations by half or more when publishing LLM-supported medical answers.

Evidence Ref: Results section; Fig 2A; counts in text

RAG reduced hallucinated references but did not eliminate them.

Numbers: Hallucinated refs: 45.3% → 18.8%

Practical Use: Expect residual hallucinations: always validate citations manually for clinical or consumer health use.

Evidence Ref: Results section; Fig 2A

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence Ref |
| --- | --- | --- | --- | --- | --- |
| Total references (non-RAG vs RAG) | 252 vs 277 references | LLM without RAG | RAG +25 refs | 100 questions (all topics) | Results section counts |
| Reference correctness (fraction of correct refs) | 20.6% → 54.5% | LLM without RAG | +33.9 pp | 100 questions | Fig 2A and Results |
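The deltas can be rechecked from the reported percentages. The sketch below computes the percentage-point change and the relative change for the two reference metrics quoted in this summary; the helper names `pp_delta` and `relative_change` are illustrative, and only the input percentages come from the source.

```python
def pp_delta(before_pct, after_pct):
    """Change in percentage points (after minus before), rounded to 1 decimal."""
    return round(after_pct - before_pct, 1)


def relative_change(before_pct, after_pct):
    """Change relative to the baseline, in percent, rounded to 1 decimal."""
    return round(100.0 * (after_pct - before_pct) / before_pct, 1)


# Percentages as reported: (LLM without RAG, LLM with RAG).
correct_refs = (20.6, 54.5)
hallucinated_refs = (45.3, 18.8)

print(pp_delta(*correct_refs))             # 33.9
print(pp_delta(*hallucinated_refs))        # -26.5
print(relative_change(*hallucinated_refs)) # -58.5
```

The relative drop in hallucinated references is what supports the "by half or more" phrasing; the percentage-point figure is the one the table reports.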

What To Try In 7 Days

Assemble a small domain corpus (guidelines + key abstracts + wiki pages).

Index snippets and generate embeddings with text-embedding-ada-002.

Run a basic RAG loop (top-10 retrieval + GPT-3.5) on 20 representative questions and log citations used by the LLM versus retrieved ranks.
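One way to structure the logging step in the last item is sketched below. It assumes the prompt numbers each retrieved snippet so the model can cite it as `[n]`; the retriever and generator are stub functions (`fake_retrieve`, `fake_generate`) standing in for real embedding search and a GPT-3.5 call, and every name and the prompt wording are hypothetical rather than taken from the paper.

```python
import re


def build_prompt(question, retrieved):
    """Number each retrieved snippet so the answer can cite it as [n]."""
    lines = ["Answer using only the numbered sources and cite them as [n].", ""]
    for rank, (_doc_id, text) in enumerate(retrieved, start=1):
        lines.append(f"[{rank}] {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def cited_ranks(answer, n_retrieved):
    """Sorted retrieval ranks cited in the answer; out-of-range markers are dropped."""
    found = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(r for r in found if 1 <= r <= n_retrieved)


def run_question(question, retrieve, generate, k=10):
    """Run one retrieve-then-generate pass and return a log record."""
    retrieved = retrieve(question, k)
    answer = generate(build_prompt(question, retrieved))
    ranks = cited_ranks(answer, len(retrieved))
    return {
        "question": question,
        "n_retrieved": len(retrieved),
        "retrieved_ids": [doc_id for doc_id, _ in retrieved],
        "cited_ranks": ranks,
        "cited_ids": [retrieved[r - 1][0] for r in ranks],
        "answer": answer,
    }


# Stubs (assumptions): swap in real retrieval and a real LLM call.
def fake_retrieve(question, k):
    docs = [
        ("doc_a", "Placeholder snippet A."),
        ("doc_b", "Placeholder snippet B."),
        ("doc_c", "Placeholder snippet C."),
    ]
    return docs[:k]


def fake_generate(prompt):
    return "First claim [1]. Second claim [3]. Unsupported marker [12]."


record = run_question("Example question?", fake_retrieve, fake_generate, k=10)
print(record["cited_ranks"], record["cited_ids"])  # [1, 3] ['doc_a', 'doc_c']
```

Markers that fall outside the retrieved range, like `[12]` here, are dropped from the rank log; in practice they are worth recording separately, since they may point at fabricated citations.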

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only GPT-3.5 was evaluated; other LLMs may behave differently.

Default RAG settings were used; other retrieval models or embeddings might change outcomes.

When Not To Use

High-stakes clinical decision-making without clinician verification.

Tasks requiring guaranteed complete synthesis from all top evidence without further tuning.

Failure Modes

LLM ignores top-ranked retrieved documents, leaving hallucinated citations.

Irrelevant retrieved documents are incorporated and reduce accuracy/completeness.
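Both failure modes can be flagged automatically once each answer's cited retrieval ranks are logged. The sketch below is an assumption-laden illustration: the `top_n=3` threshold for "top-ranked", the reviewer-supplied `irrelevant_ranks`, and the flag names are all hypothetical choices, not definitions from the study.

```python
def flag_failure_modes(cited_ranks, irrelevant_ranks=(), top_n=3):
    """Return a list of flags for one logged answer.

    cited_ranks: retrieval ranks (1 = best) the answer actually cited.
    irrelevant_ranks: ranks a human reviewer marked as off-topic (assumed input).
    top_n: how many leading ranks count as "top-ranked" (assumed threshold).
    """
    cited = set(cited_ranks)
    flags = []
    if not cited:
        # Nothing retrieved was cited: any references in the answer are unverified.
        flags.append("no_retrieved_source_cited")
    elif cited.isdisjoint(range(1, top_n + 1)):
        flags.append("ignored_top_ranked")
    if cited & set(irrelevant_ranks):
        flags.append("cited_irrelevant")
    return flags


print(flag_failure_modes([7, 9]))       # ['ignored_top_ranked']
print(flag_failure_modes([1, 4], [4]))  # ['cited_irrelevant']
print(flag_failure_modes([]))           # ['no_retrieved_source_cited']
```

Flags like these only triage answers for review; they do not replace manual citation checks.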

Core Entities

Models

GPT-3.5 (gpt-3.5-turbo-0613)

text-embedding-ada-002

Metrics

reference factuality percentages

Accuracy

selection percent from top-10 retrieved

average retrieval rank
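The two retrieval-usage metrics can be computed from per-question logs. The definitions below are one plausible reading and are assumptions: selection percent is taken as the share of retrieved top-k slots that the answer cited, and average retrieval rank as the mean rank of the cited snippets. The paper may define them differently, and the record layout (`n_retrieved`, `cited_ranks`) is hypothetical.

```python
def selection_percent(records, k=10):
    """Percent of retrieved top-k slots that answers actually cited (assumed definition)."""
    total_slots = sum(min(k, r["n_retrieved"]) for r in records)
    total_cited = sum(
        len([rank for rank in r["cited_ranks"] if rank <= k]) for r in records
    )
    return 100.0 * total_cited / total_slots if total_slots else 0.0


def average_cited_rank(records):
    """Mean retrieval rank (1 = best) over all cited snippets, or None if none cited."""
    ranks = [rank for r in records for rank in r["cited_ranks"]]
    return sum(ranks) / len(ranks) if ranks else None


# Toy log (assumption): two questions, ten snippets retrieved for each.
toy_records = [
    {"n_retrieved": 10, "cited_ranks": [1, 3]},
    {"n_retrieved": 10, "cited_ranks": [7]},
]
print(selection_percent(toy_records))              # 15.0
print(round(average_cited_rank(toy_records), 2))   # 3.67
```

A low selection percent combined with a high average cited rank would suggest the model is skipping the best-ranked evidence.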

Datasets

Ophthalmology-specific corpus (~70,000 documents: 66,269 PubMed abstracts, 24 AAO pages, 1,494 EyeWiki…)

100 consumer health questions sampled from AAO Ask An Ophthalmologist (20 per major topic)