Overview
The study gives concrete evidence that domain-specific RAG reduces fabricated citations and improves traceability, but the approach requires tuning (retrieval, prompting, and synthesis) to avoid harming answer quality.
Citations: 6
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 6/7
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
RAG with a focused domain corpus can substantially reduce fabricated citations, improving traceability for consumer health features, but it may require extra tuning to avoid small drops in perceived answer quality.
Summary TLDR
The authors built an ophthalmology-focused Retrieval-Augmented Generation (RAG) pipeline using ~70k domain documents and evaluated it on 100 real consumer health questions. RAG reduced hallucinated references from 45.3% to 18.8% and raised correct references from 20.6% to 54.5%. However, RAG caused a small drop in manual accuracy (3.52→3.23) and completeness (3.47→3.27) while improving evidence attribution (1.86→2.48). Key issues: LLMs sometimes ignore top retrieved documents, select irrelevant items, or fail to synthesize retrieved evidence.
Problem Statement
LLMs can generate fluent but unsupported or fabricated medical citations. RAG (fetching domain documents at inference) promises to ground answers, but its real-world effects on long-form medical answers, evidence selection, and attribution are under-studied in domain-specific settings like ophthalmology.
Main Contribution
Curated an ophthalmology corpus (~70,000 documents) combining journal abstracts, AAO practice patterns, and EyeWiki.
Built a RAG pipeline using 1024-token snippets, text-embedding-ada-002 embeddings, and dense cosine retrieval to augment GPT-3.5.
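The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: whitespace-separated words stand in for model tokens, the tiny 3-dimensional vectors stand in for text-embedding-ada-002 embeddings (no API call is made), and the names `chunk_tokens`, `cosine`, and `top_k` are hypothetical.

```python
import math

def chunk_tokens(text, max_tokens=1024):
    """Split text into snippets of at most max_tokens whitespace-separated words.

    Assumption: words approximate tokens; a real pipeline would count
    tokens with the embedding model's tokenizer.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, snippet_vecs, k=10):
    """Return (snippet_index, score) pairs for the k most similar snippets, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(snippet_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for real embeddings of three snippets and one query.
snippet_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]
ranked = top_k(query_vec, snippet_vecs, k=2)
```

In a full pipeline the text of the top-ranked snippets would be placed in the GPT-3.5 prompt; that step is omitted here because the prompt format is not given in this summary.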
Key Findings
RAG raised the share of correct references in LLM outputs from 20.6% to 54.5%.
RAG reduced hallucinated references from 45.3% to 18.8% but did not eliminate them.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Total references (non-RAG vs RAG) | 252 vs 277 references | LLM without RAG | RAG +25 refs | 100 questions (all topics) | Results section counts | — |
| Reference correctness (fraction of correct refs) | 20.6% → 54.5% | LLM without RAG | +33.9 pp | 100 questions | Fig 2A and Results | — |
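The deltas in the table are absolute differences in percentage points (pp), which are easy to confuse with relative change. A small sketch of both calculations, using only figures reported in this summary; the helper names `pp_delta` and `relative_change` are illustrative.

```python
def pp_delta(before_pct, after_pct):
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)

def relative_change(before_pct, after_pct):
    """Change relative to the starting value, in percent, rounded to one decimal."""
    return round(100.0 * (after_pct - before_pct) / before_pct, 1)

# Correct references: 20.6% without RAG, 54.5% with RAG.
correct_pp = pp_delta(20.6, 54.5)            # matches the +33.9 pp in the table

# Hallucinated references: 45.3% without RAG, 18.8% with RAG.
halluc_pp = pp_delta(45.3, 18.8)
halluc_rel = relative_change(45.3, 18.8)

# Total references: 252 without RAG, 277 with RAG.
extra_refs = 277 - 252
```

Reporting both forms side by side avoids reading "+33.9 pp" as a 33.9% relative improvement.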
What To Try In 7 Days
Assemble a small domain corpus (guidelines + key abstracts + wiki pages).
Index snippets and generate embeddings with text-embedding-ada-002.
Run a basic RAG loop (top-10 retrieval + GPT-3.5) on 20 representative questions and log citations used by the LLM versus retrieved ranks.
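The logging step in the last item could look something like the sketch below. It assumes each retrieved document has a unique ID and that the citations in the LLM answer have already been mapped to those IDs (that mapping is the hard part and is not shown). The function names and document IDs are hypothetical.

```python
import csv
import io

FIELDS = ["question_id", "cited_id", "retrieved_rank"]

def citation_rank_rows(question_id, retrieved_ids, cited_ids):
    """One row per cited document: its 1-based retrieval rank, or None if it was not retrieved."""
    rank_of = {doc_id: rank for rank, doc_id in enumerate(retrieved_ids, start=1)}
    return [
        {"question_id": question_id, "cited_id": cited, "retrieved_rank": rank_of.get(cited)}
        for cited in cited_ids
    ]

def write_log(rows, handle):
    """Write the rows as CSV; a None rank becomes an empty cell."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical IDs: the model cited the rank-2 document and one that was never retrieved.
retrieved = ["doc_17", "doc_03", "doc_42"]
cited = ["doc_03", "doc_99"]
rows = citation_rank_rows("q1", retrieved, cited)

buffer = io.StringIO()   # swap for open("citations.csv", "w", newline="") to write a file
write_log(rows, buffer)
```

An empty `retrieved_rank` marks a citation that did not come from the retrieved set and is worth checking by hand.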
Risks & Boundaries
Limitations
Only GPT-3.5 was evaluated; other LLMs may behave differently.
Default RAG settings were used; other retrieval models or embeddings might change outcomes.
When Not To Use
High-stakes clinical decision-making without clinician verification.
Tasks requiring guaranteed complete synthesis from all top evidence without further tuning.
Failure Modes
LLM ignores top-ranked retrieved documents, leaving hallucinated citations.
Irrelevant retrieved documents are incorporated and reduce accuracy/completeness.
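A rough way to surface candidates for both failure modes is to compare what was retrieved with what was cited. The sketch below is a heuristic under stated assumptions, not a method from the study: `top_n=3` is an arbitrary cutoff, a low retrieval rank is only a proxy for irrelevance, and a citation outside the retrieved set is a candidate hallucination that still needs manual verification.

```python
def flag_failure_modes(retrieved_ids, cited_ids, top_n=3):
    """Compare ranked retrieved IDs with cited IDs and list suspicious cases."""
    top = retrieved_ids[:top_n]
    cited = set(cited_ids)
    retrieved = set(retrieved_ids)
    return {
        # Top-ranked documents the answer never cited.
        "ignored_top": [d for d in top if d not in cited],
        # Citations that did not come from retrieval at all.
        "unretrieved_citations": [c for c in cited_ids if c not in retrieved],
        # Citations drawn from below the top_n cutoff.
        "low_rank_citations": [c for c in cited_ids if c in retrieved and c not in top],
    }

# Hypothetical example: four retrieved documents, three citations.
flags = flag_failure_modes(["a", "b", "c", "d"], ["b", "d", "x"], top_n=2)
```

Here `flags` lists `"a"` as an ignored top document, `"x"` as an unretrieved citation, and `"d"` as a low-rank citation; each flagged item is a prompt for human review rather than a verdict.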

