Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
Curation strategy matters. A small, high-quality, curated retrieval corpus can deliver more trustworthy and clinically useful chatbot answers than raw, large-scale literature. This reduces the risk of misleading recommendations and improves user trust, which is critical for clinical product adoption.
Summary TLDR
For Long COVID clinical questions, a small, expert-curated corpus made of a consensus guideline plus a few high-quality systematic reviews (GS-4) produced more faithful and more comprehensive answers than either the guideline alone or large unfiltered literature (PubMed or the guideline's full reference list). The paper presents Guide-RAG, a RAG implementation, together with an LLM-as-a-judge evaluation pipeline and a 20-question clinician dataset (LongCOVID-CQ).
Problem Statement
Large literature sources can overwhelm or mislead clinical chatbots for rapidly evolving, complex diseases. Choosing what to retrieve (corpus curation) strongly affects faithfulness, relevance, and completeness of answers for clinical decision support.
Main Contribution
Guide-RAG: a demo RAG system that tests six corpus curation strategies for Long COVID clinical questions.
Empirical finding that a compact corpus combining an expert guideline with three high-quality systematic reviews (GS-4) yields answers that are better overall, more faithful, and more comprehensive than those from larger, unfiltered corpora on the evaluated questions.
LongCOVID-CQ: a 20-question, expert-written dataset targeting clinician information needs (diagnosis, management, mechanisms) and an LLM-as-a-judge evaluation protocol for faithfulness, relevance, and comprehensiveness.
Key Findings
A small curated corpus (guideline + 3 systematic reviews, GS-4) outperformed other corpora on overall quality.
GS-4 ranked highest specifically on faithfulness and comprehensiveness.
PubMed-style large literature scored slightly higher on relevance but lower on faithfulness and comprehensiveness.
Results
Overall win rate (GS-4 vs others)
Comprehensiveness (GS-4 vs PM and R-110)
Relevance (PM performance)
Guideline vs guideline references (G-1 vs R-110)
Who Should Care
What To Try In 7 Days
Assemble a compact retrieval corpus: start with a current guideline plus 2–4 recent, high-quality systematic reviews for your domain.
Implement dense retrieval with OpenAI embeddings + FAISS and retrieve top-25 chunks per query to reproduce the paper's setup.
Build a short, expert-written question set (10–30 items) mirroring real user needs and run pairwise LLM-as-a-judge comparisons for faithfulness, relevance, and comprehensiveness.
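The retrieval step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the summary names OpenAI embeddings and FAISS, whereas this sketch swaps in a hypothetical bag-of-words `embed` function and a brute-force cosine search so that it runs with the standard library alone. The function names, the toy corpus, and the `TOP_K` constant are assumptions made for the example; only the top-25 setting comes from the summary.

```python
import math
import re
from collections import Counter

TOP_K = 25  # the summary reports retrieving the top-25 chunks per query


def embed(text: str) -> Counter:
    """Hypothetical stand-in embedding: a bag-of-words term-count vector.

    A real setup would call an embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = TOP_K) -> list[tuple[float, str]]:
    """Return up to k chunks most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy corpus (invented text, for illustration only).
corpus = [
    "Guideline section on fatigue assessment and pacing strategies.",
    "Systematic review of cognitive symptoms and rehabilitation.",
    "Guideline section on autonomic dysfunction screening.",
]
top = retrieve("How should fatigue be assessed?", corpus, k=2)
```

Replacing `embed` with a call to a real embedding model and the brute-force loop with a vector index would bring the sketch closer to the setup the summary describes; the ranking logic stays the same.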
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Single clinical domain (Long COVID); generalizability to other diseases is unknown.
- Evaluation used a single LLM judge (GPT-4o) without human rater validation or multi-model consensus.
- Small test set: LongCOVID-CQ has 20 expert questions, limiting statistical power.
- Different retrieval pipelines across corpora (dense for small corpora, hybrid for PubMed) may bias comparisons.
- Preprocessing exclusions (2 guideline references omitted) reduce exact reproducibility of R-110 baseline.
When Not To Use
- Do not rely on GS-4-like small corpora when up-to-the-minute primary research is required (e.g., new trial results published after the curated reviews).
- Avoid using LLM-as-a-judge scores as sole validation for high-stakes clinical deployment without human expert review.
Failure Modes
- Overconfidence from single studies when using large unfiltered corpora (PubMed), which can lead the system to recommend interventions that experts caution against.
- Missing recent evidence if the curated reviews or guideline are out of date.
- Judge-model bias: GPT-4o evaluations may reflect model preferences and produce optimistic win rates for some configurations.
Core Entities
Models
- GPT-4o
- OpenAI text-embedding-3-small
Metrics
- Win rate (pairwise, averaged)
- Faithfulness
- Relevance
- Comprehensiveness
- Overall (equal-weight aggregate)
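One plausible way to turn pairwise judgments into the metrics listed above is sketched below. The summary does not give the exact aggregation, so the details here are assumptions: a tie counts as half a win for each side, each per-metric win rate is the mean over all comparisons a configuration took part in, and the overall score is the unweighted mean of the three per-metric win rates. The configuration labels "A" and "B" and the judgments are invented placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean

METRICS = ("faithfulness", "relevance", "comprehensiveness")


def win_rates(judgments):
    """Compute per-metric and overall win rates for each configuration.

    Each judgment is (config_a, config_b, metric, winner), where winner is
    config_a, config_b, or None for a tie. Counting a tie as half a win for
    each side is an assumption of this sketch.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for config_a, config_b, metric, winner in judgments:
        for config in (config_a, config_b):
            if winner is None:
                scores[config][metric].append(0.5)
            else:
                scores[config][metric].append(1.0 if winner == config else 0.0)

    table = {}
    for config, per_metric in scores.items():
        # Skip metrics with no judgments so the mean is always defined.
        row = {m: mean(per_metric[m]) for m in METRICS if per_metric[m]}
        row["overall"] = mean(list(row.values()))  # equal weight per metric
        table[config] = row
    return table


# Placeholder judgments for two hypothetical configurations.
judgments = [
    ("A", "B", "faithfulness", "A"),
    ("A", "B", "relevance", "B"),
    ("A", "B", "comprehensiveness", "A"),
]
table = win_rates(judgments)
```

With real data, each (question, pair, metric) judgment from the LLM judge would contribute one tuple to `judgments`.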
Datasets
- LongCOVID-CQ (20 expert clinician questions)
Benchmarks
- Pairwise LLM-as-a-judge comparisons (faithfulness, relevance, comprehensiveness)
Context Entities
Models
- GPT-4o with web search (WS, the web baseline)
Datasets
- PubMed corpus (PM)
- AAPM&R guideline (G-1)
- Guideline references (R-110)

