Curate systematic reviews + guidelines to make RAG answers more trustworthy for Long COVID

Overview

Decision Snapshot: Needs Validation

The demo shows practical gains from corpus curation on a 20-question clinician set using concrete retrieval steps. The strength of the evidence is limited by the single-domain focus, the single judge model, and the small sample size.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 60%

Novelty: 40%

Authors

Philip DiGiacomo, Haoyang Wang, Jinrui Fang, Yan Leng, W Michael Brode, Ying Ding

Links

Abstract / PDF

Why It Matters For Business

Curation strategy matters. A small, high-quality, curated retrieval corpus can deliver more trustworthy and clinically useful chatbot answers than raw, large-scale literature. This reduces risk (misleading recommendations) and improves user trust, which is critical for clinical product adoption.

Who Should Care

Product Manager, ML Engineer, CTO

Summary TLDR

For Long COVID clinical questions, a small, expert-curated corpus made of a consensus guideline plus three high-quality systematic reviews (GS-4) produced more faithful and more comprehensive answers than either the guideline alone or large unfiltered literature (PubMed or the guideline's full reference list). The paper presents Guide-RAG, a demo RAG implementation, together with an LLM-as-a-judge evaluation pipeline run on a 20-question clinician dataset (LongCOVID-CQ).

Problem Statement

Large literature sources can overwhelm or mislead clinical chatbots for rapidly evolving, complex diseases. Choosing what to retrieve (corpus curation) strongly affects faithfulness, relevance, and completeness of answers for clinical decision support.

Main Contribution

Guide-RAG: a demo RAG system that tests six corpus curation strategies for Long COVID clinical questions.

Empirical finding that a compact corpus combining an expert guideline with three high-quality systematic reviews (GS-4) yields better overall, faithful, and comprehensive answers than larger, unfiltered corpora on the evaluated questions.

Key Findings

A small curated corpus (guideline + 3 systematic reviews, GS-4) outperformed other corpora on overall quality.

Numbers: Pairwise win rates 57.5–65% for GS-4 over other configs (overall).

Practical Use: Prefer a targeted corpus of guideline + recent systematic reviews for clinical RAG on emerging diseases to improve trustworthy, actionable answers.

Evidence Ref: Results section; Figure 2

GS-4 ranked highest specifically on faithfulness and comprehensiveness.

Numbers: GS-4 won ~60% of comprehensiveness comparisons against PubMed (PM) and R-110, and ranked top in faithfulness (win rates shown in Figure 2).

Practical Use: If you need answers that cite and cover multi-system topics reliably, ground retrieval in curated secondary reviews plus guidelines rather than raw primary literature.

Evidence Ref: Results section; Figure 2; appendix examples

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall win rate (GS-4 vs others)	57.5–65% wins in pairwise comparisons	Other corpus configurations (G-1, R-110, PM, WS, NR-0)	—	LongCOVID-CQ (20 questions)	GS-4 achieved 57.5–65% pairwise win rates (overall).	Results section; Figure 2
Comprehensiveness (GS-4 vs PM and R-110)	~60% win rate vs PM and R-110	PM and R-110	—	LongCOVID-CQ	GS-4 showed 60% win rate in comprehensiveness comparisons vs larger corpora.	Results section; Figure 2
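To put the small sample size in perspective, the sketch below computes a Wilson score interval for a win rate. It assumes, purely for illustration, that a 60% win rate comes from 12 wins in 20 independent comparisons; the paper's actual number of comparisons per win rate may differ, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(wins: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win proportion (z = 1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumption for illustration: 12 wins out of 20 comparisons (a 60% win rate).
low, high = wilson_interval(12, 20)
print(f"60% win rate over 20 comparisons: 95% CI = ({low:.2f}, {high:.2f})")
```

Under this assumption the interval includes 0.5, which is one way to see why the results are labeled as needing validation on a larger question set.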

What To Try In 7 Days

Assemble a compact retrieval corpus: start with a current guideline plus 2–4 recent, high-quality systematic reviews for your domain.

Implement dense retrieval with OpenAI embeddings + FAISS and retrieve top-25 chunks per query to reproduce the paper's setup.

Build a short, expert-written question set (10–30 items) mirroring real user needs and run pairwise LLM-as-a-judge comparisons for faithfulness, relevance, and comprehensiveness.
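The retrieval step above can be sketched as follows. This is a minimal, offline illustration of top-k dense retrieval, not the paper's code: `embed` is a hypothetical stand-in (hashed bag of words) for a real embedding model, the brute-force dot product stands in for a vector index such as FAISS, and the chunk texts are made-up placeholders.

```python
import hashlib

import numpy as np

DIM = 4096  # hypothetical vector size for the stand-in embedder


def embed(text: str) -> np.ndarray:
    """Stand-in embedder: hashed bag of words, L2-normalized.

    Assumption: a real setup would call an embedding model here instead.
    """
    vec = np.zeros(DIM, dtype=np.float32)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_index(chunks: list[str]) -> np.ndarray:
    """Stack chunk embeddings into a (num_chunks, DIM) matrix."""
    return np.stack([embed(c) for c in chunks])


def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 25):
    """Return up to k (chunk, score) pairs ranked by cosine similarity."""
    scores = index @ embed(query)  # vectors are unit-length, so this is cosine
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]


# Placeholder chunks, not real guideline or review text.
chunks = [
    "guideline section on fatigue management",
    "systematic review on sleep problems",
    "guideline section on cardiac symptoms",
]
index = build_index(chunks)
results = retrieve("fatigue management", chunks, index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

With a corpus as small as a guideline plus a few reviews, exact brute-force search like this is usually fast enough; a dedicated index matters more for the large PubMed-style corpora.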

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Single clinical domain (Long COVID); generalizability to other diseases is unknown.

Evaluation used a single LLM judge (GPT-4o) without human rater validation or multi-model consensus.

When Not To Use

Do not rely on GS-4-like small corpora when up-to-the-minute primary research is required (e.g., new trial results published after the curated reviews).

Avoid using LLM-as-a-judge scores as sole validation for high-stakes clinical deployment without human expert review.

Failure Modes

With large unfiltered corpora (PubMed), answers can be overconfident in single studies and recommend interventions that experts caution against.

Missing recent evidence if curated reviews or guideline are out-of-date.

Core Entities

Models

GPT-4o, OpenAI text-embedding-3-small

Metrics

Win rate (pairwise, averaged), Faithfulness, Relevance, Comprehensiveness, Overall (equal-weight aggregate)
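One way these metrics could be computed from pairwise judge verdicts is sketched below. The record layout, the `win_rates` helper, and the choice to count a tie as half a win are all assumptions made for illustration; the source does not describe how ties are handled, and the two example judgments are invented.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ("faithfulness", "relevance", "comprehensiveness")


def win_rates(judgments: list[dict], system: str) -> dict[str, float]:
    """Per-criterion win rate for `system`, plus an equal-weight overall score.

    Assumption: a "tie" verdict counts as half a win.
    """
    outcomes = defaultdict(list)
    for j in judgments:
        if system not in (j["a"], j["b"]):
            continue  # this comparison does not involve the system
        for criterion in CRITERIA:
            winner = j["winner"][criterion]
            if winner == "tie":
                outcomes[criterion].append(0.5)
            else:
                outcomes[criterion].append(1.0 if winner == system else 0.0)
    if not outcomes:
        raise ValueError(f"no judgments involve {system!r}")
    rates = {c: mean(outcomes[c]) for c in CRITERIA}
    rates["overall"] = mean(rates[c] for c in CRITERIA)  # equal-weight aggregate
    return rates


# Invented verdicts for two questions, comparing GS-4 against PM.
judgments = [
    {"a": "GS-4", "b": "PM",
     "winner": {"faithfulness": "GS-4", "relevance": "PM",
                "comprehensiveness": "GS-4"}},
    {"a": "GS-4", "b": "PM",
     "winner": {"faithfulness": "GS-4", "relevance": "tie",
                "comprehensiveness": "PM"}},
]
rates = win_rates(judgments, "GS-4")
print(rates)
```

Averaging the three criteria with equal weight keeps the overall score easy to interpret, at the cost of treating faithfulness and relevance as equally important, which may not suit every clinical use case.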

Datasets

LongCOVID-CQ (20 expert clinician questions)

Benchmarks

Pairwise LLM-as-a-judge comparisons (faithfulness, relevance, comprehensiveness)

