Dual-source iterative retrieval (EHR + corpus) that pulls factual details into RAG to boost medical QA on long clinical notes

February 19, 2025 | 7 min

Overview

Decision Snapshot: Needs Validation

The claims are backed by multi-dataset accuracy gains and ablations, but the results are limited to zero-shot LLMs, a specific dense retriever, and three benchmarks; further validation is needed before clinical deployment.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou

Links

Abstract / PDF / Code

Why It Matters For Business

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAGs, so teams can get clinically stronger retrieval without scaling model size.

Who Should Care

Summary TLDR

RGAR is a retrieval-augmented generation (RAG) method that alternates between retrieving conceptual documents from a medical corpus and extracting factual spans from a patient's EHR. The system iteratively updates its queries so that factual and conceptual knowledge refine each other. Across three medical multiple-choice QA benchmarks, RGAR raises average accuracy substantially over standard RAG and query-generation baselines, gives the biggest gains on long EHR contexts, and is faster than some iterative medical RAG systems.

Problem Statement

Current RAG methods retrieve documents without distinguishing factual details from conceptual knowledge. In medical QA, long electronic health records (EHRs) contain mostly irrelevant text for a specific question. This dilutes retrieval relevance and harms downstream answers. The problem: how to retrieve both factual EHR details and relevant conceptual documents and let them improve each other.

Main Contribution

A simple recurrent pipeline (RGAR) that alternates: generate multi-queries → retrieve corpus chunks → use retrieved concepts to extract and summarize factual spans from EHR → update queries and repeat.

A dual-source design that treats EHR factual extraction and textbook/corpus conceptual retrieval as interactive steps, improving retrieval for long EHRs.
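The alternation described above can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the three callables (`generate_queries`, `retrieve_chunks`, `extract_facts`) and the function name `rgar_style_loop` are hypothetical stand-ins for LLM and retriever calls, and seeding the first round with the raw EHR is an assumption.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces; none of these names come from the RGAR code release.
GenerateQueries = Callable[[str, str], List[str]]     # (question, facts) -> queries
RetrieveChunks = Callable[[List[str]], List[str]]     # queries -> corpus chunks
ExtractFacts = Callable[[str, str, List[str]], str]   # (question, ehr, chunks) -> fact summary


def rgar_style_loop(
    question: str,
    ehr: str,
    generate_queries: GenerateQueries,
    retrieve_chunks: RetrieveChunks,
    extract_facts: ExtractFacts,
    rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate corpus retrieval and EHR fact extraction for a fixed number of rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    facts = ehr  # assumption: the first round generates queries from the raw EHR
    chunks: List[str] = []
    for _ in range(rounds):
        # 1. Generate several queries from the question and the current facts.
        queries = generate_queries(question, facts)
        # 2. Retrieve conceptual chunks from the corpus for those queries.
        chunks = retrieve_chunks(queries)
        # 3. Use the retrieved concepts to extract and summarize facts from the EHR.
        facts = extract_facts(question, ehr, chunks)
    return facts, chunks
```

The final `facts` and `chunks` would then be passed to the answering LLM together with the question. Keeping the three steps as injected callables makes it easy to swap in an existing retriever or prompt without touching the loop.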

Key Findings

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91% over Custom baseline

Practical Use: Use RGAR when adding retrieval to a general LLM: it delivers a substantial accuracy uplift across mixed medical QA tasks.

Evidence Ref: Table 2, Sec. 4.2.1

On the long-EHR benchmark (EHRNoteQA), RGAR yields a large boost versus query-generation RAG.

Numbers: EHRNoteQA +7.8% over GAR

Practical Use: If you handle long clinical notes, add factual extraction + recurrence: this step gives the biggest improvement.

Evidence Ref: Sec. 4.2.1, Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 61.04% | Custom (no retrieval) 49.13% | +11.91% | Average (MedQA-USMLE, MedMCQA, EHRNoteQA) | Table 2 shows RGAR average 61.04% vs Custom 49.13% | Table 2
Accuracy | 73.28% | GAR 65.48% | +7.8% | EHRNoteQA | Sec. 4.2.1 reports a 7.8% improvement over GAR on EHRNoteQA | Sec. 4.2.1, Table 2
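The deltas in this table are differences between two accuracy percentages, so they are best read as percentage points rather than relative gains. The short sketch below makes the distinction explicit using only the values reported above; the helper name `accuracy_deltas` is illustrative.

```python
from typing import Tuple


def accuracy_deltas(method_acc: float, baseline_acc: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = method_acc - baseline_acc
    relative = 100.0 * points / baseline_acc
    return round(points, 2), round(relative, 2)


# Values taken from the Results rows above.
avg_points, avg_relative = accuracy_deltas(61.04, 49.13)   # RGAR vs Custom, average
ehr_points, ehr_relative = accuracy_deltas(73.28, 65.48)   # RGAR vs GAR, EHRNoteQA
print(avg_points, avg_relative)
print(ehr_points, ehr_relative)
```

On these numbers the absolute gains match the reported +11.91 and +7.8, while the relative gains come out at roughly 24% and 12% respectively. Stating which one is meant avoids confusion when comparing against other papers.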

What To Try In 7 Days

Plug RGAR recurrence (extract factual spans from EHR, then re-run retrieval) into an existing RAG pipeline and test on your EHR samples.

Use multi-query generation (3 queries) for retrieval and average similarity scores to stabilize results.

Benchmark inference time and accuracy vs your current iterative RAG; measure cost per query to decide rollout.
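For the multi-query step, one simple way to average similarity scores is to score every chunk against each query embedding and rank by the mean. The sketch below assumes embeddings are already computed as plain float vectors and uses cosine similarity; the function names and the choice of cosine are illustrative assumptions, not details confirmed by the paper summary.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_mean_similarity(
    query_vecs: List[Sequence[float]],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[str]:
    """Rank chunk ids by their similarity averaged over all query vectors."""
    if not query_vecs:
        raise ValueError("at least one query vector is required")
    mean_scores = {
        chunk_id: sum(cosine(q, vec) for q in query_vecs) / len(query_vecs)
        for chunk_id, vec in chunk_vecs.items()
    }
    # Highest mean similarity first.
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:top_k]
```

Averaging over several queries dampens the effect of a single badly worded query, which is the stabilizing effect the second item above is after. In practice the vectors would come from your dense retriever's encoder.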

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Time complexity grows with corpus size; RGAR still needs corpus retrieval at each round (Sec. Limitations).

Effectiveness depends on LLM instruction-following and large context windows; small models may not benefit (Sec. 4.2.2, Limitations).

When Not To Use

When inference cost must be minimal and you cannot afford retrieval over a large corpus.

With very small LLMs (≤1.5B) that cannot leverage retrieved context as shown in experiments.

Failure Modes

Adding large retrieved contexts can degrade numerical/arithmetic reasoning (observed on MedMCQA, where the roughly 7% of questions that involve arithmetic caused a performance drop).

If EHRs exceed LLM context limits, factual extraction assumptions break and chunk-free methods are needed.
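A cheap guard against the context-limit failure mode is to estimate the prompt size before attempting single-pass factual extraction. The sketch below is a rough heuristic under stated assumptions: it counts whitespace-separated words and multiplies by a tokens-per-word ratio, and both the ratio and the function name `fits_context` are hypothetical. A real deployment would count tokens with the model's own tokenizer.

```python
import math
from typing import Tuple


def fits_context(
    ehr_text: str,
    prompt_overhead_tokens: int,
    context_limit_tokens: int,
    tokens_per_word: float = 1.3,  # assumption: crude ratio, not a measured value
) -> Tuple[bool, int]:
    """Estimate whether the EHR plus prompt overhead fits the model's context window."""
    estimated_ehr_tokens = math.ceil(len(ehr_text.split()) * tokens_per_word)
    estimated_total = estimated_ehr_tokens + prompt_overhead_tokens
    return estimated_total <= context_limit_tokens, estimated_total
```

When the check fails, the note has to be handled by a different strategy, since the single-pass extraction step assumes the whole EHR is visible to the LLM.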

Core Entities

Models

Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct

Metrics

Accuracy

Datasets

EHRNoteQA, MedQA-USMLE, MedMCQA, Textbooks (corpus), MIMIC-IV (source for EHRNoteQA)

Benchmarks

EHRNoteQA, MedQA-USMLE, MedMCQA

Context Entities

Models

GPT-3.5-turbo (RAG baseline referenced)

Datasets

PubMedQA (mentioned), BioASQ-Y/N (mentioned)