Dual-source iterative retrieval (EHR + corpus) that pulls factual details into RAG to boost medical QA on long clinical notes

February 19, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou

Why It Matters For Business

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAG systems, so teams can improve retrieval quality on clinical text without scaling model size.

Summary TLDR

RGAR is a retrieval-augmented generation (RAG) method that alternates retrieving conceptual documents from a medical corpus and extracting factual spans from a patient's EHR. The system iteratively updates queries so factual and conceptual knowledge refine each other. Across three medical multiple-choice QA benchmarks, RGAR raises average accuracy substantially versus standard RAG and query-generation baselines, gives the biggest gains on long EHR contexts, and is faster than some iterative medical RAG systems.

Problem Statement

Current RAG methods retrieve documents without distinguishing factual details from conceptual knowledge. In medical QA, long electronic health records (EHRs) contain mostly irrelevant text for a specific question. This dilutes retrieval relevance and harms downstream answers. The problem: how to retrieve both factual EHR details and relevant conceptual documents and let them improve each other.

Main Contribution

A simple recurrent pipeline (RGAR) that alternates: generate multi-queries → retrieve corpus chunks → use retrieved concepts to extract and summarize factual spans from EHR → update queries and repeat.

A dual-source design that treats EHR factual extraction and textbook/corpus conceptual retrieval as interactive steps, improving retrieval for long EHRs.

Extensive evaluation on three medical QA benchmarks showing consistent accuracy gains and better efficiency than an existing iterative medical RAG method.
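The recurrence described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rgar_loop`, `RGARState`, and the three `toy_*` components are hypothetical names, the token-overlap scoring stands in for a real retriever and LLM, and the default of two rounds is an arbitrary choice.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class RGARState:
    queries: List[str]
    chunks: List[str]
    facts: str


def rgar_loop(
    question: str,
    ehr: str,
    generate_queries: Callable[[str, str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    extract_facts: Callable[[str, List[str]], str],
    rounds: int = 2,  # assumption: the round count is a free parameter here
) -> RGARState:
    """Alternate corpus retrieval and EHR fact extraction for `rounds` rounds."""
    state = RGARState(queries=[], chunks=[], facts="")
    for _ in range(rounds):
        # 1) generate queries from the question plus facts found so far
        state.queries = generate_queries(question, state.facts)
        # 2) retrieve conceptual chunks from the corpus
        state.chunks = retrieve(state.queries)
        # 3) use the retrieved concepts to pull factual text from the EHR
        state.facts = extract_facts(ehr, state.chunks)
    return state


# --- Toy components (hypothetical; real systems would call an LLM/retriever) ---
TOY_CORPUS = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "Warfarin requires INR monitoring.",
]


def toy_generate_queries(question: str, facts: str) -> List[str]:
    return [question] + ([facts] if facts else [])


def toy_retrieve(queries: List[str], k: int = 1) -> List[str]:
    query_tokens = set().union(*(tokens(q) for q in queries))
    ranked = sorted(
        TOY_CORPUS, key=lambda c: len(query_tokens & tokens(c)), reverse=True
    )
    return ranked[:k]


def toy_extract_facts(ehr: str, chunks: List[str]) -> str:
    concept_tokens = set().union(*(tokens(c) for c in chunks))
    sentences = re.split(r"(?<=[.!?])\s+", ehr.strip())
    # keep the single EHR sentence that overlaps most with retrieved concepts
    return max(sentences, key=lambda s: len(concept_tokens & tokens(s)))


EHR = (
    "Patient admitted for knee pain. "
    "History of type 2 diabetes on metformin. "
    "Discharged home."
)
QUESTION = "Which diabetes drug is the patient taking?"

state = rgar_loop(QUESTION, EHR, toy_generate_queries, toy_retrieve, toy_extract_facts)
print(state.facts)
```

The final `state.facts` and `state.chunks` would then be passed to the answering model; how that prompt is assembled is not specified in this summary, so the sketch stops before that step.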

Key Findings

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91 percentage points over Custom baseline

On the long-EHR benchmark (EHRNoteQA), RGAR yields a large boost versus query-generation RAG.

Numbers: EHRNoteQA +7.8 percentage points over GAR

A well-optimized retrieval pipeline with a smaller model can beat a larger proprietary RAG model.

Numbers: Llama-3.1-8B-Instruct + RGAR 69.52% vs GPT-3.5-RAG 66.22%

RGAR runs substantially faster than a competing iterative medical RAG system at evaluation time.

Numbers: RGAR 6h vs i-MedRAG 22h on EHRNoteQA

Results

Accuracy (average over the three benchmarks)

Value: 61.04%

Baseline: Custom (no retrieval) 49.13%

Accuracy (EHRNoteQA)

Value: 73.28%

Baseline: GAR 65.48%

Accuracy (Llama-3.1-8B-Instruct + RGAR)

Value: 69.52%

Baseline: Reported GPT-3.5 RAG 66.22%

Inference time on EHRNoteQA

Value: 6 hours

Baseline: i-MedRAG 22 hours
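The gains quoted in Key Findings are differences in accuracy (percentage points), not relative improvements. A short worked calculation over the values listed above makes the distinction explicit; the helper name `gain` and the dictionary layout are illustrative choices, and the numbers are copied from this page.

```python
def gain(value: float, baseline: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = value - baseline
    relative = absolute / baseline * 100
    return round(absolute, 2), round(relative, 1)


# (value, baseline) accuracy pairs as reported in the Results section
accuracy_results = {
    "avg vs Custom (no retrieval)": (61.04, 49.13),
    "EHRNoteQA vs GAR": (73.28, 65.48),
    "Llama-3.1-8B-Instruct + RGAR vs GPT-3.5 RAG": (69.52, 66.22),
}

for name, (value, baseline) in accuracy_results.items():
    points, percent = gain(value, baseline)
    print(f"{name}: +{points} points ({percent}% relative)")

# Inference time on EHRNoteQA: 22 hours (i-MedRAG) vs 6 hours (RGAR)
speedup = round(22 / 6, 2)
print(f"speedup: {speedup}x")
```

Read this way, the reported 6 h vs 22 h on EHRNoteQA corresponds to roughly a 3.7x reduction in evaluation time.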

Who Should Care

What To Try In 7 Days

Plug RGAR recurrence (extract factual spans from EHR, then re-run retrieval) into an existing RAG pipeline and test on your EHR samples.

Use multi-query generation (3 queries) for retrieval and average similarity scores to stabilize results.

Benchmark inference time and accuracy vs your current iterative RAG; measure cost per query to decide rollout.
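For the multi-query step above, one way to average similarity scores is to score every document against each query and rank by the mean. The sketch below assumes cosine similarity over precomputed embedding vectors; the function names and the toy 2-D vectors are hypothetical, and a real pipeline would obtain the vectors from its own embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_mean_similarity(
    query_vecs: List[Sequence[float]],
    doc_vecs: List[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Rank documents by their similarity averaged over all queries."""
    mean_scores = [
        sum(cosine(q, d) for q in query_vecs) / len(query_vecs) for d in doc_vecs
    ]
    order = sorted(range(len(doc_vecs)), key=lambda i: mean_scores[i], reverse=True)
    return [(i, mean_scores[i]) for i in order[:top_k]]


# Toy example: 3 generated queries, 3 candidate documents (2-D vectors)
query_vecs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranked = rank_by_mean_similarity(query_vecs, doc_vecs, top_k=2)
for doc_index, score in ranked:
    print(doc_index, round(score, 3))
```

In this toy case the document that is moderately close to all three queries outranks the one that matches a single query exactly, which is the stabilizing effect averaging is meant to provide.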

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Time complexity grows with corpus size; RGAR still needs corpus retrieval at each round (Sec. Limitations).
  • Effectiveness depends on LLM instruction-following and large context windows; small models may not benefit (Sec. 4.2.2, Limitations).
  • Multi-round recurrence can create multi-hop facts that may lead to over-inference; extra rounds showed no clear benefit (Sec. 4.3).

When Not To Use

  • When inference cost must be minimal and you cannot afford retrieval over a large corpus.
  • With very small LLMs (≤1.5B) that cannot leverage retrieved context as shown in experiments.
  • When ground-truth retrieval targets are known and a single-shot, targeted retrieval suffices.

Failure Modes

  • Adding large retrieved contexts can degrade numerical/arithmetic reasoning (observed on MedMCQA, where the ~7% of questions involving arithmetic caused a performance drop).
  • If EHRs exceed LLM context limits, factual extraction assumptions break and chunk-free methods are needed.
  • Incorrectly extracted factual spans can misguide subsequent retrieval and increase hallucination risk.

Core Entities

Models

  • Llama-3.1-8B-Instruct
  • Llama-3.2-3B-Instruct
  • Qwen2.5-1.5B-Instruct
  • Qwen2.5-3B-Instruct
  • Qwen2.5-7B-Instruct

Metrics

  • Accuracy

Datasets

  • EHRNoteQA
  • MedQA-USMLE
  • MedMCQA
  • Textbooks (corpus)
  • MIMIC-IV (source for EHRNoteQA)

Benchmarks

  • EHRNoteQA
  • MedQA-USMLE
  • MedMCQA

Context Entities

Models

  • GPT-3.5-turbo (RAG baseline referenced)

Datasets

  • PubMedQA (mentioned)
  • BioASQ-Y/N (mentioned)