Dual-source iterative retrieval (EHR + corpus) that pulls factual knowledge into RAG to boost medical QA on long clinical notes

Overview

Decision Snapshot: Needs Validation

The claims are backed by accuracy gains across multiple datasets and by ablations, but the results are limited to zero-shot LLMs, one specific dense retriever, and three benchmarks; further validation is needed before clinical deployment.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou

Links

Abstract / PDF / Code

Why It Matters For Business

RGAR boosts answer accuracy on tasks involving long clinical notes while keeping inference cost lower than heavier iterative RAGs, so teams can get clinically stronger retrieval without scaling model size.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

RGAR is a retrieval-augmented generation (RAG) method that alternates retrieving conceptual documents from a medical corpus and extracting factual spans from a patient's EHR. The system iteratively updates queries so factual and conceptual knowledge refine each other. Across three medical multiple-choice QA benchmarks, RGAR raises average accuracy substantially versus standard RAG and query-generation baselines, gives the biggest gains on long EHR contexts, and is faster than some iterative medical RAG systems.

Problem Statement

Current RAG methods retrieve documents without distinguishing factual details from conceptual knowledge. In medical QA, long electronic health records (EHRs) contain mostly irrelevant text for a specific question. This dilutes retrieval relevance and harms downstream answers. The problem: how to retrieve both factual EHR details and relevant conceptual documents and let them improve each other.

Main Contribution

A simple recurrent pipeline (RGAR) that alternates: generate multi-queries → retrieve corpus chunks → use retrieved concepts to extract and summarize factual spans from EHR → update queries and repeat.

A dual-source design that treats EHR factual extraction and textbook/corpus conceptual retrieval as interactive steps, improving retrieval for long EHRs.
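The alternating loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_queries`, `retrieve`, and `extract_facts` are hypothetical callables standing in for the LLM query generator, the dense retriever, and the LLM-based EHR extractor, and using the raw EHR as the factual context in the first round is also an assumption. The keyword-overlap stand-ins and the two-document corpus are toy values so the snippet runs on its own.

```python
from typing import Callable, Dict, List


def rgar_loop(
    question: str,
    ehr: str,
    generate_queries: Callable[[str, str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    extract_facts: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> Dict[str, object]:
    """Alternate corpus retrieval and EHR fact extraction for `rounds` rounds."""
    facts = ehr  # assumption: the raw EHR serves as the factual context in round 1
    chunks: List[str] = []
    for _ in range(rounds):
        queries = generate_queries(question, facts)  # multi-query generation
        chunks = retrieve(queries)                   # conceptual knowledge from the corpus
        facts = extract_facts(ehr, chunks)           # factual spans guided by retrieved concepts
    return {"facts": facts, "chunks": chunks}


# --- Toy stand-ins (hypothetical; a real system would call an LLM and a dense retriever) ---
CORPUS = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Beta blockers reduce heart rate.",
]


def _words(text: str) -> set:
    """Lower-cased words longer than three characters, with punctuation stripped."""
    cleaned = (w.strip(".,?").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}


def toy_generate(question: str, facts: str) -> List[str]:
    return [question, facts]


def toy_retrieve(queries: List[str]) -> List[str]:
    query_words = set().union(*(_words(q) for q in queries))
    return [doc for doc in CORPUS if _words(doc) & query_words]


def toy_extract(ehr: str, chunks: List[str]) -> str:
    concept_words = set().union(*(_words(c) for c in chunks))
    kept = [s for s in ehr.split(". ") if _words(s) & concept_words]
    return ". ".join(kept)


ehr_note = (
    "Patient has type 2 diabetes. "
    "Family enjoys hiking on weekends. "
    "Started metformin last year"
)
result = rgar_loop(
    "Which drug treats this patient's diabetes?",
    ehr_note,
    toy_generate,
    toy_retrieve,
    toy_extract,
)
print(result["facts"])
```

In this toy run the irrelevant sentence about hiking is dropped from the factual context, which is the effect the recurrence is meant to have on long notes.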

Key Findings

RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.

Numbers: Avg accuracy +11.91 percentage points over Custom baseline

Practical Use: Use RGAR when adding retrieval to a general-purpose LLM; it delivers a substantial accuracy uplift across mixed medical QA tasks.

Evidence Ref: Table 2, Sec. 4.2.1

On the long-EHR benchmark (EHRNoteQA), RGAR yields a large boost versus query-generation RAG.

Numbers: EHRNoteQA +7.8 percentage points over GAR

Practical Use: If you handle long clinical notes, add factual extraction and recurrence; these steps give the biggest improvement.

Evidence Ref: Sec. 4.2.1, Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	61.04%	Custom (no retrieval) 49.13%	+11.91%	Average (MedQA-USMLE, MedMCQA, EHRNoteQA)	Table 2 shows RGAR average 61.04% vs Custom 49.13%	Table 2
Accuracy	73.28%	GAR 65.48%	+7.8%	EHRNoteQA	Sec. 4.2.1 reports a 7.8% improvement over GAR on EHRNoteQA	Sec. 4.2.1, Table 2
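
The deltas in the table are absolute differences in accuracy (percentage points), which a quick calculation confirms. The short check below recomputes them from the Value and Baseline columns; the relative-gain figure it also computes is derived here for context and is not a number reported in the source.

```python
# (split, RGAR accuracy %, baseline accuracy %) taken from the table above
rows = [
    ("Average", 61.04, 49.13),    # baseline: Custom (no retrieval)
    ("EHRNoteQA", 73.28, 65.48),  # baseline: GAR
]

deltas = {}
for split, rgar_acc, base_acc in rows:
    absolute = round(rgar_acc - base_acc, 2)                      # percentage points
    relative = round(100 * (rgar_acc - base_acc) / base_acc, 1)   # % of baseline (derived)
    deltas[split] = (absolute, relative)
    print(f"{split}: +{absolute} points absolute, +{relative}% relative to baseline")
```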

What To Try In 7 Days

Plug RGAR recurrence (extract factual spans from EHR, then re-run retrieval) into an existing RAG pipeline and test on your EHR samples.

Use multi-query generation (3 queries) for retrieval and average similarity scores to stabilize results.

Benchmark inference time and accuracy vs your current iterative RAG; measure cost per query to decide rollout.
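
The multi-query step above can be prototyped in a few lines. The sketch below assumes the query and document embeddings have already been produced by whatever dense retriever you use; cosine similarity, the function names, and the two-dimensional toy vectors are all illustrative assumptions, since the source only states that similarity scores are averaged across queries.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_mean_similarity(
    query_vecs: List[Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[str]:
    """Score each document by its mean similarity to all queries, then return the top_k ids."""
    scores = {
        doc_id: sum(cosine(q, vec) for q in query_vecs) / len(query_vecs)
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy embeddings (made-up values): three generated queries and three corpus chunks.
queries = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top = rank_by_mean_similarity(queries, docs, top_k=2)
print(top)
```

Averaging over several queries means a single oddly phrased query has less influence on which chunks are selected, which is the stabilizing effect the step is after.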

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/RGAR-C613

Risks & Boundaries

Limitations

Time complexity grows with corpus size; RGAR still needs corpus retrieval at each round (Sec. Limitations).

Effectiveness depends on LLM instruction-following and large context windows; small models may not benefit (Sec. 4.2.2, Limitations).

When Not To Use

When inference cost must be minimal and you cannot afford retrieval over a large corpus.

With very small LLMs (≤1.5B parameters), which the experiments show cannot leverage retrieved context.

Failure Modes

Adding large retrieved contexts can degrade numerical/arithmetic reasoning (observed on MedMCQA, where the roughly 7% of questions involving arithmetic caused a performance drop).

If EHRs exceed LLM context limits, factual extraction assumptions break and chunk-free methods are needed.
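
A cheap pre-check can flag notes that are likely to hit the context-limit failure mode before any extraction is attempted. The sketch below is a rough heuristic, not part of RGAR: the function name, the 1.3 tokens-per-word ratio, and the reserved-token budget are assumptions, and a real deployment would count tokens with the target model's own tokenizer.

```python
def fits_context(
    ehr_text: str,
    context_limit_tokens: int,
    reserved_tokens: int = 1024,
    tokens_per_word: float = 1.3,
) -> bool:
    """Estimate whether an EHR plus a reserved budget fits in the model's context window.

    tokens_per_word is a crude assumed ratio; reserved_tokens covers the question,
    retrieved chunks, and the generated answer.
    """
    estimated_tokens = int(len(ehr_text.split()) * tokens_per_word)
    return estimated_tokens + reserved_tokens <= context_limit_tokens


short_note = "word " * 1000    # about 1,300 estimated tokens
long_note = "word " * 10000    # about 13,000 estimated tokens
print(fits_context(short_note, 8192), fits_context(long_note, 8192))
```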

Core Entities

Models

Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct

Metrics

Accuracy

Datasets

EHRNoteQA, MedQA-USMLE, MedMCQA, Textbooks (corpus), MIMIC-IV (source for EHRNoteQA)

Benchmarks

EHRNoteQA, MedQA-USMLE, MedMCQA

Context Entities

Models

GPT-3.5-turbo (RAG baseline referenced)

Datasets

PubMedQA (mentioned), BioASQ-Y/N (mentioned)
