Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
RGAR improves answer accuracy on tasks involving long clinical notes while keeping inference cost below that of heavier iterative RAG systems, so teams can get stronger clinical retrieval without scaling up model size.
Summary TLDR
RGAR is a retrieval-augmented generation (RAG) method that alternates between retrieving conceptual documents from a medical corpus and extracting factual spans from a patient's EHR. The system iteratively updates its queries so that factual and conceptual knowledge refine each other. Across three medical multiple-choice QA benchmarks, RGAR raises average accuracy substantially over standard RAG and query-generation baselines, gives the biggest gains on long EHR contexts, and is faster than some iterative medical RAG systems.
Problem Statement
Current RAG methods retrieve documents without distinguishing factual details from conceptual knowledge. In medical QA, long electronic health records (EHRs) consist mostly of text that is irrelevant to any specific question. This dilutes retrieval relevance and harms downstream answers. The problem is how to retrieve both factual EHR details and relevant conceptual documents, and how to let each improve the other.
Main Contribution
A simple recurrent pipeline (RGAR) that alternates: generate multiple queries → retrieve corpus chunks → use the retrieved concepts to extract and summarize factual spans from the EHR → update the queries and repeat.
A dual-source design that treats EHR factual extraction and textbook/corpus conceptual retrieval as interactive steps, improving retrieval for long EHRs.
Extensive evaluation on three medical QA benchmarks showing consistent accuracy gains and better efficiency than an existing iterative medical RAG method.
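The recurrence described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function `rgar_sketch`, the three callables it takes, the toy corpus, and the keyword-overlap scoring are all hypothetical stand-ins for an LLM query generator, a dense retriever, and an LLM-based extractor. It also assumes the first round generates queries from the question alone.

```python
from typing import Callable, List, Set, Tuple

def rgar_sketch(
    question: str,
    ehr: str,
    generate_queries: Callable[[str, str], List[str]],
    retrieve: Callable[[List[str]], List[str]],
    extract_facts: Callable[[str, str, List[str]], str],
    rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate conceptual retrieval and factual extraction (hypothetical API)."""
    facts = ""            # assumption: no extracted facts before the first round
    docs: List[str] = []
    for _ in range(rounds):
        queries = generate_queries(question, facts)   # generate multiple queries
        docs = retrieve(queries)                      # retrieve corpus chunks
        facts = extract_facts(question, ehr, docs)    # extract factual spans from the EHR
    return facts, docs

# --- Toy stand-ins so the sketch runs end to end (not real retrieval) ---
CORPUS = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Beta blockers reduce heart rate.",
]

def _words(text: str) -> Set[str]:
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def toy_generate_queries(question: str, facts: str) -> List[str]:
    # Later rounds add the extracted facts as an extra query.
    return [question] if not facts else [question, facts]

def toy_retrieve(queries: List[str]) -> List[str]:
    # Keyword overlap stands in for embedding similarity.
    best = max(CORPUS, key=lambda d: max(len(_words(q) & _words(d)) for q in queries))
    return [best]

def toy_extract_facts(question: str, ehr: str, docs: List[str]) -> str:
    # Keep EHR sentences that share at least two words with the retrieved concepts.
    concept_words = set().union(*(_words(d) for d in docs)) | _words(question)
    sentences = [s.strip() for s in ehr.split(".") if s.strip()]
    return ". ".join(s for s in sentences if len(_words(s) & concept_words) >= 2)

QUESTION = "Which drug treats the patient's diabetes?"
EHR = "Patient enjoys gardening. Patient has type 2 diabetes. Blood pressure normal."

facts, docs = rgar_sketch(QUESTION, EHR, toy_generate_queries,
                          toy_retrieve, toy_extract_facts, rounds=2)
print(facts)
```

In a real pipeline each callable would wrap a model or index call; the loop structure is the only part taken from the description above.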
Key Findings
RGAR improves average accuracy across three factual-aware medical QA benchmarks compared with the non-retrieval baseline.
On the long-EHR benchmark (EHRNoteQA), RGAR yields a large boost versus query-generation RAG.
A well-optimized retrieval pipeline with a smaller model can beat a larger proprietary RAG model.
RGAR runs substantially faster than a competing iterative medical RAG system at evaluation time.
Results
Accuracy
Inference time on EHRNoteQA
Who Should Care
What To Try In 7 Days
Plug the RGAR recurrence (extract factual spans from the EHR, then re-run retrieval) into an existing RAG pipeline and test it on your own EHR samples.
Use multi-query generation (3 queries) for retrieval and average the similarity scores across queries to stabilize results.
Benchmark inference time and accuracy against your current iterative RAG; measure cost per query to decide on rollout.
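As one way to try the multi-query step, the sketch below averages cosine similarity over several query embeddings before ranking documents. The function names, the two-dimensional toy vectors, and the document IDs are hypothetical; in practice the vectors would come from whatever embedding model your pipeline already uses.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_mean_similarity(
    query_vecs: List[Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[str]:
    """Rank documents by similarity averaged over all query embeddings."""
    mean_scores = {
        doc_id: sum(cosine(q, vec) for q in query_vecs) / len(query_vecs)
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(mean_scores, key=mean_scores.get, reverse=True)[:top_k]

# Toy example: three queries (as suggested above) and three documents.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
documents = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_by_mean_similarity(queries, documents, top_k=2)
print(ranking)
```

Averaging favors documents that are moderately relevant to every query over documents that match only one of them, which is why document "c" ranks first in the toy data.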
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Time complexity grows with corpus size; RGAR still needs corpus retrieval at each round (Sec. Limitations).
- Effectiveness depends on LLM instruction-following and large context windows; small models may not benefit (Sec. 4.2.2, Limitations).
- Multi-round recurrence can create multi-hop facts that may lead to over-inference; extra rounds showed no clear benefit (Sec. 4.3).
When Not To Use
- When inference cost must be minimal and you cannot afford retrieval over a large corpus.
- With very small LLMs (≤1.5B parameters), which the experiments show cannot leverage retrieved context.
- When ground-truth retrieval targets are known and a single-shot, targeted retrieval suffices.
Failure Modes
- Adding large retrieved contexts can degrade numerical/arithmetic reasoning (observed on MedMCQA, where the ~7% of questions that involve arithmetic caused a performance drop).
- If EHRs exceed LLM context limits, factual extraction assumptions break and chunk-free methods are needed.
- Incorrectly extracted factual spans can misguide subsequent retrieval and increase hallucination risk.
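A simple guard can catch the context-limit failure mode before extraction is attempted. The sketch below is an assumption-laden illustration: `fits_context`, the `reserved` budget for the prompt and retrieved chunks, and the default whitespace-based token count are all hypothetical, and a real check should pass in the tokenizer of the model being used.

```python
from typing import Callable

def fits_context(
    ehr_text: str,
    context_limit: int,
    reserved: int = 1024,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> bool:
    """Return True if the EHR plus a reserved budget fits in the context window.

    The default count_tokens splits on whitespace, which is only a rough
    approximation; supply the model's own tokenizer for a reliable count.
    """
    return count_tokens(ehr_text) + reserved <= context_limit

# Usage: route oversized records to a different strategy instead of truncating silently.
note = "Patient has type 2 diabetes. Blood pressure normal."
if not fits_context(note, context_limit=8192):
    print("EHR too long for single-pass extraction")
```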
Core Entities
Models
- Llama-3.1-8B-Instruct
- Llama-3.2-3B-Instruct
- Qwen2.5-1.5B-Instruct
- Qwen2.5-3B-Instruct
- Qwen2.5-7B-Instruct
Metrics
- Accuracy
Datasets
- EHRNoteQA
- MedQA-USMLE
- MedMCQA
- Textbooks (corpus)
- MIMIC-IV (source for EHRNoteQA)
Benchmarks
- EHRNoteQA
- MedQA-USMLE
- MedMCQA
Context Entities
Models
- GPT-3.5-turbo (RAG baseline referenced)
Datasets
- PubMedQA (mentioned)
- BioASQ-Y/N (mentioned)

