Clinical-note Q&A by RAG: Wizard Vicuna gives high accuracy; quantization cuts latency ~48x

Overview

Decision Snapshot: Needs Validation

Promising engineering demo: RAG + quantization shows clear practical gains, but small, manual evaluations and reliance on GPT-4 as a reference make evidence preliminary, so expect additional validation before production use.

Citations: 1

Evidence Strength: 0.35

Confidence: 0.70

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 45%

Novelty: 40%

Authors

Ran Elgedawy, Ioana Danciu, Maria Mahbub, Sudarshan Srinivasan

Links

Abstract / PDF / Data

Why It Matters For Business

RAG lets teams extract factual details from clinical notes without costly model re-training; quantization makes high-capacity models usable in production by cutting latency and GPU cost.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

The authors build a Retrieval-Augmented Generation (RAG) chatbot over MIMIC clinical notes using LangChain, SentenceTransformers embeddings, and several open-source LLMs. Wizard Vicuna (13B) paired with SentenceTransformers gave the best accuracy in their tests (80% on single-document questions; 100% on a small multi-document comparison against GPT-4) but was very slow. Post-training weight quantization cut average latency from minutes to ~7.6 s and reduced GPU memory use from 17.56 GB to 11.93 GB. A small QLoRA fine-tune on 1,250 QA pairs performed poorly and produced hallucinations. Evaluations are small and rely on manual checks and on GPT-4 as a reference, so the results are preliminary.
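The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this summary (17.56 GB, 11.93 GB, ~7.6 s) plus the ~48x speedup from the title. The pre-quantization latency it prints is derived here from those two figures and is not a value reported by the authors.

```python
# Figures quoted in this summary.
mem_before_gb = 17.56
mem_after_gb = 11.93
latency_after_s = 7.6
speedup = 48  # "~48x" from the title, treated as exact for illustration only

# Relative GPU memory saving after quantization.
memory_saving_pct = (mem_before_gb - mem_after_gb) / mem_before_gb * 100

# Baseline latency implied by the speedup (derived, not reported).
implied_latency_before_s = latency_after_s * speedup

print(f"GPU memory saved: {memory_saving_pct:.1f}%")
print(f"Implied pre-quantization latency: {implied_latency_before_s / 60:.1f} min")
```

This works out to roughly a 32% memory saving and an implied baseline of about six minutes per query, which is consistent with the "minutes" wording above.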

Problem Statement

Clinical notes hold critical patient facts but are long and unstructured. Clinicians and researchers need a fast, conversational way to pull exact details from notes without expensive model fine-tuning.

Main Contribution

A working RAG-based conversational system (LangChain + vector DB) for querying clinical notes.

Empirical comparison of multiple embedding models and open-source LLMs on clinical-note Q&A.
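As a rough illustration of the retrieve-then-prompt flow behind such a system, here is a minimal, dependency-free sketch. It is not the authors' pipeline: word-count vectors stand in for SentenceTransformers embeddings, a plain Python list stands in for the vector DB, and the note snippets, `embed`, `retrieve`, and `build_prompt` are all hypothetical names and data invented for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Invented, non-MIMIC snippets used purely for illustration.
chunks = [
    "Patient was started on metoprolol 25 mg twice daily for atrial fibrillation.",
    "Family history is notable for type 2 diabetes in the mother.",
    "Discharge instructions: follow up with cardiology in two weeks.",
]
question = "What dose of metoprolol was the patient started on?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment the `embed` function and the list scan would be replaced by a semantic embedding model and a vector store, and `prompt` would be passed to the chosen LLM.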

Key Findings

Wizard Vicuna (13B) + SentenceTransformers reached top single-document accuracy

Numbers: 80% accuracy (single-doc eval, 5 QA pairs)

Practical Use: Use a strong open-source 13B LLM with semantic embeddings when accuracy matters; expect high GPU cost.

Evidence Ref: Figure 2; Section 5.1.2

Wizard Vicuna matched GPT-4 outputs on a small multi-document test

Numbers: 100% accuracy vs GPT-4 on reported multi-doc examples

Practical Use: RAG with a high-capacity LLM can reproduce GPT-4–style answers on limited cases, but check latency and coverage first.

Evidence Ref: Section 5.3; Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	80%	—	—	5 manual QA pairs from MIMIC	Wizard Vicuna + SentenceTransformers top single-doc pairing	Section 5.1.2; Figure 2
Accuracy	100% (Wizard Vicuna)	60% (Flan T5)	+40 pp	multi-doc synthetic/MIMIC examples	Wizard Vicuna matched GPT-4 answers on reported examples; Flan T5 was lower	Section 5.3; Table 5

What To Try In 7 Days

Build a small LangChain RAG pipeline over a deidentified notes subset and test semantic embeddings.

Compare a 3B model vs a 13B open-source model on a few important queries to measure latency vs accuracy trade-offs.

Apply post-training 8/16-bit quantization and measure latency and GPU memory before investing in larger infra.
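The comparison steps above can be sketched as a small benchmarking harness that records mean latency and accuracy per model over the same questions. Everything model-specific here is an assumption: `small_stub` and `large_stub` are hypothetical stand-ins for a 3B and a 13B model, the QA pairs are invented, and correctness is judged by a simple substring match rather than the manual review used in the paper.

```python
import time
from statistics import mean

def benchmark(model_fn, qa_pairs):
    """Run model_fn on each question; return mean latency and substring-match accuracy."""
    latencies, correct = [], 0
    for question, expected in qa_pairs:
        start = time.perf_counter()
        answer = model_fn(question)
        latencies.append(time.perf_counter() - start)
        correct += int(expected.lower() in answer.lower())
    return {"mean_latency_s": mean(latencies), "accuracy": correct / len(qa_pairs)}

# Invented QA pairs; replace with questions over your own de-identified notes.
qa_pairs = [
    ("What is the discharge medication?", "metoprolol"),
    ("What is the follow-up interval?", "two weeks"),
]

def small_stub(question):
    """Hypothetical fast model that always gives the same answer."""
    return "metoprolol"

def large_stub(question):
    """Hypothetical slower model that answers both questions correctly."""
    time.sleep(0.02)  # simulated extra inference time
    answers = {q: a for q, a in qa_pairs}
    return f"The answer is {answers[question]}."

report = {
    "small_stub": benchmark(small_stub, qa_pairs),
    "large_stub": benchmark(large_stub, qa_pairs),
}
for name, stats in report.items():
    print(name, stats)
```

Swapping the stubs for real model calls, and running the same harness before and after quantization, gives a like-for-like view of the latency and accuracy trade-off.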

Agent Features

Memory

short-term conversation history

Tool Use

LangChain, vector DB retriever, SentenceTransformers embeddings

Frameworks

LangChain

Architectures

transformer LLMs

Optimization Features

Infra Optimization

trade-offs noted due to GPU RAM constraints

Model Optimization

post-training weight quantization, LoRA
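For readers unfamiliar with post-training weight quantization, the toy sketch below shows the basic idea on a handful of numbers: weights are mapped to 8-bit integers with a single scale factor and mapped back at inference time. The summary does not say which scheme the authors used, so the symmetric, per-tensor absmax scheme here is an assumption, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization of a list of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0  # avoid division by zero for all-zero weights
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.003, 0.9]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each stored value shrinks from a 16- or 32-bit float to one byte, at the cost of a rounding error bounded by half the scale; whether that error is tolerable for clinical Q&A has to be checked empirically.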

System Optimization

chunking with overlap to respect context windows
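A minimal sketch of chunking with overlap follows. It splits on whitespace and counts words rather than model tokens, which is a simplification; the function name `chunk_text` and the default sizes are hypothetical and not taken from the paper.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny demonstration with placeholder words w0..w9.
note = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(note, chunk_size=4, overlap=1)
for c in chunks:
    print(c)
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.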

Training Optimization

LoRA

Inference Optimization

quantization reduced latency ~48x; selecting smaller models (3B) for speed

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://physionet.org/content/mimiciv/2.2/

Risks & Boundaries

Limitations

Very small evaluation: primary accuracy claims come from 5 manual QA pairs or limited synthetic tests.

Used GPT-4 as a reference in some comparisons; GPT-4 itself can hallucinate.
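To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for 4 correct answers out of 5, which is what 80% accuracy on 5 QA pairs implies. The choice of the Wilson interval is an illustration made here; the paper summary reports no confidence intervals.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95% coverage)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(successes=4, n=5)
print(f"80% on 5 questions: plausible accuracy range {low:.0%} to {high:.0%}")
```

The interval spans roughly 38% to 96%, so the 80% figure alone cannot distinguish a strong system from a mediocre one; a larger evaluation set is needed before drawing conclusions.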

When Not To Use

For high-stakes clinical decision making without human oversight.

When you lack compute resources for large models and cannot quantize safely.

Failure Modes

Model hallucination: confident but incorrect answers after fine-tuning or generation.

Excessive latency with large models making real-time use impractical without quantization.

Core Entities

Models

Wizard Vicuna (13B), Vicuna (13B), RedPajama-Chat (7B), Alpaca v1/v2 (7B/13B), Med Alpaca (7B), GPT-4 x Alpaca (13B), FastChat-T5 (3B), Flan T5 xl (3B), LexPodLM (13B)

Metrics

Accuracy, inference_time_seconds, gpu_memory_gb, average_latency_seconds

Datasets

MIMIC-IV, MIMIC-IV-Note, Annotated QA pairs (1,250, from prior work)
