Clinical-note Q&A by RAG: Wizard Vicuna gives high accuracy; quantization cuts latency ~48x

January 19, 2024, 7 min read

Overview

Decision Snapshot: Needs Validation

Promising engineering demo: RAG plus quantization shows clear practical gains. However, the evaluations are small and manual and rely on GPT-4 as a reference, so the evidence is preliminary; expect additional validation before production use.

Citations: 1

Evidence Strength: 0.35

Confidence: 0.70

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 45%

Novelty: 40%

Authors

Ran Elgedawy, Ioana Danciu, Maria Mahbub, Sudarshan Srinivasan

Why It Matters For Business

RAG lets teams extract factual details from clinical notes without costly model re-training; quantization makes high-capacity models usable in production by cutting latency and GPU cost.

Summary TLDR

The authors build a retrieval-augmented generation (RAG) chatbot over MIMIC clinical notes using LangChain, SentenceTransformers embeddings, and several open-source LLMs. Wizard Vicuna (13B) paired with SentenceTransformers gave the best accuracy in their tests (80% on single-document questions; 100% on a small multi-document comparison against GPT-4) but was very slow. Post-training weight quantization cut average latency from minutes to ~7.6 s and reduced GPU memory use from 17.56 GB to 11.93 GB. A small QLoRA fine-tune on 1,250 QA pairs performed poorly and produced hallucinations. The evaluations are small and rely on manual checks and GPT-4 as a reference, so the results are preliminary.
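As a quick consistency check on the figures quoted above, the arithmetic can be sketched as follows. The only inputs are the numbers from this summary; the function names are illustrative, and the "implied baseline" is a back-of-the-envelope inference from the ~48x figure, not a value reported here.

```python
# Figures quoted in the summary above.
MEM_BEFORE_GB = 17.56
MEM_AFTER_GB = 11.93
LATENCY_AFTER_S = 7.6
SPEEDUP = 48  # approximate, as quoted ("~48x")


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


def implied_baseline_latency(latency_after_s, speedup):
    """Latency before the speedup, inferred from the post-speedup latency."""
    return latency_after_s * speedup


mem_saving = relative_reduction(MEM_BEFORE_GB, MEM_AFTER_GB)
baseline_s = implied_baseline_latency(LATENCY_AFTER_S, SPEEDUP)

print(f"GPU memory reduction: {mem_saving:.1%}")
print(f"Implied pre-quantization latency: {baseline_s:.0f} s (~{baseline_s / 60:.1f} min)")
```

The result is a memory saving of about a third and an implied baseline of roughly six minutes per answer, which is consistent with the "from minutes to ~7.6 s" wording.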

Problem Statement

Clinical notes hold critical patient facts but are long and unstructured. Clinicians and researchers need a fast, conversational way to pull exact details from notes without expensive model fine-tuning.

Main Contribution

A working RAG-based conversational system (LangChain + vector DB) for querying clinical notes.

Empirical comparison of multiple embedding models and open-source LLMs on clinical-note Q&A.
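To make the retrieve-then-generate flow concrete, here is a minimal, dependency-free sketch. It is not the authors' pipeline: the bag-of-words `embed` function is a toy stand-in for a SentenceTransformers model, the in-memory list stands in for the vector DB, the note snippets are invented (not MIMIC data), and the prompt wording is an assumption. The LLM call itself is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real sentence embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble a prompt that grounds the answer in the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


# Invented example snippets, for illustration only.
notes = [
    "Patient was started on metformin 500 mg twice daily for type 2 diabetes.",
    "Chest x-ray showed no acute cardiopulmonary process.",
    "Discharge plan: follow up with cardiology in two weeks.",
]
question = "What dose of metformin was the patient started on?"
prompt = build_prompt(question, notes, k=1)
print(prompt)
```

In a real system the prompt would be passed to the LLM, and the retriever would query a persisted vector index rather than re-embedding every chunk per question.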

Key Findings

Wizard Vicuna (13B) + SentenceTransformers reached top single-document accuracy

Numbers: 80% accuracy (single-doc eval, 5 QA pairs)

Practical Use: Use a strong open-source 13B LLM with semantic embeddings when accuracy matters; expect high GPU cost.

Evidence Ref: Figure 2; Section 5.1.2

Wizard Vicuna matched GPT-4 outputs on a small multi-document test

Numbers: 100% accuracy vs GPT-4 on reported multi-doc examples

Practical Use: RAG with a high-capacity LLM can reproduce GPT-4–style answers on limited cases, but check latency and coverage first.

Evidence Ref: Section 5.3; Table 5

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 80% | - | - | 5 manual QA pairs from MIMIC | Wizard Vicuna + SentenceTransformers top single-doc pairing | Section 5.1.2; Figure 2 |
| Accuracy | 100% (Wizard Vicuna) | 60% (Flan T5) | +40 pp | multi-doc synthetic/MIMIC examples | Wizard Vicuna matched GPT-4 answers on reported examples; Flan T5 was lower | Section 5.3; Table 5 |

What To Try In 7 Days

Build a small LangChain RAG pipeline over a deidentified notes subset and test semantic embeddings.

Compare a 3B model vs a 13B open-source model on a few important queries to measure latency vs accuracy trade-offs.

Apply post-training 8/16-bit quantization and measure latency and GPU memory before investing in larger infra.
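One way to run the comparisons suggested above is a small harness that records accuracy and average latency for any question-answering callable. This is a sketch under assumptions: `dummy_model` is a hypothetical stand-in for a real model call, the QA pairs are invented, and the lenient substring match is just one possible scoring rule (the source describes manual checking).

```python
import time


def benchmark(answer_fn, qa_pairs):
    """Return (accuracy, average latency in seconds) for answer_fn on qa_pairs."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    correct = 0
    total_s = 0.0
    for question, expected in qa_pairs:
        start = time.perf_counter()
        answer = answer_fn(question)
        total_s += time.perf_counter() - start
        # Lenient scoring: correct if the expected string appears in the answer.
        if expected.lower() in answer.lower():
            correct += 1
    n = len(qa_pairs)
    return correct / n, total_s / n


def dummy_model(question):
    """Hypothetical stand-in for an LLM call; always returns the same text."""
    return "The patient was started on metformin 500 mg."


# Invented QA pairs, for illustration only.
qa_pairs = [
    ("What drug was started?", "metformin"),
    ("What was the dose?", "500 mg"),
    ("Who is the follow-up with?", "cardiology"),
]

accuracy, avg_latency_s = benchmark(dummy_model, qa_pairs)
print(f"accuracy={accuracy:.2f}, avg latency={avg_latency_s:.6f} s")
```

Running the same `qa_pairs` through a 3B model, a 13B model, and a quantized 13B model gives directly comparable accuracy and latency numbers; GPU memory would need to be read separately from the runtime in use.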

Agent Features

Memory
short-term conversation history
Tool Use
LangChain, vector DB retriever, SentenceTransformers embeddings
Frameworks
LangChain
Architectures
transformer LLMs

Optimization Features

Infra Optimization
trade-offs noted due to GPU RAM constraints
Model Optimization
post-training weight quantization, LoRA
System Optimization
chunking with overlap to respect context windows
Training Optimization
LoRA
Inference Optimization
quantization reduced latency ~48x, selecting smaller models (3B) for speed
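Chunking with overlap, listed above under System Optimization, can be sketched as a sliding window over the note text. The source does not say whether chunks are measured in characters, words, or tokens, so the word-based splitting and the default sizes below are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a fact that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.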

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Very small evaluation: primary accuracy claims come from 5 manual QA pairs or limited synthetic tests.

Used GPT-4 as a reference in some comparisons; GPT-4 itself can hallucinate.
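To illustrate how little a 5-question evaluation pins down, the sketch below computes a 95% Wilson score interval. It assumes the reported 80% corresponds to 4 correct answers out of 5, which is an inference from the figures above rather than a count stated in this summary; the interval itself is a standard formula, not something the authors report.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 80% accuracy on 5 QA pairs read as 4 correct out of 5.
low, high = wilson_interval(4, 5)
print(f"95% interval for 4/5 correct: {low:.2f} to {high:.2f}")
```

Under that assumption the interval spans roughly 0.38 to 0.96, so the single-document result is compatible with a model that is right well under half the time as well as one that is nearly always right.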

When Not To Use

For high-stakes clinical decision making without human oversight.

When you lack compute resources for large models and cannot quantize safely.

Failure Modes

Model hallucination: confident but incorrect answers, reported after the QLoRA fine-tune and possible in any generated answer.

Excessive latency with large models making real-time use impractical without quantization.

Core Entities

Models

Wizard Vicuna (13B), Vicuna (13B), RedPajama-Chat (7B), Alpaca v1/v2 (7B/13B), Med Alpaca (7B), GPT-4 x Alpaca (13B), FastChat-T5 (3B), Flan T5 XL (3B), LexPodLM (13B)

Metrics

Accuracy, inference_time_seconds, gpu_memory_gb, average_latency_seconds

Datasets

MIMIC-IV, MIMIC-IV-Note, annotated QA pairs (1,250, from prior work)