Jointly train retriever and medical LLM to improve accuracy, reduce hallucinations, and cut training cost

February 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Method is practical and shows consistent gains on multiple medical QA sets, but human evaluation is small and domain generality is not tested.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: CC-BY 4.0 (stated intent upon acceptance)

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Junda Wang, Zhichao Yang, Zonghai Yao, Hong Yu

Why It Matters For Business

Joint retriever+LLM fine-tuning yields better medical QA accuracy and explanations while cutting training compute by more than two orders of magnitude versus large-scale domain pretraining (JMLR-13B at ~148 GPU hours vs Meditron-70B at ~42,630), making domain-specialized models cheaper and faster to build.

Summary TLDR

This paper introduces JMLR, a training method that updates a retriever and an LLM together so the retriever learns which documents actually help the LLM produce correct medical answers. JMLR-13B reaches an average accuracy of 70.5% across medical QA benchmarks (vs Meditron-70B 68.9% and RAG-13B 67.7%) and improves factuality and rationale quality. Joint training uses an LLM-driven rank loss and dynamic sampling of candidate docs (top-30 sampled, top-7 used) and reduces training compute: JMLR-13B ~148 GPU hours vs Meditron-70B ~42,630 GPU hours. Evaluations include automated metrics (accuracy, UMLS-F, GPT-4 scoring) and small human expert comparisons.
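
The compute comparison above can be checked with a few lines of arithmetic. This is only a sanity check on the two GPU-hour figures quoted in this summary; the variable names are illustrative, and the comparison is between a 13B fine-tuned model and a 70B domain-pretrained model, so it is not a like-for-like measurement.

```python
import math

# GPU-hour figures as quoted in this summary (approximate values).
jmlr_13b_gpu_hours = 148
meditron_70b_gpu_hours = 42_630

# How many times more GPU hours the larger pretraining run used.
ratio = meditron_70b_gpu_hours / jmlr_13b_gpu_hours
orders_of_magnitude = math.log10(ratio)

print(f"Meditron-70B used about {ratio:.0f}x the GPU hours of JMLR-13B")
print(f"That is about {orders_of_magnitude:.1f} orders of magnitude")
```

The ratio comes out near 288x, which is a little under two and a half orders of magnitude.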

Problem Statement

Medical LLMs hallucinate and can miss or misapply domain knowledge. Existing approaches either train the retriever separately from the LLM, which can leave retrieval misaligned with what the LLM actually needs, or continue pretraining on domain text, which is slow. The paper asks: can training the retriever and the LLM simultaneously on QA pairs make retrieval more helpful and reduce hallucinations while saving compute?

Main Contribution

Propose JMLR, a joint training method that updates retriever and LLM together using an LLM-driven rank loss.

Show JMLR-13B achieves higher average accuracy on multiple medical QA benchmarks than prior open- and closed-source baselines.
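
This summary does not give the exact form of the LLM-driven rank loss, so the following is only a minimal sketch of one plausible formulation, not the paper's formula. It assumes the retriever's score distribution over candidate documents is pulled toward a target distribution that favors documents giving the LLM a lower answer loss, measured with a KL divergence. The function names, the `temperature` parameter, and the toy numbers are all hypothetical.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def llm_driven_rank_loss(retriever_scores, llm_answer_losses, temperature=1.0):
    """KL(target || retriever), assumed formulation.

    retriever_scores: one relevance score per candidate document.
    llm_answer_losses: the LLM's loss on the gold answer when conditioned
        on each candidate document (lower means the document helped more).
    """
    # Documents that lower the LLM's loss get more target probability mass.
    target = softmax([-loss for loss in llm_answer_losses], temperature)
    predicted = softmax(retriever_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

# Toy example: document 0 helps the LLM most, document 2 least.
answer_losses = [0.2, 1.5, 3.0]
aligned = llm_driven_rank_loss([3.0, 1.0, 0.0], answer_losses)
misaligned = llm_driven_rank_loss([0.0, 1.0, 3.0], answer_losses)
print(f"aligned retriever loss:    {aligned:.3f}")
print(f"misaligned retriever loss: {misaligned:.3f}")
```

In this sketch a retriever that ranks helpful documents first receives a small loss and one that ranks them last receives a large loss, which is the kind of signal that lets the LLM's behavior guide retriever updates.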

Key Findings

JMLR-13B achieves the highest reported average accuracy across evaluated medical QA sets.

Numbers: Avg accuracy 70.5% (JMLR-13B) vs 68.9% (Meditron-70B)

Practical Use: Switching to joint retriever+LLM fine-tuning can yield modest accuracy gains over large-scale domain pretraining on evaluated medical QA tasks.

Evidence Ref: Table 2

Joint training on a 7B model gives large improvements over domain pretraining.

Numbers: JMLR-7B avg 62.3% vs Meditron-7B 53.2% (≈+9.1pp)

Practical Use: For resource-limited teams, joint retrieval training on a smaller LLM can outperform expensive domain pretraining.

Evidence Ref: Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 70.5% (JMLR-13B) | 68.9% (Meditron-70B) | +1.6 pp | Table 2 averaged | Table 2 shows JMLR-13B 70.5 vs Meditron-70B 68.9 | Table 2 |
| Accuracy | 62.3% (JMLR-7B) | 53.2% (Meditron-7B) | +9.1 pp | Table 3 averaged | Table 3 reports JMLR-7B 62.3 vs Meditron-7B 53.2 | Table 3 |

What To Try In 7 Days

Fine-tune a small open LLM with a ColBERT retriever using an LLM-driven rank loss on a narrow corpus.

Experiment with retrieving 7 documents per query (paper found 7 optimal) and compare accuracy vs 1–10 docs.

Run GPT-4 or domain-expert checks on generated rationales to track factuality (UMLS-F and GPT-4 scoring are used here).
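
For the factuality check in the last item, this summary describes UMLS-F only as a factuality F1. A minimal sketch of a set-based F1 over medical concepts is shown below. It assumes the concepts have already been extracted from the generated and reference rationales by a separate tool (not shown); the function name `concept_f1` and the example concept strings are hypothetical.

```python
def concept_f1(generated_concepts, reference_concepts):
    """F1 overlap between two collections of extracted concepts."""
    generated = set(generated_concepts)
    reference = set(reference_concepts)
    overlap = len(generated & reference)
    if overlap == 0:
        # Covers empty inputs and the no-shared-concepts case.
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Illustrative, made-up concept sets for one rationale pair.
generated = {"metformin", "type 2 diabetes", "lactic acidosis"}
reference = {"type 2 diabetes", "lactic acidosis", "renal impairment", "contraindication"}
score = concept_f1(generated, reference)
print(f"concept F1: {score:.3f}")
```

Tracking this score per question over time gives a cheap regression signal between the more expensive GPT-4 or expert reviews.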

Optimization Features

System Optimization
Use S2-Attn to handle long contexts efficiently
Training Optimization
Jointly update retriever and LLM to align retrieval with answer utility
Weighted sampling from top-30 retriever candidates during training
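
The weighted sampling step can be sketched as follows. This summary states only that candidates are sampled from the retriever's top 30 and that 7 documents are used; how the weights are defined is not given. The sketch assumes weights proportional to the softmax of the retriever scores and sampling without replacement. The function name, parameters, and toy scores are hypothetical.

```python
import math
import random

def sample_candidates(scored_docs, pool_size=30, sample_size=7, rng=None):
    """Sample documents without replacement from the top-scoring pool.

    scored_docs: iterable of (doc_id, retriever_score) pairs.
    Weights are softmax(score) over the remaining pool (an assumption).
    """
    rng = rng or random.Random()
    pool = sorted(scored_docs, key=lambda doc: doc[1], reverse=True)[:pool_size]
    chosen = []
    while pool and len(chosen) < sample_size:
        peak = max(score for _, score in pool)
        # Subtract the peak score before exponentiating for stability.
        weights = [math.exp(score - peak) for _, score in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index))
    return chosen

# Toy corpus: 100 documents with scores 0.0 .. 9.9.
docs = [(f"doc{i}", i / 10) for i in range(100)]
picked = sample_candidates(docs, rng=random.Random(0))
print([doc_id for doc_id, _ in picked])
```

Sampling rather than always taking the top 7 exposes the LLM to lower-ranked candidates during training, which gives the retriever a chance to learn that some of them are more useful than their current rank suggests.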

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: CC-BY 4.0 (stated intent upon acceptance)

Risks & Boundaries

Limitations

Focused only on medical QA; transfer to other domains untested.

Human evaluation limited to three doctors and a small sample, reducing statistical power.
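
Cohen's Kappa is listed among this summary's metrics, presumably for agreement in the human evaluation. The statistic is defined for a pair of raters, so with three doctors one would need pairwise values or a multi-rater variant; this summary does not say which was used. A minimal two-rater sketch follows, with made-up ratings and a hypothetical function name.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item list.")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)

# Made-up preference labels from two hypothetical raters.
doctor_1 = ["yes", "yes", "no", "no"]
doctor_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(doctor_1, doctor_2)
print(f"kappa = {kappa:.2f}")
```

With only a handful of rated items, as in this toy example, kappa is very sensitive to a single disagreement, which is one concrete way the small sample limits the strength of the human evaluation.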

When Not To Use

If you lack a relevant domain corpus of documents to retrieve from.

For high-stakes clinical deployment without independent expert oversight and validation.

Failure Modes

The retriever may select irrelevant or misleading documents, steering the LLM toward wrong answers.

Over-reliance on retrieved content can propagate biases present in guidelines or corpora.

Core Entities

Models

JMLR-7B, JMLR-13B, RAG-7B, RAG-13B, Meditron-7B, Meditron-70B, Llama-2-7B, GPT-3.5, GPT-4, Claude3-Opus, ColBERT

Metrics

Accuracy, UMLS-F (factuality F1), GPT-4 score (1-5 Likert), Cohen's Kappa

Datasets

MedQA, Amboss, MedMCQA, MMLU-Medical, PubMed, MIMIC-IV, medical textbooks

Benchmarks

USMLE-style MedQA, Amboss question bank, MedMCQA, MMLU-Medical