Jointly train retriever and medical LLM to improve accuracy, reduce hallucinations, and cut training cost

Overview

Decision Snapshot: Needs Validation

Method is practical and shows consistent gains on multiple medical QA sets, but human evaluation is small and domain generality is not tested.

Citations: 11

Evidence Strength: 0.70

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: CC-BY 4.0 (stated intent upon acceptance)

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Junda Wang, Zhichao Yang, Zonghai Yao, Hong Yu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Joint retriever+LLM fine-tuning yields better medical QA accuracy and explanations while cutting training compute by orders of magnitude versus large-scale domain pretraining, making domain-specialized models cheaper and faster to build.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

This paper introduces JMLR, a training method that updates a retriever and an LLM together so the retriever learns which documents actually help the LLM produce correct medical answers. JMLR-13B reaches an average accuracy of 70.5% across medical QA benchmarks (vs Meditron-70B 68.9% and RAG-13B 67.7%) and improves factuality and rationale quality. Joint training uses an LLM-driven rank loss and dynamic sampling of candidate docs (top-30 sampled, top-7 used) and reduces training compute: JMLR-13B ~148 GPU hours vs Meditron-70B ~42,630 GPU hours. Evaluations include automated metrics (accuracy, UMLS-F, GPT-4 scoring) and small human expert comparisons.
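To put the compute figures above in proportion, the short calculation below works out the ratio and relative reduction from the two GPU-hour numbers quoted in this summary. It is only arithmetic on those quoted values; note that it compares fine-tuning a 13B model against pretraining a 70B model, so it is not a like-for-like measurement.

```python
# GPU-hour figures as quoted in the summary above (approximate values).
jmlr_13b_gpu_hours = 148
meditron_70b_gpu_hours = 42_630

# How many times more GPU hours the larger baseline needed.
ratio = meditron_70b_gpu_hours / jmlr_13b_gpu_hours

# Fraction of the baseline's GPU hours that JMLR-13B avoids.
savings = 1 - jmlr_13b_gpu_hours / meditron_70b_gpu_hours

print(f"Meditron-70B used about {ratio:.0f}x the GPU hours of JMLR-13B")
print(f"Relative reduction in GPU hours: {savings:.2%}")
```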

Problem Statement

Medical LLMs hallucinate and can miss or misapply domain knowledge. Existing approaches either train the retriever separately from the LLM, as in traditional RAG, or continue pretraining the LLM on domain text; the former can leave retrieval misaligned with what the LLM needs, and the latter can be slow. The paper asks: can training the retriever and LLM simultaneously on QA pairs make retrieval more helpful and reduce hallucinations while saving compute?

Main Contribution

Propose JMLR, a joint training method that updates retriever and LLM together using an LLM-driven rank loss.

Show JMLR-13B achieves higher average accuracy on multiple medical QA benchmarks than prior open- and closed-source baselines.
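This summary does not give the exact form of the LLM-driven rank loss, so the sketch below is a stand-in rather than the authors' formula. It assumes a common way to let an LLM supervise a retriever: turn the LLM's per-document answer losses into a target distribution (lower loss means a more helpful document) and penalize the retriever with a KL divergence when its score distribution disagrees. The function names, the `temperature` parameter, and the toy numbers are all hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def llm_driven_rank_loss(retriever_scores, llm_answer_losses, temperature=1.0):
    """KL(target || retriever) over the candidate documents.

    Assumption: the target distribution is a softmax over the negated LLM
    answer losses, so documents that help the LLM answer get more mass.
    """
    target = softmax([-loss / temperature for loss in llm_answer_losses])
    predicted = softmax(retriever_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

# Toy example: the LLM answers best when conditioned on document 0.
llm_losses = [0.2, 1.5, 3.0]
aligned = llm_driven_rank_loss([3.0, 1.0, 0.0], llm_losses)
misaligned = llm_driven_rank_loss([0.0, 1.0, 3.0], llm_losses)
print(f"aligned retriever loss:    {aligned:.3f}")
print(f"misaligned retriever loss: {misaligned:.3f}")
```

In a real training loop the retriever scores would carry gradients, so minimizing this term would move the retriever toward documents the LLM actually benefits from, which is the alignment the contribution above describes.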

Key Findings

JMLR-13B achieves the highest reported average accuracy across evaluated medical QA sets.

Numbers: Avg accuracy 70.5% (JMLR-13B) vs 68.9% (Meditron-70B)

Practical Use: Switching to joint retriever+LLM fine-tuning can yield modest accuracy gains over large-scale domain pretraining on evaluated medical QA tasks.

Evidence Ref: Table 2

Joint training on a 7B model gives large improvements over domain pretraining.

Numbers: JMLR-7B avg 62.3% vs Meditron-7B 53.2% (≈+9.1 pp)

Practical Use: For resource-limited teams, joint retrieval training on a smaller LLM can outperform expensive domain pretraining.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.5% (JMLR-13B)	68.9% (Meditron-70B)	+1.6 pp	Table 2 averaged	Table 2 shows JMLR-13B 70.5 vs Meditron-70B 68.9	Table 2
Accuracy	62.3% (JMLR-7B)	53.2% (Meditron-7B)	+9.1 pp	Table 3 averaged	Table 3 reports JMLR-7B 62.3 vs Meditron-7B 53.2	Table 3
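The deltas in the table can be rechecked directly from the reported accuracies. The snippet below is a small sanity check using only the numbers listed above; the row labels and the `delta_pp` helper are illustrative.

```python
# (comparison, reported accuracy %, baseline accuracy %) as listed in the table.
rows = [
    ("JMLR-13B vs Meditron-70B", 70.5, 68.9),
    ("JMLR-7B vs Meditron-7B", 62.3, 53.2),
]

def delta_pp(value, baseline):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)

deltas = {name: delta_pp(value, baseline) for name, value, baseline in rows}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pp")
```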

What To Try In 7 Days

Fine-tune a small open LLM with a ColBERT retriever using an LLM-driven rank loss on a narrow corpus.

Experiment with retrieving 7 documents per query (paper found 7 optimal) and compare accuracy vs 1–10 docs.

Run GPT-4 or domain-expert checks on generated rationales to track factuality (UMLS-F and GPT-4 scoring are used here).
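The document-count experiment above can be organized as a simple sweep. The following is a minimal harness sketch, assuming you supply your own `retrieve(question, k)` and `answer(question, docs)` callables; both names, and the toy stubs used to make the example run, are hypothetical and the toy accuracies say nothing about the paper's results.

```python
def sweep_num_docs(examples, retrieve, answer, ks=range(1, 11)):
    """Return {k: accuracy} when the model sees the top-k retrieved documents."""
    results = {}
    for k in ks:
        correct = 0
        for question, gold in examples:
            docs = retrieve(question, k)
            if answer(question, docs) == gold:
                correct += 1
        results[k] = correct / len(examples)
    return results

# Toy stand-ins: a fixed ranking where the useful passage sits at rank 3.
RANKED = ["noise-a", "noise-b", "helpful", "noise-c", "noise-d",
          "noise-e", "noise-f", "noise-g", "noise-h", "noise-i"]

def toy_retrieve(question, k):
    return RANKED[:k]

def toy_answer(question, docs):
    # The toy "model" only answers correctly if the useful passage is present.
    return "A" if "helpful" in docs else "B"

examples = [("Which drug is first-line?", "A"), ("Which test is indicated?", "A")]
results = sweep_num_docs(examples, toy_retrieve, toy_answer)
for k, accuracy in results.items():
    print(f"k={k:2d}  accuracy={accuracy:.2f}")
```

With a real retriever and model, plotting accuracy against k shows whether the 7-document setting reported in the paper also holds on your corpus.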

Optimization Features

System Optimization

Use S2-Attn to handle long contexts efficiently

Training Optimization

Jointly update retriever and LLM to align retrieval with answer utility

Weighted sampling from top-30 retriever candidates during training
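One plausible reading of the sampling step is sketched below: keep the 30 highest-scoring candidates, then draw 7 of them without replacement, with higher-scoring documents more likely to be picked. The summary does not specify the weighting scheme, so the softmax weights, the function name `sample_training_docs`, and the toy scores are assumptions.

```python
import math
import random

def sample_training_docs(scored_docs, pool_size=30, num_docs=7, seed=None):
    """Draw num_docs documents, without replacement, from the top pool_size.

    scored_docs is a list of (doc_id, retriever_score) pairs.
    Assumption: sampling weights are a softmax over retriever scores.
    """
    rng = random.Random(seed)
    pool = sorted(scored_docs, key=lambda item: item[1], reverse=True)[:pool_size]
    chosen = []
    while pool and len(chosen) < num_docs:
        top_score = max(score for _, score in pool)
        weights = [math.exp(score - top_score) for _, score in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index))
    return chosen

# Toy candidates: doc0 has the highest score, doc99 the lowest.
candidates = [(f"doc{i}", 10.0 - 0.1 * i) for i in range(100)]
sampled = sample_training_docs(candidates, seed=0)
print([doc_id for doc_id, _ in sampled])
```

Sampling rather than always taking the top 7 exposes the LLM to lower-ranked documents during training, which gives the retriever a signal about candidates it currently under-ranks.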

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: CC-BY 4.0 (stated intent upon acceptance)

Code URLs

https://github.com/believewhat/JMLR-JointMedical-LLM-and-Retrieval-Training

Data URLs

https://physionet.org/content/mimiciv/1.0/

Risks & Boundaries

Limitations

Focused only on medical QA; transfer to other domains untested.

Human evaluation limited to three doctors and a small sample, reducing statistical power.

When Not To Use

If you lack a relevant domain corpus of documents to retrieve from.

For high-stakes clinical deployment without independent expert oversight and validation.

Failure Modes

Retriever selects irrelevant or misleading documents, misleading the LLM.

Over-reliance on retrieved content can propagate biases present in guidelines or corpora.

Core Entities

Models

JMLR-7B, JMLR-13B, RAG-7B, RAG-13B, Meditron-7B, Meditron-70B, Llama-2-7B, GPT-3.5, GPT-4, Claude3-Opus, ColBERT

Metrics

Accuracy, UMLS-F (factuality F1), GPT-4 score (1-5 Likert), Cohen's Kappa
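Two of these metrics are simple enough to illustrate. The sketch below computes Cohen's kappa for two raters using its standard definition, and an F1 score over sets of medical concepts. The F1 part assumes that UMLS-F compares the concepts found in a generated rationale against those in a reference; extracting concepts from text is out of scope here, and the example labels and concept strings are made up.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def concept_f1(predicted, reference):
    """F1 between two collections of concept identifiers, treated as sets."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Two hypothetical raters judging which of two answers (A or B) is better.
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])

# Hypothetical concept sets from a generated and a reference rationale.
f1 = concept_f1(
    {"hypertension", "aspirin", "stroke"},
    {"hypertension", "stroke", "warfarin", "atrial fibrillation"},
)
print(f"kappa={kappa:.2f}  concept F1={f1:.3f}")
```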

Datasets

MedQA, Amboss, MedMCQA, MMLU-Medical, PubMed, MIMIC-IV, medical textbooks

Benchmarks

USMLE-style MedQA, Amboss question bank, MedMCQA, MMLU-Medical
