Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

May 19, 2023 · 7 min

Overview

Production Readiness

0.2

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

35

Authors

Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, Bo Wang

Links

Abstract / PDF

Why It Matters For Business

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Summary TLDR

Clinical Camel is an openly released medical language model fine-tuned from LLaMA-2 using QLoRA and a new data method called Dialogue-Based Knowledge Encoding (DBKE). Trained on one H100 GPU, the 70B variant outperforms GPT-3.5 on multiple medical QA benchmarks in five-shot tests (e.g., USMLE 64.3% vs 58.5%, PubMedQA 77.9% vs 60.2%). The authors convert clinical review articles into synthetic doctor–patient dialogues to teach clinical reasoning. The model is promising for research and note synthesis but is not ready for clinical deployment due to limited human evaluation, potential hallucinations, and outdated pre-2021 training sources.

Problem Statement

Proprietary LLMs perform well on medical tasks but are opaque and hard to validate. Open medical models exist but underperform and have short context windows. The paper aims to build an open, high-performing clinical LLM that can be trained efficiently and evaluated transparently.

Main Contribution

Clinical Camel: open medical LLM variants (13B and 70B) fine-tuned from LLaMA-2 and released on Hugging Face.

Dialogue-Based Knowledge Encoding (DBKE): a pipeline that turns dense clinical texts into multi-turn synthetic dialogues for finetuning.

Efficient single‑GPU fine-tuning using QLoRA with 4096 token context, showing competitive benchmark performance vs GPT-3.5.

Benchmark evaluation across medical QA datasets showing Clinical Camel-70B surpasses GPT-3.5 on assessed metrics in five-shot settings.
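
The DBKE contribution listed above can be pictured as two small data-shaping steps: prompt a teacher model to rewrite an article as a dialogue, then parse the returned transcript into chat turns for fine-tuning. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the prompt wording, the `Patient:`/`Doctor:` transcript format, and both function names are hypothetical, and the sample transcript is placeholder text.

```python
from typing import Dict, List


def build_dbke_prompt(article_text: str, num_exchanges: int = 5) -> str:
    """Build an instruction asking a teacher model to rewrite an article as a dialogue.

    The wording is an illustrative assumption, not the prompt used in the paper.
    """
    return (
        f"Rewrite the clinical article below as a dialogue of about {num_exchanges} "
        "exchanges. Each exchange is one 'Patient:' line followed by one 'Doctor:' "
        "line. Use only facts stated in the article.\n\n"
        f"ARTICLE:\n{article_text}"
    )


def parse_dialogue(transcript: str) -> List[Dict[str, str]]:
    """Parse 'Patient:' / 'Doctor:' lines into chat turns.

    Lines without a recognized speaker label are appended to the previous turn.
    """
    roles = {"Patient": "user", "Doctor": "assistant"}
    turns: List[Dict[str, str]] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        speaker, sep, text = line.partition(":")
        if sep and speaker in roles:
            turns.append({"role": roles[speaker], "content": text.strip()})
        elif turns:
            turns[-1]["content"] += " " + line
    return turns


# Placeholder transcript standing in for a teacher model's output.
sample = """Patient: What does the article recommend first?
Doctor: It recommends option A,
with adjustments for some patients.
Patient: Are there exceptions?
Doctor: Yes, the article lists one."""

turns = parse_dialogue(sample)
print(len(turns), "turns parsed")
```

Each parsed dialogue keeps the user/assistant role split, which is what a later fine-tuning step needs in order to treat the two sides differently.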

Key Findings

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%

Clinical Camel-70B shows consistent gains over GPT-3.5 across other benchmarks.

Numbers: MedQA 60.7% vs 53.6%; MedMCQA 54.2% vs 51.0%

The model was fine-tuned efficiently on a single GPU using QLoRA.

Numbers: single H100 GPU, one-epoch training (Table 2)

DBKE produced a large synthetic dialogue corpus from clinical articles.

Numbers: 20,000 articles → ~100,000 dialogues (avg. 5 exchanges)

Automated benchmarks alone are insufficient to claim clinical safety.

Results

Accuracy (USMLE)

Value: 64.3% (Clinical Camel-70B)

Baseline: GPT-3.5 58.5%

Accuracy (PubMedQA)

Value: 77.9% (Clinical Camel-70B)

Baseline: GPT-3.5 60.2%

Accuracy (MedQA)

Value: 60.7% (Clinical Camel-70B)

Baseline: GPT-3.5 53.6%

Accuracy (MedMCQA)

Value: 54.2% (Clinical Camel-70B)

Baseline: GPT-3.5 51.0%
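
As a quick worked check on the four accuracy pairs reported in this summary, the short script below computes the gap between Clinical Camel-70B and GPT-3.5 in percentage points. The pairing of scores to benchmark names follows the Key Findings section; the variable names are illustrative.

```python
# (Clinical Camel-70B accuracy, GPT-3.5 accuracy) in percent, five-shot,
# as reported in this summary.
results = {
    "USMLE": (64.3, 58.5),
    "PubMedQA": (77.9, 60.2),
    "MedQA": (60.7, 53.6),
    "MedMCQA": (54.2, 51.0),
}

# Absolute gain in percentage points, rounded to one decimal to hide float noise.
gains = {
    name: round(camel - gpt35, 1)
    for name, (camel, gpt35) in results.items()
}

for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: +{gain} points")
```

The gaps are uneven: PubMedQA accounts for the largest margin and MedMCQA for the smallest, so a single headline number would hide most of the variation.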

Who Should Care

What To Try In 7 Days

Download Clinical Camel from Hugging Face and run the provided inference examples.

Reproduce a few benchmark queries (USMLE, PubMedQA) to validate claims on your infra.

Convert a small set of domain articles into DBKE-style dialogues to test domain transfer rapidly.
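
For the benchmark-reproduction step, a five-shot multiple-choice evaluation needs only a prompt builder and an accuracy function. This is a minimal sketch, assuming a simple `Question / options / Answer:` layout; the exact prompt template used in the paper is not given in this summary, and the items below are placeholders rather than real benchmark questions.

```python
from typing import Any, Dict, Sequence


def format_item(item: Dict[str, Any], with_answer: bool) -> str:
    """Render one multiple-choice item; solved shots include the answer letter."""
    options = "\n".join(f"{key}. {text}" for key, text in sorted(item["options"].items()))
    answer = f" {item['answer']}" if with_answer else ""
    return f"Question: {item['question']}\n{options}\nAnswer:{answer}"


def build_few_shot_prompt(shots: Sequence[Dict[str, Any]], query: Dict[str, Any]) -> str:
    """Concatenate solved examples, then the unsolved query ending in 'Answer:'."""
    blocks = [format_item(shot, True) for shot in shots]
    blocks.append(format_item(query, False))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted answer letters that match the gold letters."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return correct / len(gold)


# Placeholder items; swap in real benchmark rows when reproducing results.
shots = [
    {"question": f"Placeholder question {i}?", "options": {"A": "yes", "B": "no"}, "answer": "A"}
    for i in range(1, 6)
]
query = {"question": "Placeholder test question?", "options": {"A": "yes", "B": "no"}, "answer": "B"}

prompt = build_few_shot_prompt(shots, query)
print(prompt)
```

Keeping the five shots fixed across all test questions makes runs comparable between models, since any difference in score then comes from the model rather than the prompt.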

Agent Features

Architectures

  • LLaMA-2 base (13B, 70B)

Optimization Features

Token Efficiency

  • Sequence length truncated to 4096 tokens

Infra Optimization

  • Fine-tuning performed on a single commercial H100 GPU
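
A back-of-the-envelope calculation shows why quantization matters for the single-GPU claim. The sketch below estimates weight storage only; it assumes nominal parameter counts (13e9 and 70e9), 4-bit storage for the quantized base model, and an 80 GB H100, none of which are stated in this summary. It ignores quantization constants, LoRA adapters, optimizer state, and activations, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


GPU_MEMORY_GB = 80  # assumption: 80 GB H100 variant

for name, num_params in (("13B", 13e9), ("70B", 70e9)):
    fp16_gb = weight_memory_gb(num_params, 16)
    four_bit_gb = weight_memory_gb(num_params, 4)
    print(
        f"{name}: fp16 weights ~{fp16_gb:.0f} GB "
        f"(fits: {fp16_gb < GPU_MEMORY_GB}), "
        f"4-bit weights ~{four_bit_gb:.1f} GB "
        f"(fits: {four_bit_gb < GPU_MEMORY_GB})"
    )
```

Under these assumptions the 70B weights exceed a single card in 16-bit precision but leave headroom once stored in 4 bits, which is the gap QLoRA is designed to close.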

Model Optimization

  • QLoRA (LoRA adapters trained on a quantized base model)

System Optimization

  • Single H100 GPU training workflow

Training Optimization

  • Masked user inputs during finetuning
  • One-epoch training with gradient accumulation
  • Paged AdamW optimizer
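
Masking user inputs means the loss is computed only on the assistant's tokens, so the model learns to answer rather than to imitate the user. The sketch below shows one common way to do this, assuming token IDs are already available per turn and using -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the function name and the toy token IDs are hypothetical, and the 4096-token truncation mirrors the sequence length mentioned in this summary.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(
    turns: Sequence[Tuple[str, List[int]]], max_len: int = 4096
) -> Tuple[List[int], List[int]]:
    """Flatten (role, token_ids) turns into input_ids and loss labels.

    Tokens from non-assistant turns get IGNORE_INDEX so they contribute no loss.
    Both sequences are truncated to max_len tokens.
    """
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids[:max_len], labels[:max_len]


# Toy token IDs; a real pipeline would get these from the model's tokenizer.
dialogue = [
    ("user", [1, 2, 3]),
    ("assistant", [4, 5]),
    ("user", [6]),
    ("assistant", [7, 8, 9]),
]

input_ids, labels = build_labels(dialogue, max_len=8)
print(input_ids)
print(labels)
```

The inputs still contain the user tokens, so the model conditions on them; only the training signal is restricted to the assistant side.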

Inference Optimization

  • 4096 token context for longer conversations

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Not validated with rigorous human clinical evaluation.
  • Training data cut off before 2021; knowledge may be outdated.
  • DBKE utility is hypothesized but not proven in controlled comparisons.
  • Model is not multimodal; cannot process images or scans.
  • Open release includes a non-clinical-use license but risks remain if used improperly.

When Not To Use

  • For real patient care or diagnosis without human oversight.
  • Where up-to-date clinical guidelines are required.
  • For image-based diagnostic tasks (no multimodal support).

Failure Modes

  • Hallucinated or misleading clinical statements.
  • Outdated medical facts from pre-2021 corpus.
  • Biases from training data producing unfair outputs.
  • Overconfident answers without citing evidence.

Core Entities

Models

  • Clinical Camel-13B
  • Clinical Camel-70B
  • LLaMA-2
  • GPT-3.5
  • GPT-4
  • Med-PaLM 2

Metrics

  • Accuracy

Datasets

  • ShareGPT
  • Clinical review articles (PubMed, pre-2021)
  • MedQA
  • MedMCQA
  • PubMedQA
  • MMLU (medical subsets)
  • USMLE Sample Exam

Benchmarks

  • USMLE Sample Exam
  • PubMedQA
  • MedQA
  • MedMCQA
  • MMLU medical subsets