BiMediX — a bilingual English/Arabic medical Mixture-of-Experts LLM plus a 1.3M bilingual medical instruction set

February 20, 2024 | 8 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

4

Authors

Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

Links

Abstract / PDF

Why It Matters For Business

BiMediX shows that bilingual medical accuracy need not come with high serving cost: it matches or exceeds large 70B medical models on accuracy while generating tokens roughly 8x faster, which makes research deployments and low-latency prototypes cheaper to run.

Summary TLDR

BiMediX is a bilingual (English/Arabic) medical chatbot built by instruction-tuning a Mixtral MoE model on BiMed1.3M, a 1.3M-sample, 632M-token bilingual medical instruction set. The team used a semi-automated English→Arabic translation pipeline with human checks to create an Arabic medical benchmark. BiMediX modestly improves diagnostic-style multiple-choice accuracy over strong English medical baselines (≈+2.5% vs Med42) while delivering much faster inference (≈180.6 tokens/sec vs ≈20.9 tokens/sec for 70B models). The release includes code, model, and evaluation assets for research use only; it is not ready for clinical deployment.

Problem Statement

Medical LLM work is dominated by English models and large dense models. Arabic medical data and evaluation tools are scarce. There is a need for a bilingual medical LLM that can (1) handle multi-turn clinical-style chats, (2) work in Arabic and English, and (3) be efficient to run.

Main Contribution

BiMediX: the first bilingual (English/Arabic) medical Mixture-of-Experts (MoE) LLM designed for multi-turn chats, MCQA, and open QA.

BiMed1.3M: an instruction-tuning dataset of ~1.31M bilingual medical interactions (632.3M tokens), including 249.7k synthesized multi-turn chats.

A semi-automated English→Arabic iterative translation pipeline with human refinement to create a high-quality Arabic medical benchmark and bilingual instructions.

Parameter-efficient bilingual instruction tuning of Mixtral-8x7B using QLoRA adapters on the experts and the router (training ~4% of parameters), which gives strong performance with limited compute.

A public bilingual evaluation benchmark (translated from existing English medical datasets) and release of code/models for research.
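
As a rough check on the ~4% figure in the QLoRA item above, the adapter size can be estimated by hand. The sketch below assumes the publicly documented Mixtral-8x7B shapes (hidden size 4096, expert feed-forward size 14336, 32 layers, 8 experts per layer, three projection matrices per expert), the rank of 128 and the ~47B total parameter count quoted elsewhere in this summary, and adapters on the expert matrices and router only. The authors may attach adapters to other modules as well, so treat the result as an estimate, not their exact count.

```python
# Rough estimate of LoRA trainable parameters for adapters on the
# expert feed-forward matrices and the router of a Mixtral-8x7B-style model.
# All shapes below are assumptions taken from the public Mixtral-8x7B config.

HIDDEN = 4096          # model hidden size (assumed)
FFN = 14336            # expert feed-forward size (assumed)
LAYERS = 32            # transformer layers (assumed)
EXPERTS = 8            # experts per layer
MATS_PER_EXPERT = 3    # gate, up, and down projections (assumed)
RANK = 128             # LoRA rank quoted in this summary
TOTAL_PARAMS = 47e9    # approximate total parameter count quoted in this summary


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter adds an (r x d_in) matrix and a (d_out x r) matrix."""
    return rank * (d_in + d_out)


expert_params = LAYERS * EXPERTS * MATS_PER_EXPERT * lora_params(HIDDEN, FFN, RANK)
router_params = LAYERS * lora_params(HIDDEN, EXPERTS, RANK)
trainable = expert_params + router_params
fraction = trainable / TOTAL_PARAMS

print(f"trainable adapter params: {trainable / 1e9:.2f}B")
print(f"fraction of total: {fraction:.1%}")
```

Under these assumptions the estimate comes to about 1.8B adapter parameters, a little under 4% of the model, which is in line with the figure quoted above.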

Key Findings

BiMediX beats Med42 and Meditron on English medical benchmarks.

Numbers: avg +2.5% vs Med42; +4.1% vs Meditron (English benchmarks)

BiMediX substantially improves Arabic medical accuracy over a large Arabic-centric model.

Numbers: avg +10% vs Jais-30B (Arabic); +15% on bilingual evals

BiMediX is much faster at inference than large 70B medical models.

Numbers: 180.6 tokens/sec and 2.8 s latency vs 20.9 tokens/sec and 24.5 s (70B baselines)

BiMed1.3M is large and bilingual with multi-turn dialogues.

Numbers: 1.31M samples, 632.3M tokens, 249.7k chat dialogues
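
The speed finding above can be reduced to two ratios. A minimal sketch using only the throughput and latency numbers reported in this summary:

```python
# Ratios derived from the reported throughput and latency figures.
bimedix_tps, baseline_tps = 180.6, 20.9        # tokens per second
bimedix_latency, baseline_latency = 2.8, 24.5  # seconds

throughput_ratio = bimedix_tps / baseline_tps
latency_ratio = baseline_latency / bimedix_latency

print(f"throughput: {throughput_ratio:.1f}x higher")
print(f"latency: {latency_ratio:.2f}x lower")
```

Both ratios fall between 8x and 9x, so "roughly 8x faster" is a slightly conservative reading of the reported numbers.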

Results

Accuracy

Value: 65.4% (BiMediX)

Baseline: 55.0% (Mixtral-8x7B)

Accuracy

Value: 56.5% (BiMediX bilingual)

Baseline: 46.1% (Jais-30B)

Accuracy

Value: 75.4% (BiMediX)

Baseline: 72.9% (Med42-70B)

Inference throughput

Value: 180.6 tokens/sec (BiMediX)

Baseline: 20.9 tokens/sec (Med42-70B)

Inference latency

Value: 2.8 s (BiMediX)

Baseline: 24.5 s (Med42-70B)

Training compute and time

Value: ~632M tokens, 2 epochs, 8× A100-80GB, 35 hours
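
The training figures above translate into a simple compute budget. The sketch below assumes that each epoch processes the full 632.3M tokens and that the 35 hours covers both epochs on all 8 GPUs; neither detail is stated explicitly here, so the throughput numbers are approximate.

```python
# Back-of-the-envelope training throughput from the reported figures.
tokens_per_epoch = 632.3e6
epochs = 2
gpus = 8
wall_clock_hours = 35

total_tokens = tokens_per_epoch * epochs
gpu_hours = gpus * wall_clock_hours
tokens_per_second = total_tokens / (wall_clock_hours * 3600)
tokens_per_gpu_hour = total_tokens / gpu_hours

print(f"GPU-hours: {gpu_hours}")
print(f"cluster throughput: {tokens_per_second:,.0f} tokens/s")
print(f"per GPU-hour: {tokens_per_gpu_hour / 1e6:.2f}M tokens")
```

That is 280 A100 GPU-hours in total, or roughly 10k training tokens per second across the cluster, which is a useful baseline when budgeting a similar fine-tune.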

What To Try In 7 Days

Download BiMediX code/model and run the bilingual benchmark locally to profile latency and accuracy on your hardware.

Use a subset of BiMed1.3M with QLoRA adapters to fine-tune a smaller Mixtral or Llama family model for domain-specific workflows.

Translate 5–10k domain examples into your target language using the paper's semi-automated pipeline and validate translations with native speakers.
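
For the translation step above, this summary describes the pipeline only at a high level (semi-automated, iterative, with human checks). The sketch below is one possible shape for such a loop and is not the authors' implementation: `translate` and `quality_score` are hypothetical callables you would supply, and the `threshold` and `max_rounds` defaults are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def translate_with_review(
    texts: List[str],
    translate: Callable[[str], str],             # hypothetical: machine translation call
    quality_score: Callable[[str, str], float],  # hypothetical: returns a score in [0, 1]
    threshold: float = 0.8,                      # assumed acceptance cutoff
    max_rounds: int = 3,                         # assumed retry budget
) -> Tuple[List[Pair], List[Pair]]:
    """Retry low-scoring translations, then queue the rest for human review."""
    accepted: List[Pair] = []
    needs_review: List[Pair] = []
    for source in texts:
        best, best_score = "", -1.0
        for _ in range(max_rounds):
            candidate = translate(source)
            score = quality_score(source, candidate)
            if score > best_score:
                best, best_score = candidate, score
            if best_score >= threshold:
                break
        # Anything still below the cutoff goes to a native speaker.
        target = accepted if best_score >= threshold else needs_review
        target.append((source, best))
    return accepted, needs_review
```

The second returned list is the batch to hand to native speakers; for medical text it is also worth spot-checking a sample of the accepted pairs.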

Agent Features

Memory

  • 32,768-token context window

Frameworks

  • LoRA
  • PEFT
  • PyTorch
  • DeepSpeed
  • ZeRO

Architectures

  • MoE
  • Mixtral-8x7B (sparse experts, router)

Optimization Features

Token Efficiency

  • Instruction tuning on 632M medical tokens focused on domain interactions

Infra Optimization

  • 8× A100-80GB GPUs; training completed in ~35 hours

Model Optimization

  • Sparse MoE activates only ~13B of the 47B total parameters per token
  • Router directs tokens to two experts for sparse compute
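
The two-expert routing described above can be sketched in a few lines of plain Python. This is a toy illustration with scalar inputs and made-up experts, and it assumes the commonly described Mixtral-style scheme in which the mixing weights are a softmax over the two selected router logits; real implementations operate on batched tensors.

```python
import math
from typing import Callable, List, Tuple


def top2_route(router_logits: List[float]) -> List[Tuple[int, float]]:
    """Pick the two highest-scoring experts and softmax their logits."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    peak = router_logits[top[0]]  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(
    x: float,
    experts: List[Callable[[float], float]],
    router_logits: List[float],
) -> float:
    """Weighted sum over the two selected experts; the others are never called."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))
```

Because only two of the eight experts run for each token, the compute per token tracks the active parameter count, not the total.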

System Optimization

  • Low-rank adapter configuration (rank 128, α=64) avoids full fine-tuning
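
To make the rank and α values above concrete: in the standard LoRA formulation the frozen weight W is augmented with a low-rank product scaled by α divided by the rank, so this configuration would scale the update by 0.5. The sketch below assumes that standard scaling (the authors' tooling may differ) and uses tiny list-based matrices purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float, rank: int) -> Matrix:
    """Effective weight W + (alpha / rank) * B @ A, with W left unchanged."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


scale = 64 / 128  # alpha / rank for the configuration above

# Toy rank-1 example: a 2x2 identity weight with a (2x1) @ (1x2) update.
w_eff = lora_weight([[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], [[1.0, 1.0]], alpha=0.5, rank=1)
```

Only the small B and A matrices receive gradients during training, which is what keeps the trainable share of the model small.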

Training Optimization

  • LoRA
  • Gradient checkpointing and DeepSpeed ZeRO for memory efficiency

Inference Optimization

  • Sparse activation yields large throughput gains (180.6 tps)
  • Lower latency makes real-time chat feasible
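
Throughput and latency figures like those above depend heavily on hardware and serving stack, so it is worth measuring them locally. A minimal timing harness might look like the following; `generate` is a hypothetical callable that wraps whatever inference stack you use and returns the generated token ids for one prompt.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def profile_generation(
    generate: Callable[[str], List[int]],  # hypothetical: prompt -> generated token ids
    prompts: List[str],
) -> Dict[str, float]:
    """Measure mean per-prompt latency and overall tokens per second."""
    latencies: List[float] = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(tokens)
    elapsed = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

Run a few warm-up prompts before measuring, and use prompts of similar length to your real traffic, since both affect the numbers.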

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model can hallucinate or produce toxic or biased outputs
  • Limited human evaluation; medical accuracy not guaranteed
  • No explicit safety or alignment mechanisms integrated
  • Arabic translations depend on LLM plus human checks and may still contain errors

When Not To Use

  • Do not use for clinical decision-making or unattended diagnoses
  • Avoid deployment in high-stakes patient care without expert oversight
  • Do not replace certified medical professionals or regulatory processes

Failure Modes

  • Hallucinated facts or incorrect medical recommendations
  • Overconfident but wrong answers on MCQA or open questions
  • Translation errors that change clinical meaning
  • Biases inherited from pretraining data affecting minority groups

Core Entities

Models

  • BiMediX
  • Mixtral-8x7B
  • Mixtral
  • Jais-30B
  • Med42-70B
  • Meditron-70B
  • PMC-LLaMA-13B
  • Clinical Camel-70B

Metrics

  • Accuracy
  • Latency (s)
  • Throughput (tokens/sec)

Datasets

  • BiMed1.3M
  • PubMedQA
  • MedMCQA
  • MedQA
  • Medical MMLU
  • HealthCareMagic
  • iCliniq
  • Medical Meadow
  • UMLS
  • LiveQA
  • MedicationQA

Benchmarks

  • BiMediX Arabic benchmark (this paper)
  • Bilingual benchmark (this paper)
  • MedMCQA
  • MedQA
  • PubMedQA
  • Medical MMLU