BiMediX — a bilingual English/Arabic medical Mixture-of-Experts LLM plus a 1.3M bilingual medical instruction set

February 20, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The model shows solid benchmark gains and much faster serving thanks to its sparse MoE architecture, with QLoRA keeping fine-tuning affordable, but safety evaluation is limited and the authors state it is not ready for clinical or commercial use.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 30%

Novelty: 60%

Authors

Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BiMediX shows that bilingual medical accuracy can be delivered at much lower serving cost: accuracy similar to or better than large 70B models while generating tokens roughly 8x faster, which makes research deployments and low-latency prototypes cheaper.

Summary TLDR

BiMediX is a bilingual (English/Arabic) medical chatbot built by instruction-tuning a Mixtral MoE model on BiMed1.3M, a 1.3M-sample, 632M-token bilingual medical instruction set. The team used a semi-automated English→Arabic translation pipeline with human checks to create an Arabic medical benchmark. BiMediX improves diagnostic-style multiple-choice accuracy modestly over strong English medical baselines (≈+2.5% vs Med42) while delivering much faster inference (≈180.6 tokens per second vs ~20.9 for 70B models). The release includes code, model, and evaluation assets for research use only; it is not ready for clinical deployment.
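The throughput figures quoted above imply the speedup directly. A minimal sketch of that arithmetic, assuming a constant decode rate and ignoring prompt processing; the 512-token reply length is a hypothetical value chosen for illustration, not a number from the paper:

```python
# Throughput figures quoted in the summary (tokens per second).
BIMEDIX_TPS = 180.6
DENSE_70B_TPS = 20.9


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two decode throughputs."""
    return fast_tps / slow_tps


def decode_seconds(num_tokens: int, tps: float) -> float:
    """Time to emit num_tokens at a constant rate (prompt processing ignored)."""
    return num_tokens / tps


REPLY_TOKENS = 512  # hypothetical reply length, for illustration only

print(f"speedup: {speedup(BIMEDIX_TPS, DENSE_70B_TPS):.1f}x")
print(f"BiMediX: {decode_seconds(REPLY_TOKENS, BIMEDIX_TPS):.1f} s")
print(f"70B dense: {decode_seconds(REPLY_TOKENS, DENSE_70B_TPS):.1f} s")
```

Under these assumptions the ratio comes out near 8.6x, which is consistent with the "8x faster" headline; real latency also depends on batch size, hardware, and prompt length.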

Problem Statement

Medical LLM work is dominated by English models and large dense models. Arabic medical data and evaluation tools are scarce. There is a need for a bilingual medical LLM that can (1) handle multi-turn clinical-style chats, (2) work in Arabic and English, and (3) be efficient to run.

Main Contribution

BiMediX: the first bilingual (English/Arabic) medical Mixture-of-Experts (MoE) LLM designed for multi-turn chats, MCQA, and open QA.

BiMed1.3M: an instruction-tuning dataset of ~1.31M bilingual medical interactions (632.3M tokens), including 249.7k synthesized multi-turn chats.

Key Findings

BiMediX beats Med42 and Meditron on English medical benchmarks.

Numbers: avg +2.5% vs Med42; +4.1% vs Meditron (English benchmarks)

Practical Use: You can get small but consistent accuracy gains on medical MCQA without per-dataset fine-tuning; use bilingual instruction tuning to lift English clinical accuracy.

Evidence Ref: Table 4, Sec. 4.2.2

BiMediX substantially improves Arabic medical accuracy over a large Arabic-centric model.

Numbers: avg +10% vs Jais-30B (Arabic); +15% on bilingual evals

Practical Use: Translating and bilingual-tuning on domain data yields large Arabic gains; apply similar bilingual data creation to boost low-resource language results.

Evidence Ref: Tables 2–3, Sec. 4.2.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 65.4% (BiMediX) | 55.0% (Mixtral-8x7B) | +10.4 pp | Bilingual benchmark (Table 2) | Table 2: BiMediX AVG 65.4 vs Mixtral 55.0 | Table 2
Accuracy | 56.5% (BiMediX bilingual) | 46.1% (Jais-30B) | +10.4 pp | Arabic benchmark (Table 3) | Table 3: BiMediX (Bilingual) AVG 56.5 vs Jais 46.1 | Table 3
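The deltas in the results above are absolute differences in percentage points (pp), which are easy to confuse with relative improvements. A small sketch that recomputes both from the reported averages; the helper names `delta_pp` and `relative_gain` are illustrative, not from the paper:

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points."""
    return round(value - baseline, 1)


def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (value - baseline) / baseline, 1)


# (benchmark, BiMediX average accuracy, baseline average accuracy)
rows = [
    ("Bilingual benchmark (Table 2)", 65.4, 55.0),
    ("Arabic benchmark (Table 3)", 56.5, 46.1),
]

for name, value, baseline in rows:
    print(f"{name}: {delta_pp(value, baseline):+.1f} pp, "
          f"{relative_gain(value, baseline):+.1f}% relative")
```

Both rows give +10.4 pp, which corresponds to roughly 18.9% and 22.6% relative improvement respectively, so it matters which of the two a comparison quotes.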

What To Try In 7 Days

Download BiMediX code/model and run the bilingual benchmark locally to profile latency and accuracy on your hardware.

Use a subset of BiMed1.3M and QLoRA adapters to fine-tune a smaller Mixtral or Llama-family model for domain-specific workflows.

Translate 5–10k domain examples into your target language using the paper's semi-automated pipeline and validate translations with native speakers.
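The first item above (profiling latency and throughput locally) can be sketched as a small timing harness. This is a minimal sketch, not the authors' evaluation code: `measure_throughput` and `fake_generate` are hypothetical names, and `fake_generate` stands in for whatever call wraps your local model and returns the generated token IDs.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_throughput(
    generate: Callable[[str], Sequence[int]],
    prompts: List[str],
) -> Dict[str, float]:
    """Time each generation call and report mean latency and tokens/sec."""
    if not prompts:
        raise ValueError("need at least one prompt")
    total_tokens = 0
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        output_tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output_tokens)
    total_time = sum(latencies)
    return {
        "total_tokens": total_tokens,
        "mean_latency_s": total_time / len(latencies),
        "tokens_per_sec": total_tokens / total_time if total_time > 0 else float("inf"),
    }


def fake_generate(prompt: str) -> List[int]:
    """Stand-in for a real model call: sleeps briefly, returns 20 dummy token IDs."""
    time.sleep(0.01)
    return list(range(20))


stats = measure_throughput(fake_generate, ["q1", "q2", "q3"])
print(stats)
```

Swapping `fake_generate` for a real model call keeps the measurement logic unchanged. For a fair comparison with the reported numbers, batch size, precision, and hardware would need to match the paper's setup, which this sketch does not attempt.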

Agent Features

Memory: 32,768-token context window
Frameworks: LoRA, PEFT, PyTorch, DeepSpeed, ZeRO
Architectures: MoE, Mixtral-8x7B (sparse experts, router)
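The sparse-expert router listed under Architectures sends each token to two experts and mixes their outputs. A toy sketch of top-2 routing, under loud assumptions: tokens are single floats rather than vectors, the router logits are hard-coded rather than learned, the eight experts are trivial functions, and the gate weights are a softmax over the two selected logits (one common formulation; the actual model's details may differ).

```python
import math
from typing import Callable, List, Sequence, Tuple


def top2_route(router_logits: Sequence[float]) -> List[Tuple[int, float]]:
    """Pick the two highest-scoring experts and softmax-normalize their logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    peak = max(router_logits[i] for i in top2)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


def moe_forward(x: float,
                router_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]]) -> float:
    """Weighted sum of the two selected experts; the other experts never run."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))


# Hypothetical toy experts: expert k scales its input by (k + 1).
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]
logits = [0.1, 2.0, -1.0, 1.0, 0.0, 0.5, -0.3, 0.2]  # made-up router scores

print(top2_route(logits))
print(moe_forward(1.0, logits, experts))
```

Only two of the eight expert functions are evaluated per token, which is the mechanism behind the reported ~13B active parameters out of 47B total.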

Optimization Features

Token Efficiency: Instruction tuning on 632M medical tokens focused on domain interactions
Infra Optimization: 8× A100-80GB GPUs; training completed in ~35 hours
Model Optimization: Sparse MoE reduces active parameters (~13B active of 47B total); the router directs each token to two experts for sparse compute
System Optimization: Low-rank adapter setup (rank 128, α=64) to avoid full fine-tuning
Training Optimization: LoRA; gradient checkpointing and DeepSpeed ZeRO for memory efficiency
Inference Optimization: Sparse activation yields large throughput gains (180.6 tps); lower latency makes real-time chat feasible
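The adapter setting listed above (rank 128, α=64) determines both the scaling applied to the low-rank update and how few parameters are trained. A short sketch of that arithmetic using the standard LoRA scaling α/r; the 4096×4096 layer shape is a hypothetical example for illustration, not a figure taken from the paper.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 128, 64   # values reported in the summary
D_IN = D_OUT = 4096     # hypothetical layer shape, for illustration only

full = D_IN * D_OUT
adapter = lora_param_count(D_IN, D_OUT, RANK)

print(f"scaling alpha/r = {lora_scaling(RANK, ALPHA)}")
print(f"adapter params = {adapter:,} ({100 * adapter / full:.2f}% of {full:,})")
```

For this assumed layer shape the adapter trains about 6% of the weights the full matrix would require, with the update scaled by 0.5.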

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

The model can hallucinate or produce toxic or biased outputs

Limited human evaluation; medical accuracy not guaranteed

When Not To Use

Do not use for clinical decision-making or unattended diagnoses

Avoid deployment in high-stakes patient care without expert oversight

Failure Modes

Hallucinated facts or incorrect medical recommendations

Overconfident but wrong answers on MCQA or open questions

Core Entities

Models

BiMediX, Mixtral-8x7B, Mixtral, Jais-30B, Med42-70B, Meditron-70B, PMC-LLaMA-13B, Clinical Camel-70B

Metrics

Accuracy, latency (s), tokens/sec

Datasets

BiMed1.3M, PubMedQA, MedMCQA, MedQA, Medical MMLU, HealthCareMagic, iCliniq, Medical Meadow, UMLS, LiveQA, MedicationQA

Benchmarks

BiMediX Arabic benchmark (this paper), Bilingual benchmark (this paper), MedMCQA, MedQA, PubMedQA, Medical MMLU