BiMediX — a bilingual English/Arabic medical Mixture-of-Experts LLM plus a 1.3M bilingual medical instruction set

Overview

Decision Snapshot: Needs Validation

The model shows solid benchmark gains and much faster serving thanks to its sparse MoE design, with QLoRA keeping fine-tuning costs low, but safety evaluation is limited and the authors state it is not ready for clinical or commercial use.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 30%

Novelty: 60%

Authors

Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

Links

Abstract / PDF / Code / Data

Why It Matters For Business

BiMediX shows that bilingual medical accuracy can be delivered at much lower serving cost: accuracy similar to or better than large 70B models at roughly 8x the inference speed, which makes research deployments and low-latency prototypes cheaper.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Founder

Summary TLDR

BiMediX is a bilingual (English/Arabic) medical chatbot built by instruction-tuning a Mixtral MoE model on BiMed1.3M — a 1.3M-sample, 632M-token bilingual medical instruction set. The team used a semi-automated English→Arabic translation pipeline with human checks to create an Arabic medical benchmark. BiMediX improves diagnostic-style multiple-choice accuracy modestly over strong English medical baselines (≈+2.5% vs Med42) while delivering much faster inference (≈180.6 tps vs ~20.9 tps for 70B models). The release includes code, model, and evaluation assets for research use only; it is not ready for clinical deployment.
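As a quick sanity check on the throughput claim, the two figures quoted above can be compared directly. The sketch below uses only the numbers in this summary; the variable names and the 500-token answer length are illustrative assumptions.

```python
# Throughput figures quoted in the summary (tokens per second).
bimedix_tps = 180.6
dense_70b_tps = 20.9  # approximate figure given for the 70B baselines

# Ratio of the two throughputs.
speedup = bimedix_tps / dense_70b_tps
print(f"Speedup: {speedup:.1f}x")

# Time to generate one answer at each rate.
answer_tokens = 500  # arbitrary example length, not from the paper
bimedix_seconds = answer_tokens / bimedix_tps
dense_seconds = answer_tokens / dense_70b_tps
print(f"BiMediX: {bimedix_seconds:.1f} s, 70B dense: {dense_seconds:.1f} s")
```

The ratio comes out a little above 8.6, which is consistent with the "8x faster" figure quoted elsewhere in this summary.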

Problem Statement

Medical LLM work is dominated by English models and large dense models. Arabic medical data and evaluation tools are scarce. There is a need for a bilingual medical LLM that can (1) handle multi-turn clinical-style chats, (2) work in Arabic and English, and (3) be efficient to run.

Main Contribution

BiMediX: the first bilingual (English/Arabic) medical Mixture-of-Experts (MoE) LLM designed for multi-turn chats, MCQA, and open QA.

BiMed1.3M: an instruction-tuning dataset of ~1.31M bilingual medical interactions (632.3M tokens), including 249.7k synthesized multi-turn chats.

Key Findings

BiMediX beats Med42 and Meditron on English medical benchmarks.

Numbers: avg +2.5% vs Med42; +4.1% vs Meditron (English benchmarks)

Practical Use: You can get small but consistent accuracy gains on medical MCQA without per-dataset fine-tuning; use bilingual instruction tuning to lift English clinical accuracy.

Evidence Ref: Table 4, Sec. 4.2.2

BiMediX substantially improves Arabic medical accuracy over a large Arabic-centric model.

Numbers: avg +10% vs Jais-30B (Arabic); +15% on bilingual evals

Practical Use: Translating and bilingual-tuning on domain data yields large Arabic gains; apply similar bilingual data creation to boost low-resource language results.

Evidence Ref: Tables 2–3, Sec. 4.2.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	65.4% (BiMediX)	55.0% (Mixtral-8x7B)	+10.4 pp	Bilingual benchmark (Table 2)	Table 2: BiMediX AVG 65.4 vs Mixtral 55.0	Table 2
Accuracy	56.5% (BiMediX bilingual)	46.1% (Jais-30B)	+10.4 pp	Arabic benchmark (Table 3)	Table 3: BiMediX (Bilingual) AVG 56.5 vs Jais 46.1	Table 3
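The deltas in the table are absolute percentage-point differences, not relative improvements. A minimal sketch of the distinction, using only the values from the two rows above (the dictionary layout is just for illustration):

```python
# (BiMediX accuracy, baseline accuracy) in percent, copied from the Results table.
rows = {
    "Bilingual benchmark (Table 2)": (65.4, 55.0),
    "Arabic benchmark (Table 3)": (56.5, 46.1),
}

deltas = {}
relatives = {}
for name, (value, baseline) in rows.items():
    # Absolute difference in percentage points, rounded to one decimal.
    deltas[name] = round(value - baseline, 1)
    # Relative improvement over the baseline, in percent.
    relatives[name] = (value - baseline) / baseline * 100
    print(f"{name}: +{deltas[name]} pp ({relatives[name]:.1f}% relative)")
```

Because the Arabic baseline is lower, the same 10.4 pp gain is a larger relative improvement there than on the bilingual benchmark.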

What To Try In 7 Days

Download BiMediX code/model and run the bilingual benchmark locally to profile latency and accuracy on your hardware.

Use a subset of BiMed1.3M with QLoRA adapters to fine-tune a smaller Mixtral- or Llama-family model for domain-specific workflows.

Translate 5–10k domain examples into your target language using the paper's semi-automated pipeline and validate translations with native speakers.
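For the profiling step, a small timing harness is enough to get comparable latency and tokens-per-second numbers on your own hardware. The sketch below is a minimal version under stated assumptions: `generate` is a hypothetical callable standing in for whatever inference call you use (it is not part of the BiMediX release) and is assumed to return the sequence of generated tokens; `fake_generate` is a stand-in so the example runs without a model.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def profile_throughput(
    generate: Callable[[str], Sequence],
    prompts: List[str],
) -> dict:
    """Time each prompt and report mean latency and overall tokens/sec.

    `generate` is a placeholder for your own inference call; it must
    return the sequence of generated tokens for one prompt.
    """
    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(tokens)
    total_time = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_sec": total_tokens / total_time if total_time > 0 else float("inf"),
        "total_tokens": total_tokens,
    }


def fake_generate(prompt: str) -> List[str]:
    """Stand-in model: sleeps briefly and echoes the prompt's words."""
    time.sleep(0.01)
    return prompt.split()


if __name__ == "__main__":
    print(profile_throughput(fake_generate, ["What causes anemia?", "Define hypertension."]))
```

Swapping `fake_generate` for a real inference call should be the only change needed; keep prompt sets and generation lengths fixed when comparing models so the numbers stay comparable.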

Agent Features

Memory

32,768-token context window

Frameworks

LoRA, PEFT, PyTorch, DeepSpeed ZeRO

Architectures

MoE, Mixtral-8x7B (sparse experts, router)

Optimization Features

Token Efficiency

Instruction tuning on 632M medical tokens focused on domain interactions

Infra Optimization

8× A100-80GB GPUs; training completed in ~35 hours

Model Optimization

Sparse MoE active params reduced (~13B active of 47B total)

Router directs tokens to two experts for sparse compute
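To illustrate how two-expert routing keeps compute sparse, here is a toy sketch in plain Python. It follows the general Mixtral-style recipe (pick the two highest-scoring experts per token, apply a softmax over just those two logits, and mix their outputs). The expert functions and router logits are made-up placeholders, not BiMediX weights.

```python
import math
from typing import Callable, List, Sequence


def top2_route(
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    x: float,
) -> float:
    """Send input x to the two highest-scoring experts and mix their outputs."""
    # Indices of the two largest router logits.
    top2 = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax restricted to the selected logits (subtract the max for stability).
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # Only the two chosen experts are evaluated; the others are never called.
    return sum(w * experts[i](x) for w, i in zip(weights, top2))


# Eight toy "experts": expert k multiplies its input by k.
experts: List[Callable[[float], float]] = [lambda v, k=k: k * v for k in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 0.2]

output = top2_route(logits, experts, x=1.0)
print(output)
```

Here experts 1 and 3 tie for the highest logit, so each gets weight 0.5 and the remaining six experts do no work for this token. With roughly 13B of 47B parameters active, a little over a quarter of the weights take part in each forward step.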

System Optimization

Adapter low-rank setup (rank 128, α=64) to avoid full fine-tuning
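For readers unfamiliar with the rank and α settings, the sketch below shows the standard LoRA arithmetic: the adapter adds a low-rank update scaled by α/r and trains r·(d_in + d_out) parameters per adapted matrix instead of d_in·d_out. The 4096×4096 layer shape is an illustrative assumption, not a figure from the paper.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scale factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


rank, alpha = 128, 64   # values quoted in the summary
d_in = d_out = 4096     # assumed layer shape, for illustration only

scale = lora_scaling(alpha, rank)
adapter = lora_trainable_params(d_in, d_out, rank)
full = d_in * d_out

print(f"scale = {scale}")
print(f"adapter params = {adapter:,} vs full matrix = {full:,} ({adapter / full:.2%})")
```

Under these assumptions the update is scaled by 0.5 and the adapter holds a small fraction of the parameters of the matrix it modifies, which is what makes adapter tuning much cheaper than full fine-tuning.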

Training Optimization

LoRA

Gradient checkpointing and DeepSpeed ZeRO for memory efficiency

Inference Optimization

Sparse activation yields large throughput gains (180.6 tps)

Lower latency makes real-time chat feasible

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/mbzuai-oryx/BiMediX

Data URLs

https://huggingface.co/BiMediX

https://github.com/mbzuai-oryx/BiMediX

Risks & Boundaries

Limitations

The model can hallucinate and produce toxic or biased outputs

Limited human evaluation; medical accuracy not guaranteed

When Not To Use

Do not use for clinical decision-making or unattended diagnoses

Avoid deployment in high-stakes patient care without expert oversight

Failure Modes

Hallucinated facts or incorrect medical recommendations

Overconfident but wrong answers on MCQA or open questions

Core Entities

Models

BiMediX, Mixtral-8x7B, Mixtral, Jais-30B, Med42-70B, Meditron-70B, PMC-LLaMA-13B, Clinical Camel-70B

Metrics

Accuracy, latency (s), tokens/sec

Datasets

BiMed1.3M, PubMedQA, MedMCQA, MedQA, Medical MMLU, HealthCareMagic, iCliniq, Medical Meadow, UMLS, LiveQA, MedicationQA

Benchmarks

BiMediX Arabic benchmark (this paper), Bilingual benchmark (this paper), MedMCQA, MedQA, PubMedQA, Medical MMLU
