MedInjection-FR: 571K French biomedical instruction pairs show native data helps most; mixed sources add value

March 6, 2026 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

Links

Abstract / PDF

Why It Matters For Business

When native-domain instruction data are scarce, mixing translated and synthetic examples with limited native supervision can deliver competitive model performance while lowering data-collection costs.

Summary TLDR

The authors release MedInjection-FR, a 571K-pair French biomedical instruction dataset made from native (77K), synthetic (76K), and translated (417K) sources. They fine-tune Qwen-4B-Instruct under seven controlled mixes (33,493 examples per setup) and find: native data gives the biggest single-source gains; mixing native with translated or synthetic data often improves results further; synthetic-only performs worst but helps when balanced with native. Evaluation combines automatic metrics, LLM-as-a-judge, and human expert checks and flags evaluation caveats like verbosity and positional bias.

Problem Statement

High-quality French biomedical instruction pairs are scarce. Can translated or LLM-generated synthetic data substitute for native French supervision when tuning LLMs for medical tasks?

Main Contribution

Release MedInjection-FR: 571,436 French biomedical instruction-response pairs that mix native, synthetic, and translated sources.

Design a controlled ablation framework (seven equal-size configurations) to isolate the effect of data provenance on instruction tuning.

Empirically show native data best anchors performance; mixed datasets (especially native+translated) give complementary gains; synthetic-only is weakest but useful combined with native.

Evaluate open-ended QA (OEQ) with automatic metrics, LLM-as-a-judge, and human expert calibration, and expose evaluation blind spots (verbosity bias, position bias).
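The controlled-mix idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the seven configurations are the non-empty subsets of the three sources (NAT, SYN, TRAD), that each configuration splits the 33,493-example budget as evenly as possible across its sources, and that `pools` is a hypothetical dict mapping each source name to a list of examples.

```python
import random
from itertools import combinations

SOURCES = ("NAT", "SYN", "TRAD")  # native, synthetic, translated
BUDGET = 33_493  # examples per configuration, as reported in the summary


def configurations(sources=SOURCES):
    """All non-empty subsets of the sources: 3 singles, 3 pairs, 1 triple."""
    return [
        combo
        for k in range(1, len(sources) + 1)
        for combo in combinations(sources, k)
    ]


def quotas(combo, budget=BUDGET):
    """Split the budget evenly (assumption); earlier sources absorb the remainder."""
    base, extra = divmod(budget, len(combo))
    return {src: base + (1 if i < extra else 0) for i, src in enumerate(combo)}


def build_mix(pools, combo, budget=BUDGET, seed=0):
    """Sample each source's quota without replacement, then shuffle the mix."""
    rng = random.Random(seed)
    mix = []
    for src, n in quotas(combo, budget).items():
        mix.extend(rng.sample(pools[src], n))
    rng.shuffle(mix)
    return mix
```

Holding the budget fixed across configurations is what lets a difference in score be attributed to data provenance rather than to training-set size.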

Key Findings

MedInjection-FR totals 571,436 instruction-response pairs composed of native, synthetic, and translated sources.

Numbers: Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506

Native-only fine-tuning gives the best single-source multiple-choice QA (MCQA) performance.

Numbers: Aggregated constrained EM: NAT=40.59 vs Base=37.31 and SYN=29.73

Combining native with translated examples often yields the highest overall MCQA accuracy.

Numbers: NAT-TRAD aggregated constrained EM=41.37 (best) vs ALL=40.97

Synthetic-only supervision underperforms but adds value when balanced with native data.

Numbers: SYN aggregated constrained EM=29.73; NAT-SYN constrained EM=39.25

LLM judges correlate with human expert ratings better than lexical metrics; a domain-tuned LLM performed best as judge.

Numbers: Pearson r (MedGemma-27B vs human)=0.61; ROUGE-2 r≈0.36; BLEU r≈0.02

Evaluation artifacts affect absolute scores but not comparative trends.

Numbers: Randomizing answer order drops base constrained EM from 37.31 to 23.20, but the ranking (native > translated > synthetic) is unchanged.

Results

MedInjection-FR size by source

Value: Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506

Aggregated MCQA Exact Match (constrained decoding)

Value: NAT=40.59; TRAD=36.44; SYN=29.73; NAT-TRAD=41.37; ALL=40.97; Base=37.31

Baseline: Qwen-4B base=37.31
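For readers unfamiliar with the MCQA metrics, the sketch below shows how Exact Match and a Hamming score are commonly computed when a question may have several correct options. The Hamming definition used here (intersection over union of selected options) is an assumption, since the summary does not define it, and the function names are illustrative.

```python
def exact_match(pred, gold):
    """1.0 if the predicted option set equals the gold set, else 0.0."""
    return float(set(pred) == set(gold))


def hamming_score(pred, gold):
    """Overlap between predicted and gold option sets (assumed definition)."""
    pred, gold = set(pred), set(gold)
    union = pred | gold
    if not union:  # both empty: treat as perfect agreement
        return 1.0
    return len(pred & gold) / len(union)


def aggregate(preds, golds, metric):
    """Mean of a per-question metric, reported on a 0-100 scale."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and aligned")
    return 100.0 * sum(metric(p, g) for p, g in zip(preds, golds)) / len(golds)
```

Under this definition, Exact Match gives no credit for partially correct multi-answer predictions, while the Hamming score does, so the two can diverge noticeably on multi-answer questions.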

Accuracy

Value: Base=0.36; NAT=0.24; SYN=0.31; ALL=0.25

Baseline: Base model=0.36

LLM judge vs human Pearson correlation

Value: MedGemma-27B r=0.61; Gemini 2.0 Flash r=0.57; GPT-4.1-mini r=0.49

Baseline: Top judge r=0.61
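The correlations above compare per-answer judge scores with human expert ratings. A minimal sketch of the Pearson coefficient follows; how the authors paired and scaled the scores is not stated in the summary, so the function only assumes two aligned lists of numbers.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two aligned lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two aligned sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

Applied once per candidate judge (or lexical metric) against the same human ratings, this yields the kind of ranking reported above. Pearson r measures linear agreement only; Spearman's ρ and Kendall's τ, listed under Context Entities, are the rank-based alternatives.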

Position bias effect on constrained EM (base)

Value: Base constrained EM single-run=37.31 -> permuted=23.20

Baseline: Single-run constrained EM=37.31
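A position-bias probe of this kind can be approximated by shuffling which text sits behind each option letter and remapping the gold labels accordingly. The sketch below is an assumption about the general technique, not the authors' exact protocol; `options` is a hypothetical dict from option letter to answer text.

```python
import random


def permute_options(options, gold_labels, rng):
    """Shuffle answer texts across option letters and remap the gold labels.

    options: dict such as {"A": "text", "B": "text", ...}
    gold_labels: iterable of correct letters in the original ordering
    rng: a random.Random instance, so permutations are reproducible
    """
    labels = sorted(options)
    order = list(range(len(labels)))
    rng.shuffle(order)
    # The letter at position i now shows the text formerly at position order[i].
    new_options = {labels[i]: options[labels[j]] for i, j in enumerate(order)}
    old_to_new = {labels[j]: labels[i] for i, j in enumerate(order)}
    new_gold = {old_to_new[g] for g in gold_labels}
    return new_options, new_gold
```

Scoring the same model on several permuted copies of each question and comparing against the single-run score gives a rough estimate of how much accuracy depends on answer position rather than content.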

Who Should Care

What To Try In 7 Days

Start a small native French dataset collection (5k–10k expert pairs) as an anchor for further tuning.

Translate reliable English biomedical instruction sets and combine them with native samples in a 1:1 ratio for pilot fine-tuning.

Use a domain-specialized LLM (if available) to pre-screen synthetic/translated examples before training to reduce noise.

Optimization Features

Infra Optimization

  • Used Qwen-4B to balance capacity and compute; experiments run on HPC resources

Model Optimization

  • DoRA adapters applied to attention and FFN projections

Training Optimization

  • SFT
  • 10 epochs, batch size 12, LR=1e-4, cosine schedule
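To make the training budget concrete, the following sketch derives the number of optimizer steps and a cosine learning-rate curve from the reported hyperparameters. Several things are assumptions not confirmed by the summary: that 12 is the effective global batch size, that the last partial batch of each epoch is kept, that there is no warmup, and that the schedule decays to zero.

```python
import math

NUM_EXAMPLES = 33_493  # examples per configuration (from the summary)
BATCH_SIZE = 12        # assumed to be the effective global batch size
EPOCHS = 10
PEAK_LR = 1e-4


def total_steps(n=NUM_EXAMPLES, batch=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, keeping the final partial batch of each epoch."""
    return math.ceil(n / batch) * epochs


def cosine_lr(step, total, peak=PEAK_LR):
    """Cosine decay from peak to zero, with no warmup (assumption)."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps = total_steps()
    for s in (0, steps // 2, steps):
        print(s, cosine_lr(s, steps))
```

Under these assumptions each configuration trains for 2,792 steps per epoch, or 27,920 steps in total, with the learning rate at half its peak at the midpoint.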

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use a single backbone (Qwen-4B-Instruct); results may differ at other scales.
  • Synthetic examples were not filtered by quality score before fine-tuning, possibly leaving noisy pairs.
  • Translated data dominate dataset size, which may hide content/topic mismatches despite good automatic translation scores.

When Not To Use

  • If you have access to large-scale native expert-labeled French medical pairs, prefer native-only supervision.
  • When clinical safety requires strictly human-verified answers, since synthetic and translated data can introduce factual drift.
  • For deployment without additional medical validation: models trained on mixed noisy data still need expert review.

Failure Modes

  • Models may inherit factual errors from synthetic generations or translation artifacts.
  • Evaluation metrics can be biased by response length (verbosity) or answer ordering (positional priors).
  • Findings might not transfer to much larger LLMs or different pretraining distributions.

Core Entities

Models

  • Qwen-4B-Instruct
  • GPT-4o
  • GPT-4o-mini
  • GPT-4.1-mini
  • Gemini 2.0 Flash
  • MedGemma-27B
  • HuatuoGPT-o1-72B
  • Qwen3-Next-80B-A3B-Instruct

Metrics

  • Exact Match (EM)
  • Hamming score
  • BLEU
  • ROUGE-2
  • METEOR
  • BERTScore_F1
  • Pearson r

Datasets

  • MedInjection-FR
  • FrenchMedMCQA
  • MediQAl
  • FrBMedQA
  • DEFT-2021
  • DIAMED
  • MORFITT
  • MedQA
  • PubMedQA
  • MedMCQA
  • MMLU
  • MedXpertQA

Benchmarks

  • WMT 2024 Biomedical Translation
  • MCQ/MCQU aggregated MCQA evaluation
  • Open-ended QA (OEQ) evaluation

Context Entities

Models

  • GPT-5
  • GPT-4.1-mini
  • GPT-4o-mini
  • Gemini 2.0 Flash

Metrics

  • Spearman's ρ
  • Kendall's τ

Datasets

  • WMT biomedical corpora
  • S-Editions
  • medical-cases-fr (mlabonne)

Benchmarks

  • WMT 2024 biomedical translation task