MedInjection-FR: 571K French biomedical instruction pairs show native data helps most; mixed sources add value

March 6, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides controlled experiments and multiple evaluation modalities; findings are robust for Qwen-4B but limited to one backbone and specific datasets.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

Why It Matters For Business

When native-domain instruction data are scarce, mixing translated and synthetic examples with limited native supervision can deliver competitive model performance while lowering data-collection costs.

Summary TLDR

The authors release MedInjection-FR, a 571K-pair French biomedical instruction dataset made from native (77K), synthetic (76K), and translated (417K) sources. They fine-tune Qwen-4B-Instruct under seven controlled mixes (33,493 examples per setup) and find: native data gives the biggest single-source gains; mixing native with translated or synthetic data often improves results further; synthetic-only performs worst but helps when balanced with native. Evaluation combines automatic metrics, LLM-as-a-judge, and human expert checks and flags evaluation caveats like verbosity and positional bias.

Problem Statement

High-quality French biomedical instruction pairs are scarce. Can translated or LLM-generated synthetic data substitute for native French supervision when tuning LLMs for medical tasks?

Main Contribution

Release MedInjection-FR: 571,436 French biomedical instruction-response pairs that mix native, synthetic, and translated sources.

Design a controlled ablation framework (seven equal-size configurations) to isolate the effect of data provenance on instruction tuning.
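One way to picture the controlled setup is as every non-empty combination of the three sources, each sampled down to the same budget. The sketch below assumes that reading (three single-source, three pairwise, and one all-source mix, which gives seven) and an even split of the budget across the sources in a mix. The paper's actual sampling procedure is not described in this summary. The function names, the toy pools, and the labels NAT-SYN and TRAD-SYN are hypothetical; NAT, TRAD, SYN, NAT-TRAD, and ALL appear in the reported results.

```python
import random
from itertools import combinations

BUDGET = 33_493                    # examples per configuration, as reported
SOURCES = ("NAT", "TRAD", "SYN")   # native, translated, synthetic


def config_names(sources=SOURCES):
    """Enumerate every non-empty combination of sources (7 for 3 sources)."""
    names = []
    for k in range(1, len(sources) + 1):
        for combo in combinations(sources, k):
            names.append("ALL" if k == len(sources) else "-".join(combo))
    return names


def split_budget(budget, n_parts):
    """Split the budget as evenly as possible; earlier parts take the remainder."""
    base, extra = divmod(budget, n_parts)
    return [base + (1 if i < extra else 0) for i in range(n_parts)]


def build_mixes(pools, budget=BUDGET, seed=0):
    """Sample one equal-size training set per configuration.

    `pools` maps a source name to a list of examples. The even per-source
    split is an assumption of this sketch, not a detail from the paper.
    """
    rng = random.Random(seed)
    mixes = {}
    for k in range(1, len(SOURCES) + 1):
        for combo in combinations(SOURCES, k):
            name = "ALL" if k == len(SOURCES) else "-".join(combo)
            sample = []
            for src, n in zip(combo, split_budget(budget, len(combo))):
                sample.extend(rng.sample(pools[src], n))  # without replacement
            rng.shuffle(sample)
            mixes[name] = sample
    return mixes
```

Holding the budget fixed across configurations is what lets differences in downstream scores be attributed to data provenance rather than to training-set size.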

Key Findings

MedInjection-FR totals 571,436 instruction-response pairs composed of native, synthetic, and translated sources.

Numbers: Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506

Practical Use: You can fine-tune French biomedical LLMs at scale without collecting all native data locally; mixing datasets is feasible.

Evidence Ref: Section 3.1, Table 1
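The composition figures are easier to compare as shares. The short calculation below uses only the counts quoted above. It shows that translated data makes up roughly 73% of the pairs, and that the three per-source counts add up to 571,427, nine fewer than the stated total of 571,436; this summary does not explain the difference.

```python
# Per-source counts and stated total, as quoted in the finding above.
counts = {"Translated": 417_674, "Native": 77_247, "Synthetic": 76_506}
stated_total = 571_436

component_sum = sum(counts.values())
shares = {name: n / component_sum for name, n in counts.items()}
gap = stated_total - component_sum  # difference between stated total and the sum

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Sum of components: {component_sum:,} (stated total differs by {gap})")
```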

Native-only fine-tuning gives the best single-source MCQA performance.

Numbers: Aggregated constrained EM: NAT=40.59 vs Base=37.31 and SYN=29.73

Practical Use: Prioritize collecting or curating native French medical instructions when possible to get the largest single-source gains.

Evidence Ref: Section 5.1, Table 5
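For readers unfamiliar with the MCQA metrics, here is a minimal sketch of Exact Match and a Hamming score over sets of selected answer letters. The definitions are assumptions, since this summary does not spell them out: EM counts a prediction as correct only when the predicted set equals the gold set, and the Hamming score is taken as the size of the intersection divided by the size of the union, a common multi-label convention. The paper's implementation may differ.

```python
def exact_match(pred, gold):
    """1.0 if the predicted answer set equals the gold set, else 0.0."""
    return float(set(pred) == set(gold))


def hamming_score(pred, gold):
    """Overlap between predicted and gold sets: |intersection| / |union|."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both empty: treat as full agreement
    return len(pred & gold) / len(pred | gold)


def aggregate(pairs, metric):
    """Mean of `metric` over (pred, gold) pairs, reported on a 0-100 scale."""
    return 100.0 * sum(metric(p, g) for p, g in pairs) / len(pairs)
```

On single-answer questions the two metrics coincide; they diverge on multi-answer questions, where the Hamming score gives partial credit that EM does not.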

Results

Metric: MedInjection-FR size by source
Value: Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506
Split / Dataset: Table 1
Evidence: Section 3.1, Table 1
Evidence Ref: Table 1

Metric: Aggregated MCQA Exact Match (constrained decoding)
Value: NAT=40.59; TRAD=36.44; SYN=29.73; NAT-TRAD=41.37; ALL=40.97; Base=37.31
Baseline: Qwen-4B base=37.31
Delta: NAT vs Base: +3.28; NAT-TRAD vs Base: +4.06
Split / Dataset: Aggregated MCQ+MCQU (Table 5)
Evidence: Section 5.1, Table 5
Evidence Ref: Table 5
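Only two deltas are listed explicitly, but the remaining ones follow directly from the reported scores. The snippet below recomputes every configuration's difference from the base model using the values quoted above; it is plain arithmetic on this summary's numbers, not a re-evaluation.

```python
# Aggregated constrained-decoding EM, as quoted in the results above.
scores = {"NAT": 40.59, "TRAD": 36.44, "SYN": 29.73, "NAT-TRAD": 41.37, "ALL": 40.97}
BASE = 37.31  # Qwen-4B base model


def deltas_vs_base(scores, base):
    """Difference of each configuration from the base score, in EM points."""
    return {name: round(value - base, 2) for name, value in scores.items()}


deltas = deltas_vs_base(scores, BASE)
for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:9s} {d:+.2f}")
```

Read this way, translated-only (-0.87) and synthetic-only (-7.58) both fall below the base model, while every configuration that includes native data lands above it.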

What To Try In 7 Days

Start a small native French dataset collection (5k–10k expert pairs) as an anchor for further tuning.

Translate reliable English biomedical instruction sets and combine them with native samples in a 1:1 ratio for pilot fine-tuning.

Use a domain-specialized LLM (if available) to pre-screen synthetic/translated examples before training to reduce noise.

Optimization Features

Infra Optimization
Used Qwen-4B to balance capacity and compute; experiments run on HPC resources
Model Optimization
DoRA adapters applied to attention and FFN projections
Training Optimization
SFT: 10 epochs, batch size 12, LR=1e-4, cosine schedule
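To get a feel for the training budget these hyperparameters imply, the sketch below derives the optimizer step count and a cosine learning-rate curve. It assumes that batch size 12 is the effective global batch (no gradient accumulation), that the last partial batch of each epoch is kept, and that the schedule has no warmup and decays to zero; none of these details are stated in this summary.

```python
import math

NUM_EXAMPLES = 33_493   # per configuration, from the summary
BATCH_SIZE = 12         # assumed to be the effective global batch size
EPOCHS = 10
PEAK_LR = 1e-4

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)  # keeps the last partial batch
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step, total=total_steps, peak=PEAK_LR, floor=0.0):
    """Cosine decay from `peak` to `floor` with no warmup (an assumption)."""
    progress = min(max(step / total, 0.0), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


print(f"{steps_per_epoch} steps/epoch, {total_steps} steps total")
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    step = int(frac * total_steps)
    print(f"step {step:6d}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions each configuration trains for 2,792 steps per epoch and 27,920 steps in total.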

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use a single backbone (Qwen-4B-Instruct); results may differ at other scales.

Synthetic examples were not filtered by quality score before fine-tuning, possibly leaving noisy pairs.

When Not To Use

If you have access to large-scale, expert-labeled native French medical pairs, prefer native-only supervision.

When clinical safety requires strictly human-verified answers, because synthetic and translated data can introduce factual drift.

Failure Modes

Models may inherit factual errors from synthetic generations or translation artifacts.

Evaluation metrics can be biased by response length (verbosity) or answer ordering (positional priors).
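A simple way to probe for positional priors is to shuffle the answer options and check whether the model still selects the same option texts. The sketch below illustrates that idea only; it is not the paper's protocol. `predict` is a hypothetical callable that takes a list of option strings and returns the indices it selects, and option texts are assumed to be unique within a question.

```python
import random


def permute_options(options, gold_indices, rng):
    """Shuffle answer options and remap gold indices to their new positions."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_gold = sorted(order.index(i) for i in gold_indices)
    return shuffled, new_gold


def position_consistency(predict, options, trials=5, seed=0):
    """Fraction of shuffled trials where `predict` picks the same option texts.

    `predict` is a hypothetical model wrapper: list[str] -> list[int].
    A score well below 1.0 suggests the choice depends on option position.
    """
    rng = random.Random(seed)
    reference = {options[i] for i in predict(options)}
    stable = 0
    for _ in range(trials):
        shuffled, _ = permute_options(options, [], rng)
        picked = {shuffled[i] for i in predict(shuffled)}
        stable += picked == reference
    return stable / trials
```

A predictor that always returns index 0 would tend to score low here, while one that selects by content scores 1.0 regardless of ordering.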

Core Entities

Models

Qwen-4B-Instruct, GPT-4o, GPT-4o-mini, GPT-4.1-mini, Gemini 2.0 Flash, MedGemma-27B, HuatuoGPT-o1-72B, Qwen3-Next-80B-A3B-Instruct

Metrics

Exact Match (EM), Hamming score, BLEU, ROUGE-2, METEOR, BERTScore_F1, Pearson r

Datasets

MedInjection-FR, FrenchMedMCQA, MediQAl, FrBMedQA, DEFT-2021, DIAMED, MORFITT, MedQA, PubMedQA, MedMCQA, MMLU, MedXpertQA

Benchmarks

WMT 2024 Biomedical Translation, MCQ/MCQU aggregated MCQA evaluation, Open-ended QA (OEQ) evaluation

Context Entities

Models

GPT-5, GPT-4.1-mini, GPT-4o-mini, Gemini 2.0 Flash

Metrics

Spearman's ρ, Kendall's τ

Datasets

WMT biomedical corpora, S-Editions, medical-cases-fr (mlabonne)

Benchmarks

WMT 2024 biomedical translation task