MedInjection-FR: 571K French biomedical instruction pairs show native data helps most; mixed sources add value

Overview

Decision Snapshot: Ready For Pilot

The paper provides controlled experiments and multiple evaluation modalities; findings are robust for Qwen-4B but limited to one backbone and specific datasets.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour

Links

Abstract / PDF / Code / Data

Why It Matters For Business

When native-domain instruction data are scarce, mixing translated and synthetic examples with limited native supervision can deliver competitive model performance while lowering data-collection costs.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

The authors release MedInjection-FR, a 571K-pair French biomedical instruction dataset made from native (77K), synthetic (76K), and translated (417K) sources. They fine-tune Qwen-4B-Instruct under seven controlled mixes (33,493 examples per setup) and find: native data gives the biggest single-source gains; mixing native with translated or synthetic data often improves results further; synthetic-only performs worst but helps when balanced with native. Evaluation combines automatic metrics, LLM-as-a-judge, and human expert checks and flags evaluation caveats like verbosity and positional bias.

Problem Statement

High-quality French biomedical instruction pairs are scarce. Can translated or LLM-generated synthetic data substitute for native French supervision when tuning LLMs for medical tasks?

Main Contribution

Release MedInjection-FR: 571,436 French biomedical instruction-response pairs that mix native, synthetic, and translated sources.

Design a controlled ablation framework (seven equal-size configurations) to isolate the effect of data provenance on instruction tuning.
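A minimal sketch of how seven equal-size configurations could be built from three sources. It assumes the seven setups are the non-empty subsets of {NAT, TRAD, SYN} and that each mix splits its budget evenly across its sources; the summary only confirms the NAT, TRAD, SYN, NAT-TRAD, and ALL names and the 33,493-example budget, so the other names, the even split, and the helper functions here are hypothetical.

```python
import random
from itertools import combinations

SOURCES = ("NAT", "TRAD", "SYN")  # native, translated, synthetic
BUDGET = 33_493  # examples per configuration, as reported in the summary


def configurations(sources=SOURCES):
    """All non-empty source subsets: 3 single, 3 pairwise, 1 full = 7."""
    return [
        combo
        for k in range(1, len(sources) + 1)
        for combo in combinations(sources, k)
    ]


def sample_mix(pools, combo, budget=BUDGET, seed=0):
    """Draw `budget` examples, split as evenly as possible across `combo`.

    The even split is an assumption; the paper may weight sources differently.
    Each pool must hold at least as many examples as its quota.
    """
    rng = random.Random(seed)
    base, extra = divmod(budget, len(combo))
    mix = []
    for i, name in enumerate(combo):
        quota = base + (1 if i < extra else 0)  # spread the remainder
        mix.extend(rng.sample(pools[name], quota))
    rng.shuffle(mix)
    return mix
```

Holding the budget fixed across all seven mixes is what lets score differences be attributed to provenance rather than to training-set size.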

Key Findings

MedInjection-FR totals 571,436 instruction-response pairs composed of native, synthetic, and translated sources.

Numbers: Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506

Practical Use: You can fine-tune French biomedical LLMs at scale without collecting all native data locally; mixing datasets is feasible.

Evidence Ref: Section 3.1, Table 1
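A quick arithmetic check of the source composition, using only the counts listed in this finding. As listed, the three per-source counts add up to 571,427, nine fewer than the stated total of 571,436; the sketch computes shares against the component sum and reports the gap rather than guessing which figure is off.

```python
counts = {"translated": 417_674, "native": 77_247, "synthetic": 76_506}
stated_total = 571_436  # total as reported in the summary

component_sum = sum(counts.values())
# Percentage share of each source, relative to the sum of the listed counts.
shares = {name: round(100 * n / component_sum, 2) for name, n in counts.items()}
# Difference between the reported total and the sum of the listed counts.
gap = stated_total - component_sum

for name, pct in shares.items():
    print(f"{name:>10}: {pct:.2f}%")
print(f"stated total minus component sum: {gap}")
```

Translated data makes up roughly three quarters of the release, which is why the equal-size ablation matters: comparing raw sources without a fixed budget would mostly measure quantity.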

Native-only fine-tuning gives the best single-source MCQA performance.

Numbers: Aggregated constrained EM: NAT=40.59 vs Base=37.31 and SYN=29.73

Practical Use: Prioritize collecting or curating native French medical instructions when possible to get the largest single-source gains.

Evidence Ref: Section 5.1, Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
MedInjection-FR size by source	Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506	—	—	Table 1	Section 3.1, Table 1	Table 1
Aggregated MCQA Exact Match (constrained decoding)	NAT=40.59; TRAD=36.44; SYN=29.73; NAT-TRAD=41.37; ALL=40.97; Base=37.31	Qwen-4B base=37.31	NAT vs Base: +3.28; NAT-TRAD vs Base: +4.06	Aggregated MCQ+MCQU (Table 5)	Section 5.1, Table 5	Table 5
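The deltas in the table can be reproduced from the reported scores. The sketch below uses only the aggregated Exact Match values listed above and also derives the deltas the table leaves out (TRAD, SYN, ALL); the variable names are illustrative.

```python
BASE = 37.31  # Qwen-4B base, aggregated MCQA Exact Match

scores = {
    "NAT": 40.59,
    "TRAD": 36.44,
    "SYN": 29.73,
    "NAT-TRAD": 41.37,
    "ALL": 40.97,
}

# Delta of each configuration against the base model, in EM points.
deltas = {name: round(em - BASE, 2) for name, em in scores.items()}
# Configurations ordered from largest gain to largest loss.
ranking = sorted(deltas, key=deltas.get, reverse=True)

for name in ranking:
    print(f"{name:>8}: {deltas[name]:+.2f}")
```

On these numbers, translated-only and synthetic-only tuning both land below the base model, while every configuration that includes native data lands above it.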

What To Try In 7 Days

Start a small native French dataset collection (5k–10k expert pairs) as an anchor for further tuning.

Translate reliable English biomedical instruction sets and combine them with native samples in a 1:1 ratio for pilot fine-tuning.

Use a domain-specialized LLM (if available) to pre-screen synthetic/translated examples before training to reduce noise.
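The pilot steps above can be sketched as a small mixing routine. This is an illustration under stated assumptions, not the authors' pipeline: `build_pilot_mix` and the `keep` screening hook are hypothetical, examples are assumed to be dicts, and the screen stands in for whatever quality check (for instance a score from a domain-specialized LLM) you choose to plug in.

```python
import random


def build_pilot_mix(native, translated, keep=lambda ex: True, seed=0):
    """Return a shuffled 1:1 mix of screened native and translated examples.

    `keep` is a hypothetical pre-screening hook: it receives one example and
    returns True if the example should be allowed into training.
    """
    rng = random.Random(seed)
    native_ok = [ex for ex in native if keep(ex)]
    translated_ok = [ex for ex in translated if keep(ex)]
    # The smaller pool sets the size so the ratio stays exactly 1:1.
    n = min(len(native_ok), len(translated_ok))
    mix = rng.sample(native_ok, n) + rng.sample(translated_ok, n)
    rng.shuffle(mix)
    return mix


# Usage: drop examples with empty responses before mixing.
native = [{"source": "native", "response": "Réponse."} for _ in range(5)]
translated = [{"source": "translated", "response": "Réponse."} for _ in range(20)]
pilot = build_pilot_mix(native, translated, keep=lambda ex: bool(ex["response"].strip()))
```

Letting the native pool cap the mix size keeps the scarce native examples fully used while preventing translated data from dominating the pilot.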

Optimization Features

Infra Optimization

Used Qwen-4B to balance capacity and compute; experiments run on HPC resources

Model Optimization

DoRA adapters applied to attention and FFN projections

Training Optimization

SFT: 10 epochs, batch size 12, LR=1e-4, cosine schedule
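To make the training budget concrete, here is a rough sketch of the step count and learning-rate curve implied by the reported settings. It assumes an effective batch of 12 with no gradient accumulation, no warmup, and cosine decay to zero, none of which the summary states; `RunConfig` and `cosine_lr` are illustrative names. DoRA itself is not modeled here (in the Hugging Face PEFT library it is enabled with `use_dora=True` on `LoraConfig`).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    epochs: int = 10              # reported
    batch_size: int = 12          # reported; assumed to be the effective batch
    peak_lr: float = 1e-4         # reported
    train_examples: int = 33_493  # per-configuration budget from the summary

    @property
    def steps_per_epoch(self) -> int:
        # The last, partially filled batch still counts as one step.
        return math.ceil(self.train_examples / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


def cosine_lr(step: int, cfg: RunConfig) -> float:
    """Cosine decay from peak_lr to 0 over the run (no warmup assumed)."""
    progress = min(step, cfg.total_steps) / cfg.total_steps
    return 0.5 * cfg.peak_lr * (1 + math.cos(math.pi * progress))


cfg = RunConfig()
print(cfg.steps_per_epoch, cfg.total_steps, cosine_lr(cfg.total_steps // 2, cfg))
```

Under these assumptions each configuration trains for 27,920 optimizer steps, so all seven mixes see the same compute as well as the same number of examples.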

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/ikram28/MedInjection-FR

Data URLs

https://github.com/ikram28/MedInjection-FR

Risks & Boundaries

Limitations

Experiments use a single backbone (Qwen-4B-Instruct); results may differ at other scales.

Synthetic examples were not filtered by quality score before fine-tuning, possibly leaving noisy pairs.

When Not To Use

If you have access to large-scale native expert-labeled French medical pairs: prefer native-only supervision.

When clinical safety requires strict human-verified answers: synthetic/translated data can introduce factual drift.

Failure Modes

Models may inherit factual errors from synthetic generations or translation artifacts.

Evaluation metrics can be biased by response length (verbosity) or answer ordering (positional priors).
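One simple way to probe the verbosity failure mode is to correlate response length with the score a judge assigns. The sketch below is a generic diagnostic, not the paper's protocol: it covers verbosity only (not positional bias), measures length in whitespace-separated tokens, and uses hypothetical function names.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def verbosity_correlation(responses, judge_scores):
    """Correlate response length (in whitespace tokens) with judge scores.

    A strongly positive value suggests the judge may be rewarding length
    rather than content; it is a warning sign, not proof of bias.
    """
    lengths = [len(r.split()) for r in responses]
    return pearson_r(lengths, judge_scores)
```

If the correlation is high, comparing it against the same correlation computed with human expert scores helps separate genuine quality from length preference.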

Core Entities

Models

Qwen-4B-Instruct, GPT-4o, GPT-4o-mini, GPT-4.1-mini, Gemini 2.0 Flash, MedGemma-27B, HuatuoGPT-o1-72B, Qwen3-Next-80B-A3B-Instruct

Metrics

Exact Match (EM), Hamming score, BLEU, ROUGE-2, METEOR, BERTScore_F1, Pearson r
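For the multiple-choice metrics, a minimal sketch of Exact Match and Hamming score on multi-answer questions. It assumes the Hamming score is the intersection-over-union of predicted and gold option sets, a common definition for multi-answer MCQ; the paper's exact formula is not given in this summary, and the function names are illustrative.

```python
def exact_match(pred, gold):
    """1.0 if the predicted option set equals the gold set, else 0.0."""
    return float(set(pred) == set(gold))


def hamming_score(pred, gold):
    """Overlap between predicted and gold option sets (|P & G| / |P | G|).

    Gives partial credit when only some of the correct options are selected.
    """
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both empty: treated as a perfect match
    return len(pred & gold) / len(pred | gold)


# Usage: one correct option out of two, plus one wrong option.
print(exact_match(["A", "B"], ["A", "C"]), hamming_score(["A", "B"], ["A", "C"]))
```

Exact Match is all-or-nothing, so reporting it alongside a partial-credit score shows whether a model is close on multi-answer questions or missing entirely.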

Datasets

MedInjection-FR, FrenchMedMCQA, MediQAl, FrBMedQA, DEFT-2021, DIAMED, MORFITT, MedQA, PubMedQA, MedMCQA, MMLU, MedXpertQA

Benchmarks

WMT 2024 Biomedical Translation, MCQ/MCQU aggregated MCQA evaluation, Open-ended QA (OEQ) evaluation

Context Entities

Models

GPT-5, GPT-4.1-mini, GPT-4o-mini, Gemini 2.0 Flash

Metrics

Spearman's ρ, Kendall's τ

Datasets

WMT biomedical corpora, S-Editions, medical-cases-fr (mlabonne)

Benchmarks

WMT 2024 biomedical translation task
