Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
When native-domain instruction data are scarce, mixing translated and synthetic examples with limited native supervision can deliver competitive model performance while lowering data-collection costs.
Summary TLDR
The authors release MedInjection-FR, a 571K-pair French biomedical instruction dataset built from native (77K), synthetic (76K), and translated (417K) sources. They fine-tune Qwen-4B-Instruct under seven controlled mixes (33,493 examples per setup) and find that native data gives the biggest single-source gains, that mixing native with translated or synthetic data often improves results further, and that synthetic-only training performs worst but helps when balanced with native data. Evaluation combines automatic metrics, LLM-as-a-judge scoring, and human expert checks, and it flags evaluation caveats such as verbosity bias and position bias.
Problem Statement
High-quality French biomedical instruction pairs are scarce. Can translated or LLM-generated synthetic data substitute for native French supervision when tuning LLMs for medical tasks?
Main Contribution
Release MedInjection-FR: 571,436 French biomedical instruction-response pairs that mix native, synthetic, and translated sources.
Design a controlled ablation framework (seven equal-size configurations) to isolate the effect of data provenance on instruction tuning.
Empirically show native data best anchors performance; mixed datasets (especially native+translated) give complementary gains; synthetic-only is weakest but useful combined with native.
Evaluate open-ended QA (OEQ) with automatic metrics, LLM-as-a-judge scoring, and human expert calibration, and expose evaluation blind spots (verbosity bias, position bias).
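The controlled ablation can be pictured with a short sketch. The summary only says there are seven equal-size configurations of 33,493 examples; reading them as the seven non-empty subsets of {native, synthetic, translated}, and splitting each budget evenly across the sources in a subset, are assumptions made here for illustration. The names `source_configurations` and `build_mix` are hypothetical.

```python
import random
from itertools import combinations

SOURCES = ("native", "synthetic", "translated")
BUDGET = 33_493  # examples per configuration, as reported in the summary


def source_configurations(sources=SOURCES):
    """Every non-empty subset of the sources: 3 singles, 3 pairs, 1 triple."""
    return [combo
            for k in range(1, len(sources) + 1)
            for combo in combinations(sources, k)]


def build_mix(pools, combo, budget=BUDGET, seed=0):
    """Draw `budget` examples, split as evenly as possible across `combo`.

    The even split is an assumption; the summary does not state the ratios.
    `pools` maps a source name to a list of examples.
    """
    rng = random.Random(seed)
    base, extra = divmod(budget, len(combo))
    mix = []
    for i, name in enumerate(combo):
        quota = base + (1 if i < extra else 0)  # spread the remainder
        mix.extend(rng.sample(pools[name], quota))  # sample without replacement
    rng.shuffle(mix)
    return mix
```

Keeping the budget fixed across configurations is what lets differences in downstream scores be attributed to provenance rather than to training-set size.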
Key Findings
MedInjection-FR totals 571,436 instruction-response pairs composed of native, synthetic, and translated sources.
Native-only fine-tuning gives the best single-source MCQA performance.
Combining native with translated examples often yields the highest overall MCQA accuracy.
Synthetic-only supervision underperforms but adds value when balanced with native data.
LLM judges correlate with human expert ratings better than lexical metrics; a domain-tuned LLM performed best as judge.
Evaluation artifacts such as verbosity bias and position bias affect absolute scores but not comparative trends.
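Judge-to-human agreement of the kind described above is reported as a Pearson correlation. A minimal sketch of that computation follows; the function name `pearson_r` and the idea of pairing one judge score with one expert score per answer are illustrative assumptions, not the authors' evaluation code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

The same function can be applied to a lexical metric (for example BLEU per answer) against the expert ratings, which gives a like-for-like comparison with the LLM judge.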
Results
MedInjection-FR size by source
Aggregated MCQA Exact Match (constrained decoding)
Accuracy
LLM judge vs human Pearson correlation
Position bias effect on constrained EM (base)
Who Should Care
What To Try In 7 Days
Start collecting a small native French dataset (5k–10k expert pairs) to serve as an anchor for further tuning.
Translate reliable English biomedical instruction sets and combine them with native samples in a 1:1 ratio for pilot fine-tuning.
Use a domain-specialized LLM (if available) to pre-screen synthetic/translated examples before training to reduce noise.
Optimization Features
Infra Optimization
- Used Qwen-4B to balance capacity and compute; experiments ran on HPC resources
Model Optimization
- DoRA adapters applied to attention and FFN projections
Training Optimization
- SFT
- 10 epochs, batch size 12, LR=1e-4, cosine schedule
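To give a sense of scale for these hyperparameters, the sketch below counts optimizer steps and evaluates a cosine learning-rate curve. It assumes no warmup, no gradient accumulation, a single device, and decay to zero, none of which the summary states; `total_steps` and `cosine_lr` are hypothetical helper names.

```python
import math

N_EXAMPLES = 33_493  # examples per configuration (from the summary)
BATCH_SIZE = 12
EPOCHS = 10
PEAK_LR = 1e-4


def total_steps(n_examples=N_EXAMPLES, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, assuming one step per batch and a partial last batch."""
    return math.ceil(n_examples / batch_size) * epochs


def cosine_lr(step, total, peak=PEAK_LR):
    """Cosine decay from `peak` at step 0 to zero at `total` (no warmup assumed)."""
    progress = min(max(step / total, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions each configuration trains for 2,792 steps per epoch, or 27,920 steps in total.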
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use a single backbone (Qwen-4B-Instruct); results may differ at other scales.
- Synthetic examples were not filtered by quality score before fine-tuning, possibly leaving noisy pairs.
- Translated data dominate the dataset (about 417K of the 571K pairs), which may hide content or topic mismatches despite good automatic translation scores.
When Not To Use
- If you have access to large-scale, expert-labeled native French medical pairs, prefer native-only supervision.
- When clinical safety requires strict human-verified answers: synthetic/translated data can introduce factual drift.
- For deployment without additional medical validation: models trained on mixed noisy data still need expert review.
Failure Modes
- Models may inherit factual errors from synthetic generations or translation artifacts.
- Evaluation metrics can be biased by response length (verbosity) or answer ordering (positional priors).
- Findings might not transfer to much larger LLMs or different pretraining distributions.
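One common way to probe the positional prior mentioned above is to permute the answer options of each multiple-choice question, remap the gold labels, and re-score the model. The summary does not describe how the authors measured position bias, so the sketch below is an illustrative assumption, and `shuffle_options` is a hypothetical helper.

```python
import random
import string


def shuffle_options(options, gold_letters, seed=0):
    """Permute answer options and remap the gold letters to the new order.

    `options` is the list of answer texts in original order (A, B, C, ...);
    `gold_letters` is the set of correct letters in that original order.
    Supports up to 26 options.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)  # order[new_pos] == old_pos
    letters = string.ascii_uppercase
    new_options = [options[old_pos] for old_pos in order]
    new_gold = {letters[new_pos]
                for new_pos, old_pos in enumerate(order)
                if letters[old_pos] in gold_letters}
    return new_options, new_gold
```

If accuracy shifts noticeably between the original and shuffled orderings, part of the score reflects answer position rather than content.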
Core Entities
Models
- Qwen-4B-Instruct
- GPT-4o
- GPT-4o-mini
- GPT-4.1-mini
- Gemini 2.0 Flash
- MedGemma-27B
- HuatuoGPT-o1-72B
- Qwen3-Next-80B-A3B-Instruct
Metrics
- Exact Match (EM)
- Hamming score
- BLEU
- ROUGE-2
- METEOR
- BERTScore_F1
- Pearson r
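For the multiple-choice metrics, a minimal sketch follows. Exact Match is taken as all-or-nothing agreement on the selected option set. The Hamming score is assumed here to be the usual multi-label definition (size of the intersection over size of the union of predicted and gold options); the summary does not define it, so treat that choice as an assumption.

```python
def exact_match(pred, gold):
    """1.0 if the predicted option set equals the gold set, else 0.0."""
    return float(set(pred) == set(gold))


def hamming_score(pred, gold):
    """Intersection over union of predicted and gold option sets.

    Assumed definition; gives partial credit on multi-answer questions.
    """
    pred, gold = set(pred), set(gold)
    union = pred | gold
    if not union:
        return 1.0  # both empty: treated as full agreement
    return len(pred & gold) / len(union)
```

For a question with gold answers A and C, predicting A and B scores 0 on Exact Match but 1/3 on the Hamming score, which is why the two metrics can rank systems differently on multi-answer items.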
Datasets
- MedInjection-FR
- FrenchMedMCQA
- MediQAl
- FrBMedQA
- DEFT-2021
- DIAMED
- MORFITT
- MedQA
- PubMedQA
- MedMCQA
- MMLU
- MedXpertQA
Benchmarks
- WMT 2024 Biomedical Translation
- MCQ/MCQU aggregated MCQA evaluation
- Open-ended QA (OEQ) evaluation
Context Entities
Models
- GPT-5
- GPT-4.1-mini
- GPT-4o-mini
- Gemini 2.0 Flash
Metrics
- Spearman's ρ
- Kendall's τ
Datasets
- WMT biomedical corpora
- S-Editions
- medical-cases-fr (mlabonne)
Benchmarks
- WMT 2024 biomedical translation task

