Overview
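The paper reports controlled experiments with multiple evaluation modalities (automatic metrics, LLM-as-a-judge, and human expert checks). Findings are consistent for Qwen-4B-Instruct but are limited to that single backbone and to the specific datasets tested.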
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 40%
Why It Matters For Business
When native-domain instruction data are scarce, mixing translated and synthetic examples with limited native supervision can deliver competitive model performance while lowering data-collection costs.
Summary TLDR
The authors release MedInjection-FR, a 571K-pair French biomedical instruction dataset built from native (77,247), synthetic (76,506), and translated (417,674) sources. They fine-tune Qwen-4B-Instruct under seven controlled mixes (33,493 examples per setup) and find that native data gives the biggest single-source gains, that mixing native with translated or synthetic data often improves results further, and that synthetic-only data performs worst but helps when balanced with native data. Evaluation combines automatic metrics, LLM-as-a-judge, and human expert checks, and flags evaluation caveats such as verbosity and positional bias.
Problem Statement
High-quality French biomedical instruction pairs are scarce. Can translated or LLM-generated synthetic data substitute for native French supervision when tuning LLMs for medical tasks?
Main Contribution
Release MedInjection-FR: 571,436 French biomedical instruction-response pairs that mix native, synthetic, and translated sources.
Design a controlled ablation framework (seven equal-size configurations) to isolate the effect of data provenance on instruction tuning.
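The equal-size design can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the seven setups are the non-empty combinations of the three sources (the summary names NAT, TRAD, SYN, NAT-TRAD, and ALL; the NAT-SYN and TRAD-SYN labels are assumed), that TRAD denotes the translated source, and that each mix splits the 33,493-example budget as evenly as possible across its sources. The function `build_mixes` and the toy pools are hypothetical.

```python
import random
from itertools import combinations

BUDGET = 33_493  # examples per configuration, as reported in the summary
SOURCES = ("NAT", "TRAD", "SYN")  # read here as native, translated, synthetic


def build_mixes(pools, budget=BUDGET, seed=0):
    """Return {config_name: sampled examples}, one equal-size mix per source subset."""
    rng = random.Random(seed)
    mixes = {}
    for k in range(1, len(SOURCES) + 1):
        for combo in combinations(SOURCES, k):
            name = "ALL" if k == len(SOURCES) else "-".join(combo)
            # Assumed policy: split the budget evenly, spreading any remainder
            # over the first sources in the combination.
            base, extra = divmod(budget, k)
            sample = []
            for i, src in enumerate(combo):
                n = base + (1 if i < extra else 0)
                sample.extend(rng.sample(pools[src], n))
            rng.shuffle(sample)
            mixes[name] = sample
    return mixes


# Toy pools standing in for the real native, translated, and synthetic data.
pools = {src: [f"{src}-{i}" for i in range(40_000)] for src in SOURCES}
mixes = build_mixes(pools)
```

Holding the total fixed across configurations means any score difference can be attributed to data provenance rather than to training-set size.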
Key Findings
MedInjection-FR totals 571,436 instruction-response pairs composed of native, synthetic, and translated sources.
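Native-only fine-tuning gives the best single-source MCQA performance (aggregated Exact Match of 40.59, versus 36.44 for translated-only, 29.73 for synthetic-only, and 37.31 for the base model).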
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| MedInjection-FR size by source | Total=571,436; Translated=417,674; Native=77,247; Synthetic=76,506 | — | — | Table 1 | Section 3.1, Table 1 | Table 1 |
| Aggregated MCQA Exact Match (constrained decoding) | NAT=40.59; TRAD=36.44; SYN=29.73; NAT-TRAD=41.37; ALL=40.97; Base=37.31 | Qwen-4B base=37.31 | NAT vs Base: +3.28; NAT-TRAD vs Base: +4.06 | Aggregated MCQ+MCQU (Table 5) | Section 5.1, Table 5 | Table 5 |
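The Delta column lists only two comparisons. As a small worked example, the gap to the base model can be recomputed for every configuration reported in the MCQA row; the scores are copied from the table and nothing else is assumed.

```python
# Aggregated MCQA Exact Match scores, copied from the table row above.
scores = {"NAT": 40.59, "TRAD": 36.44, "SYN": 29.73, "NAT-TRAD": 41.37, "ALL": 40.97}
BASE = 37.31  # Qwen-4B base model

# Delta of each fine-tuned configuration relative to the base model.
deltas = {name: round(score - BASE, 2) for name, score in scores.items()}

# Print configurations from largest gain to largest loss.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<9}{delta:+.2f}")
```

On these numbers, the translated-only and synthetic-only configurations both score below the base model, while every configuration that includes native data scores above it.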
What To Try In 7 Days
Start collecting a small native French dataset (5k–10k expert pairs) to serve as an anchor for further tuning.
Translate reliable English biomedical instruction sets and combine them with native samples in a 1:1 ratio for pilot fine-tuning.
Use a domain-specialized LLM (if available) to pre-screen synthetic/translated examples before training to reduce noise.
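The screening and 1:1 mixing steps above might look like the sketch below. It is illustrative only: `screen`, `mix_one_to_one`, the `quality` field, and the 0.7 threshold are hypothetical, and in practice the score would come from whatever screening model is available.

```python
import random


def screen(examples, score_fn, threshold):
    """Keep examples whose quality score meets the threshold."""
    return [ex for ex in examples if score_fn(ex) >= threshold]


def mix_one_to_one(native, translated, seed=0):
    """Draw equal numbers of native and translated examples, then shuffle them."""
    rng = random.Random(seed)
    n = min(len(native), len(translated))
    mixed = rng.sample(native, n) + rng.sample(translated, n)
    rng.shuffle(mixed)
    return mixed


# Toy data: the "quality" field stands in for a score from a screening model.
native = [{"src": "native", "id": i} for i in range(5_000)]
translated = [
    {"src": "translated", "id": i, "quality": (i % 10) / 10} for i in range(20_000)
]

# Hypothetical threshold; tune it against a small hand-checked sample.
kept = screen(translated, score_fn=lambda ex: ex["quality"], threshold=0.7)
pilot = mix_one_to_one(native, kept)
```

Screening before mixing keeps the 1:1 ratio exact even when many translated examples are discarded.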
Risks & Boundaries
Limitations
Experiments use a single backbone (Qwen-4B-Instruct); results may differ at other scales.
Synthetic examples were not filtered by quality score before fine-tuning, possibly leaving noisy pairs.
When Not To Use
If you have access to large-scale native expert-labeled French medical pairs, prefer native-only supervision.
When clinical safety requires strictly human-verified answers, because synthetic or translated data can introduce factual drift.
Failure Modes
Models may inherit factual errors from synthetic generations or translation artifacts.
Evaluation metrics can be biased by response length (verbosity) or answer ordering (positional priors).
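Two simple diagnostics for these evaluation biases are sketched below. Neither is described in the summary as the authors' procedure; both are common checks, and all names and toy values are hypothetical. The first shuffles multiple-choice options while tracking the gold answer, so accuracy can be compared across orderings. The second measures how strongly judge scores correlate with response length.

```python
import random
from math import sqrt


def shuffle_options(options, gold_index, rng):
    """Permute MCQ options and return (shuffled options, new index of the gold answer)."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(gold_index)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


rng = random.Random(0)
options = ["aspirine", "morphine", "insuline", "warfarine"]
shuffled, new_gold = shuffle_options(options, gold_index=2, rng=rng)

# Hypothetical judge scores paired with response lengths (in tokens).
lengths = [40, 85, 120, 200, 310]
judge_scores = [2.0, 3.0, 3.5, 4.0, 4.5]
length_bias = pearson(lengths, judge_scores)
```

If accuracy changes noticeably after shuffling, or if `length_bias` is close to 1, the metric is likely rewarding position or verbosity rather than correctness.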

