Overview
The benchmark is practical and reproducible, with public code and datasets. Its conclusions rest on many model runs, but all runs share a single unified hyperparameter protocol with no per-model tuning.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
FlowerTune shows federated adapter tuning can build domain-specialized LLMs without sharing raw data while keeping communication and memory within practical limits.
Who Should Care
Summary TLDR
FlowerTune builds an open, community-driven benchmark and leaderboard to evaluate federated fine-tuning of LLMs across four domains (general NLP, finance, medical, coding). It provides federated instruction datasets, unified pipelines, and baseline results for 26 models (135M–14B params) using PEFT adapters (LoRA/DoRA), 4-bit quantization, and standard FL algorithms. Key takeaways: model choice matters more than aggregation method; instruct-tuned models outperform non-instruct ones; small models can be viable in resource-constrained FL with adapter tuning.
Problem Statement
Can pre-trained LLMs be fine-tuned in federated (data-private) settings across domains, and which models, adapter methods, and aggregation strategies give the best trade-off between accuracy, communication, and memory under realistic cross-silo constraints?
Main Contribution
FlowerTune LLM Leaderboard: open benchmark and pipelines for federated instruction fine-tuning across four domains (general NLP, finance, medical, code).
Federated datasets and standardized splits that simulate cross-institution data (20–50 clients per domain).
Key Findings
Instruct-tuned base models outperform non-instruct counterparts under the same federated adapter tuning.
Model choice has larger impact than aggregation algorithm or adapter variant.
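For context on what "aggregation algorithm" means in these findings, the following is a minimal sketch of FedAvg-style weighted averaging over adapter weights. FedAvg is used only as a representative example: the summary says "standard FL algorithms" without naming them. The `Adapter` type, the `fedavg` function, and the parameter names `lora_A` and `lora_B` are illustrative assumptions, not the benchmark's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flattened adapter weights.
Adapter = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Adapter, int]]) -> Adapter:
    """Average client adapters, weighting each client by its example count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one training example across clients")

    merged: Adapter = {}
    for name in updates[0][0]:
        acc = [0.0] * len(updates[0][0][name])
        for weights, n in updates:
            for i, w in enumerate(weights[name]):
                acc[i] += w * n / total
        merged[name] = acc
    return merged


# Two toy clients with 100 and 300 local examples (made-up numbers).
clients = [
    ({"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]}, 100),
    ({"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0]}, 300),
]
print(fedavg(clients))
# {'lora_A': [2.5, 5.0], 'lora_B': [1.5, 1.0]}
```

Only the small adapter tensors are averaged here. The frozen base model never leaves the client, which is why the choice of base model can dominate the outcome while the aggregation rule changes comparatively little.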
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 64.50% | — | — | MMLU (STEM, Social, Humanities) | Table 3 shows Qwen2.5-7B-Instruct avg 64.50% | Table 3 |
| Accuracy | 84.21% | — | — | FPB / FIQA / TFNS | Table 4 reports Gemma2-9B-Instruct avg 84.21% | Table 4 |
What To Try In 7 Days
Run the FlowerTune baseline on your domain using the provided template to reproduce results.
Compare an instruct-tuned and a non-instruct base model on a held-out domain sample to check the gain from instruction tuning.
Test LoRA vs. DoRA adapters with 4-bit quantization and measure communication volume and VRAM use on your infrastructure (report both in GB).
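Before running the adapter comparison, a back-of-the-envelope estimate can indicate what sizes to expect. The sketch below uses the standard LoRA parameter count for one linear layer, `rank * (d_in + d_out)`, and a weights-only estimate for a quantized base model. The hidden size, layer count, rank, and the choice of two adapted projections per layer are hypothetical values picked for illustration, not settings taken from FlowerTune.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    return rank * (d_in + d_out)


def adapter_megabytes(n_params: int, bytes_per_param: int = 2) -> float:
    """Adapter payload in MB, assuming 16-bit adapter weights by default."""
    return n_params * bytes_per_param / 1e6


def base_weight_gigabytes(n_params: float, bits: int = 4) -> float:
    """Weights-only footprint of the quantized base model in GB.

    Excludes quantization metadata, activations, and optimizer state.
    """
    return n_params * bits / 8 / 1e9


# Hypothetical 7B-class configuration (assumed values, not from the benchmark).
hidden, layers, rank = 4096, 32, 16
adapted_per_layer = 2  # assumption: two square projections adapted per layer

total = adapted_per_layer * lora_params(hidden, hidden, rank) * layers
print(total)  # 8388608 adapter parameters
print(f"adapter payload: {adapter_megabytes(total):.1f} MB")
print(f"4-bit base weights: {base_weight_gigabytes(7e9):.1f} GB")
```

DoRA additionally learns a magnitude vector for each adapted layer, so its payload should come out slightly larger than the LoRA figure. Measured VRAM will exceed the weights-only number because of activations, optimizer state, and quantization overhead.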
Optimization Features
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Risks & Boundaries
Limitations
Focus is cross-silo FL; results may not transfer to extreme cross-device edge cases.
Unified hyperparameters used for fairness; per-model tuning could improve results.
When Not To Use
When you require very large models (>14B parameters) without adapter-based reductions.
When per-client on-device compute is extremely limited (microcontrollers) and latency matters more than VRAM footprint.
Failure Modes
Heterogeneous client data can cause uneven generalization across sub-datasets (noted in medical and code domains).
Communication cost grows as client count or adapter size increases.
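The communication failure mode can be made concrete with a rough traffic model. This sketch assumes that only adapter weights are exchanged, that each participating client downloads the global adapter and uploads one update per round, and that no compression is applied. The function names and the example numbers (16.8 MB adapter, 10 clients, 100 rounds) are hypothetical.

```python
def round_traffic_mb(adapter_mb: float, clients_per_round: int) -> float:
    """Server-side traffic for one round: one download plus one upload per client."""
    return 2 * adapter_mb * clients_per_round


def total_traffic_gb(adapter_mb: float, clients_per_round: int, rounds: int) -> float:
    """Total traffic over the whole run, in GB (1 GB = 1000 MB)."""
    return round_traffic_mb(adapter_mb, clients_per_round) * rounds / 1000


baseline = total_traffic_gb(adapter_mb=16.8, clients_per_round=10, rounds=100)
more_clients = total_traffic_gb(adapter_mb=16.8, clients_per_round=20, rounds=100)
bigger_adapter = total_traffic_gb(adapter_mb=33.6, clients_per_round=10, rounds=100)

print(f"baseline:       {baseline:.1f} GB")
print(f"2x clients:     {more_clients:.1f} GB")
print(f"2x adapter size: {bigger_adapter:.1f} GB")
```

Under these assumptions traffic scales linearly in both factors, so doubling both the client count and the adapter size quadruples the total. That is worth budgeting for before raising LoRA rank or onboarding more silos.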

