Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
FlowerTune shows federated adapter tuning can build domain-specialized LLMs without sharing raw data while keeping communication and memory within practical limits.
Summary TLDR
FlowerTune builds an open, community-driven benchmark and leaderboard to evaluate federated fine-tuning of LLMs across four domains (general NLP, finance, medical, coding). It provides federated instruction datasets, unified pipelines, and baseline results for 26 models (135M–14B params) using PEFT adapters (LoRA/DoRA), 4-bit quantization, and standard FL algorithms. Key takeaways: model choice matters more than aggregation method; instruct-tuned models outperform non-instruct ones; small models can be viable in resource-constrained FL with adapter tuning.
Problem Statement
Can pre-trained LLMs be fine-tuned in federated (data-private) settings across domains, and which models, adapter methods, and aggregation strategies give the best trade-off between accuracy, communication, and memory under realistic cross-silo constraints?
Main Contribution
FlowerTune LLM Leaderboard: open benchmark and pipelines for federated instruction fine-tuning across four domains (general NLP, finance, medical, code).
Federated datasets and standardized splits that simulate cross-institution data (20–50 clients per domain).
Large empirical study: federated adapter fine-tuning (LoRA/DoRA with 4-bit quantization) on 26 base models (135M–14B parameters), reported together with system metrics (communication and VRAM).
Analysis of aggregation algorithms (FedAvg, FedProx, FedAvgM, FlexLoRA) and adapter strategies showing model choice dominates performance differences.
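As a point of reference for the aggregation algorithms named above, FedAvg combines client updates as an average weighted by each client's number of local examples. The following is a minimal pure-Python sketch of that rule applied to adapter weights. The `AdapterState` layout (parameter name mapped to a flat list of floats) and the function name `fedavg` are illustrative assumptions, not the benchmark's actual code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of adapter weights.
AdapterState = Dict[str, List[float]]


def fedavg(updates: List[Tuple[AdapterState, int]]) -> AdapterState:
    """Average client adapter states, weighted by local example counts."""
    if not updates:
        raise ValueError("need at least one client update")
    total_examples = sum(num_examples for _, num_examples in updates)
    if total_examples <= 0:
        raise ValueError("total number of examples must be positive")

    reference_state = updates[0][0]
    aggregated: AdapterState = {}
    for name, values in reference_state.items():
        accumulator = [0.0] * len(values)
        for state, num_examples in updates:
            weight = num_examples / total_examples
            for i, value in enumerate(state[name]):
                accumulator[i] += weight * value
        aggregated[name] = accumulator
    return aggregated


if __name__ == "__main__":
    # Two hypothetical clients holding 1 and 3 local examples.
    client_a = ({"lora_A": [1.0, 2.0]}, 1)
    client_b = ({"lora_A": [3.0, 4.0]}, 3)
    print(fedavg([client_a, client_b]))
```

Only the adapter tensors are averaged in this sketch; the quantized base model stays fixed on every client, which is what keeps the transmitted payload small.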
Key Findings
Instruct-tuned base models outperform non-instruct counterparts under the same federated adapter tuning.
Model choice has larger impact than aggregation algorithm or adapter variant.
Domain winners vary, but larger instruct models often lead: Gemma2-9B-Instruct excels on finance and medical, while Qwen2.5-7B and Phi-4-mini are strong on general NLP and code.
Adapter-based PEFT with DoRA/LoRA and 4-bit quantization keeps communication and memory manageable.
Smaller models can match or beat larger ones on simple classification tasks under FL.
Results
- Accuracy
- Coding average pass@1 (Gemma2-9B-Instruct)
- System memory (SmolLM2-135M-Instruct)
- Aggregation variation
Who Should Care
What To Try In 7 Days
Run the FlowerTune baseline on your domain using the provided template to reproduce results.
Compare an instruct and non-instruct base model on a held-out domain sample to check gain from instruct tuning.
Test LoRA vs DoRA adapters with 4-bit quantization to measure communication and VRAM on your infrastructure (report both in GB).
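Before running the comparison, a back-of-the-envelope estimate can show how adapter size drives communication. The sketch below counts trainable parameters for one adapted weight matrix and converts them into gigabytes exchanged over a training run. Several things here are assumptions rather than details from the source: DoRA is modeled as LoRA plus one magnitude value per output feature, adapters are sent as 4-byte floats, every sampled client both downloads and uploads the adapter each round, and all function names are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA pair: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def dora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA parameters plus an assumed magnitude vector of length d_out."""
    return lora_params(d_in, d_out, rank) + d_out


def round_trip_gb(
    adapter_params: int,
    clients_per_round: int,
    rounds: int,
    bytes_per_param: int = 4,  # assumption: float32 on the wire
) -> float:
    """Total traffic in GB (1 GB = 1e9 bytes), counting download and upload."""
    total_bytes = adapter_params * bytes_per_param * 2 * clients_per_round * rounds
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical setup: 32 layers, 4 adapted 4096x4096 matrices per layer, rank 8.
    matrices = 32 * 4
    lora_total = matrices * lora_params(4096, 4096, 8)
    dora_total = matrices * dora_params(4096, 4096, 8)
    for label, total in (("LoRA", lora_total), ("DoRA", dora_total)):
        gb = round_trip_gb(total, clients_per_round=4, rounds=100)
        print(f"{label}: {total} params, about {gb:.2f} GB over the run")
```

An estimate like this is only a sanity check against the measured numbers; serialization overhead and any compression on your infrastructure will shift the real figures.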
Optimization Features
Infra Optimization
- experiments target single A100/H100 GPU feasibility
- DoRA rank/alpha tuned so that very large models fit on a single GPU
Model Optimization
- LoRA
System Optimization
- selective client sampling (20% per round)
- low-rank adapters to reduce transmitted bytes
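The client sampling step listed above can be sketched as drawing a fixed fraction of the available clients each round. This is a minimal illustration using only the standard library; the function name `sample_clients`, the rounding rule, and the seeding scheme are assumptions, and a federated learning framework may select clients differently.

```python
import random
from typing import List, Optional, Sequence


def sample_clients(
    client_ids: Sequence[str],
    fraction: float = 0.2,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset of clients for one round, without replacement."""
    if not client_ids:
        raise ValueError("client_ids must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: round to the nearest integer and always keep at least one client.
    num_selected = max(1, round(fraction * len(client_ids)))
    rng = random.Random(seed)
    return rng.sample(list(client_ids), num_selected)


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(20)]
    for round_number in range(3):
        # Deriving the seed from the round number keeps runs reproducible.
        print(round_number, sample_clients(clients, fraction=0.2, seed=round_number))
```

With 20 clients and a 20% fraction, four clients train per round, so per-round communication scales with four adapter round trips rather than twenty.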
Training Optimization
- parameter-efficient fine-tuning (fewer trainable params)
- FlashAttention-2 for faster attention computation
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Focus is cross-silo FL; results may not transfer to extreme cross-device edge cases.
- Unified hyperparameters used for fairness; per-model tuning could improve results.
- Only models ≤14B evaluated in main experiments; 24B results are limited and need more study.
When Not To Use
- If you require very large models (>14B) without adapter reductions.
- When per-client on-device compute is extremely limited (microcontrollers) and latency matters more than VRAM footprint.
- If you need per-model hyperparameter tuning for top accuracy out of the box.
Failure Modes
- Heterogeneous client data can cause uneven generalization across sub-datasets (noted in medical and code domains).
- Communication spikes if client count or adapter size increases.
- Adapters may not capture domain nuances as well as full fine-tuning for some tasks.
Core Entities
Models
- Gemma2-9B-Instruct
- Qwen2.5-7B-Instruct
- Phi-4-mini-Instruct
- Mistral-7B-Instruct-v0.3
- Llama3.1-8B-Instruct
- SmolLM2-135M-Instruct
- Mistral-24B-Instruct-2501
Metrics
- Accuracy
- Pass@1
- Communication (GB)
- VRAM (GB)
Datasets
- alpaca-gpt4 (general NLP)
- fingpt-sentiment-train (finance)
- medical-flashcards (medical)
- code-alpaca-20k (code)
- MMLU
- FPB
- FIQA
- TFNS
- PubMedQA
- MedMCQA
- MedQA
- CareQA
- MBPP
- HumanEval
- MultiPL-E
Benchmarks
- FlowerTune LLM Leaderboard
Context Entities
Models
- Gemma-2-9B
- Qwen2.5-3B
- Phi-4 (14B)
- Mistral-7B
- SmolLM2 family
- Mistral-Small-24B
Metrics
- Pass@1
- Accuracy
Datasets
- alpaca-gpt4
- code-alpaca-20k
- fingpt-sentiment-train
- medical-flashcards
Benchmarks
- FedLLM-Bench
- FedScope-LLM
- OpenFedLLM

