FlowerTune: an open leaderboard to benchmark federated fine-tuning of LLMs across NLP, finance, medical and code

June 3, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible with public code and datasets; conclusions are supported by many model runs but use a single unified hyperparameter protocol without per-model tuning.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FlowerTune shows federated adapter tuning can build domain-specialized LLMs without sharing raw data while keeping communication and memory within practical limits.

Who Should Care

Summary TLDR

FlowerTune builds an open, community-driven benchmark and leaderboard to evaluate federated fine-tuning of LLMs across four domains (general NLP, finance, medical, coding). It provides federated instruction datasets, unified pipelines, and baseline results for 26 models (135M–14B params) using PEFT adapters (LoRA/DoRA), 4-bit quantization, and standard FL algorithms. Key takeaways: model choice matters more than aggregation method; instruct-tuned models outperform non-instruct ones; small models can be viable in resource-constrained FL with adapter tuning.
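To make the aggregation step concrete, here is a minimal sketch of FedAvg applied to adapter weights only. It assumes each client returns its adapter parameters as a flat list of floats together with its local example count; the function name `fedavg`, the flat-list layout, and the toy values are illustrative assumptions, not taken from the FlowerTune code.

```python
from typing import List, Tuple


def fedavg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Weighted average of client adapter weights (FedAvg).

    Each update is (adapter_weights, num_examples). Weights are flat
    lists of equal length; this layout is an assumption for illustration.
    """
    total_examples = sum(n for _, n in updates)
    if total_examples == 0:
        raise ValueError("no training examples reported by clients")
    dim = len(updates[0][0])
    aggregated = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all clients must send same-shaped adapters")
        share = n / total_examples  # client's share of the total data
        for i, w in enumerate(weights):
            aggregated[i] += share * w
    return aggregated


# Toy round: two clients, the second holds three times as much data.
client_updates = [([1.0, 0.0], 100), ([0.0, 1.0], 300)]
global_adapter = fedavg(client_updates)
```

Because only the adapter is averaged, the frozen (and possibly quantized) base model never leaves the client, which is what keeps the per-round payload small.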

Problem Statement

Can pre-trained LLMs be fine-tuned in federated (data-private) settings across domains, and which models, adapter methods, and aggregation strategies give the best trade-off between accuracy, communication, and memory under realistic cross-silo constraints?

Main Contribution

FlowerTune LLM Leaderboard: open benchmark and pipelines for federated instruction fine-tuning across four domains (general NLP, finance, medical, code).

Federated datasets and standardized splits that simulate cross-institution data (20–50 clients per domain).

Key Findings

Instruct-tuned base models outperform non-instruct counterparts under the same federated adapter tuning.

Numbers: Qwen2.5-7B instruct avg (GenNLP) 64.50% vs non-instruct 42.79% (Table 3 vs Table 12).

Practical Use: Prefer instruct versions of base models when available; expect ~20–25 percentage-point lifts on evaluated tasks without extra hyperparameter work.

Evidence Ref: Tables 3 and 12

Model choice has larger impact than aggregation algorithm or adapter variant.

Numbers: Aggregation yields marginal changes (GenNLP FedAvg 42.79% vs FedProx 42.92%; Table 7).

Practical Use: Spend time selecting and validating base models for your domain before tuning aggregation or adapter tweaks.

Evidence Ref: Table 7
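The two findings can be put side by side with a few lines of arithmetic. The sketch below uses only the GenNLP averages quoted above; the variable names are illustrative.

```python
# GenNLP averages quoted in the findings above (percent).
instruct_avg = 64.50      # Qwen2.5-7B instruct (Table 3)
non_instruct_avg = 42.79  # non-instruct counterpart (Table 12)
fedavg_avg = 42.79        # FedAvg (Table 7)
fedprox_avg = 42.92       # FedProx (Table 7)

# Gaps in percentage points (pp).
model_choice_gap = instruct_avg - non_instruct_avg
aggregation_gap = fedprox_avg - fedavg_avg

print(f"instruct vs non-instruct: {model_choice_gap:.2f} pp")
print(f"FedProx vs FedAvg:        {aggregation_gap:.2f} pp")
```

On these figures, switching to the instruct model moves the average by 21.71 pp, while switching the aggregation algorithm moves it by 0.13 pp.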

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 64.50% | - | - | MMLU (STEM, Social, Humanities) | Table 3 shows Qwen2.5-7B-Instruct avg 64.50% | Table 3
Accuracy | 84.21% | - | - | FPB / FIQA / TFNS | Table 4 reports Gemma2-9B-Instruct avg 84.21% | Table 4

What To Try In 7 Days

Run the FlowerTune baseline on your domain using the provided template to reproduce results.

Compare an instruct and non-instruct base model on a held-out domain sample to check gain from instruct tuning.

Test LoRA vs DoRA adapters with 4-bit quantization and measure communication volume and peak VRAM on your own infrastructure (report both in GB).
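Before measuring on real hardware, a back-of-envelope estimate of adapter traffic can help set expectations. The sketch below relies on the standard LoRA parameter count, rank * (d_in + d_out) per adapted weight matrix. The layer count, matrix sizes, rank, client count, and 4 bytes per parameter are hypothetical values chosen for illustration, not the FlowerTune settings.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def round_traffic_gb(adapter_params: int, clients_per_round: int,
                     bytes_per_param: int = 4) -> float:
    """Approximate server traffic for one round in GB (10**9 bytes).

    Assumes each sampled client downloads and uploads the full adapter
    once, with no compression.
    """
    payload_bytes = adapter_params * bytes_per_param
    return 2 * clients_per_round * payload_bytes / 1e9


# Hypothetical setup: 32 layers, two 4096 x 4096 projections adapted per
# layer, rank 8, and 4 clients sampled per round.
per_matrix = lora_params(4096, 4096, rank=8)
adapter_total = per_matrix * 2 * 32
traffic = round_traffic_gb(adapter_total, clients_per_round=4)
print(f"{adapter_total:,} adapter params, ~{traffic:.3f} GB per round")
```

Traffic in this estimate grows linearly with rank, number of adapted matrices, and clients per round, so doubling any one of them doubles the figure to compare against your measured numbers.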

Optimization Features

Infra Optimization
experiments target single A100/H100 GPU feasibility
DoRA rank/alpha tuned for very large models to fit single GPU
Model Optimization
LoRA
System Optimization
selective client sampling (20% per round)
low-rank adapters to reduce transmitted bytes
Training Optimization
parameter-efficient fine-tuning (fewer trainable params)
FlashAttention-2 for faster attention computation
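
The per-round client sampling listed above can be sketched as follows. This is a minimal illustration that assumes uniform random sampling without replacement; the helper name `sample_clients`, the 20-client federation, and the fixed seed are hypothetical, and the benchmark's actual sampler may differ.

```python
import random
from typing import List


def sample_clients(client_ids: List[int], fraction: float,
                   rng: random.Random) -> List[int]:
    """Pick a random subset of clients for one training round.

    At least one client is always selected so a round never runs empty.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(client_ids)))
    return rng.sample(client_ids, k)


# Hypothetical federation of 20 clients, sampling 20% each round.
rng = random.Random(0)  # fixed seed for repeatable simulations
clients = list(range(20))
for round_idx in range(3):
    selected = sample_clients(clients, fraction=0.2, rng=rng)
    print(f"round {round_idx}: clients {sorted(selected)}")
```

With 20 clients and a 20% fraction, four clients train and upload adapters in each round, which caps per-round traffic at a fifth of full participation.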

Risks & Boundaries

Limitations

Focus is cross-silo FL; results may not transfer to extreme cross-device edge cases.

Unified hyperparameters used for fairness; per-model tuning could improve results.

When Not To Use

If you require very large models (>14B) without adapter reductions.

When per-client on-device compute is extremely limited (microcontrollers) and latency matters more than VRAM footprint.

Failure Modes

Heterogeneous client data can cause uneven generalization across sub-datasets (noted in medical and code domains).

Communication cost spikes if the client count or adapter size increases.

Core Entities

Models

Gemma2-9B-Instruct, Qwen2.5-7B-Instruct, Phi-4-mini-Instruct, Mistral-7B-Instruct-v0.3, Llama3.1-8B-Instruct, SmolLM2-135M-Instruct, Mistral-24B-Instruct-2501

Metrics

Accuracy, Pass@1, Communication (GB), VRAM (GB)

Datasets

alpaca-gpt4 (general NLP), fingpt-sentiment-train (finance), medical-flashcards (medical), code-alpaca-20k (code), MMLU, FPB, FIQA, TFNS, PubMedQA, MedMCQA, MedQA, CareQA, MBPP, HumanEval, MultiPL-E

Benchmarks

FlowerTune LLM Leaderboard

Context Entities

Models

Gemma-2-9B, Qwen2.5-3B, Phi-4 (14B), Mistral-7B, SmolLM2 family, Mistral-Small-24B

Metrics

pass@1, Accuracy

Datasets

alpaca-gpt4, code-alpaca-20k, fingpt-sentiment-train, medical-flashcards

Benchmarks

FedLLM-Bench, FedScope-LLM, OpenFedLLM