FlowerTune: an open leaderboard to benchmark federated fine-tuning of LLMs across NLP, finance, medical and code

June 3, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane

Why It Matters For Business

FlowerTune shows federated adapter tuning can build domain-specialized LLMs without sharing raw data while keeping communication and memory within practical limits.

Summary TLDR

FlowerTune builds an open, community-driven benchmark and leaderboard to evaluate federated fine-tuning of LLMs across four domains (general NLP, finance, medical, coding). It provides federated instruction datasets, unified pipelines, and baseline results for 26 models (135M–14B params) using PEFT adapters (LoRA/DoRA), 4-bit quantization, and standard FL algorithms. Key takeaways: model choice matters more than aggregation method; instruct-tuned models outperform non-instruct ones; small models can be viable in resource-constrained FL with adapter tuning.

Problem Statement

Can pre-trained LLMs be fine-tuned in federated (data-private) settings across domains, and which models, adapter methods, and aggregation strategies give the best trade-off between accuracy, communication, and memory under realistic cross-silo constraints?

Main Contribution

FlowerTune LLM Leaderboard: open benchmark and pipelines for federated instruction fine-tuning across four domains (general NLP, finance, medical, code).

Federated datasets and standardized splits that simulate cross-institution data (20–50 clients per domain).

Large empirical study: federated adapter fine-tuning (LoRA/DoRA + 4-bit quant) on 26 base models (135M–14B) with system metrics (communication, VRAM).

Analysis of aggregation algorithms (FedAvg, FedProx, FedAvgM, FlexLoRA) and adapter strategies showing model choice dominates performance differences.
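As a rough illustration of the simplest aggregation strategy named above, FedAvg combines client adapters by averaging them, weighted by each client's local example count. The sketch below is a minimal pure-Python version under stated assumptions: the `Adapter` type, the `fedavg` function, and the parameter names `lora_A` and `lora_B` are hypothetical and are not taken from the FlowerTune code base.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flattened adapter weights.
Adapter = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Adapter, int]]) -> Adapter:
    """Average client adapters, weighting each by its number of local examples."""
    if not updates:
        raise ValueError("need at least one client update")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total example count must be positive")

    # Start from zeros with the same shapes as the first client's adapter.
    first, _ = updates[0]
    aggregated: Adapter = {name: [0.0] * len(vals) for name, vals in first.items()}

    for adapter, n in updates:
        weight = n / total
        for name, vals in adapter.items():
            for i, v in enumerate(vals):
                aggregated[name][i] += weight * v
    return aggregated


# Two illustrative clients with made-up adapter values and example counts.
client_a = ({"lora_A": [1.0, 2.0], "lora_B": [0.0]}, 100)
client_b = ({"lora_A": [3.0, 4.0], "lora_B": [4.0]}, 300)
global_adapter = fedavg([client_a, client_b])
```

Only the adapter weights are averaged here; the frozen, quantized base model never leaves the client in this setup.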

Key Findings

Instruct-tuned base models outperform non-instruct counterparts under the same federated adapter tuning.

Numbers: Qwen2.5-7B instruct avg (GenNLP) 64.50% vs non-instruct 42.79% (Table 3 vs Table 12).

Model choice has larger impact than aggregation algorithm or adapter variant.

Numbers: Aggregation yields marginal changes (GenNLP FedAvg 42.79% vs FedProx 42.92%; Table 7).
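For context on how FedProx differs from FedAvg: in its standard formulation, FedProx adds a proximal term (mu / 2) * ||w - w_global||^2 to each client's local loss, which pulls local weights back toward the current global weights. The sketch below shows only that term and its gradient. The function names and the value of `mu` are illustrative assumptions and do not reflect the settings used in the benchmark.

```python
from typing import List


def proximal_penalty(local: List[float], global_: List[float], mu: float) -> float:
    """Return (mu / 2) * squared distance between local and global weights."""
    return 0.5 * mu * sum((w - g) ** 2 for w, g in zip(local, global_))


def proximal_gradient(local: List[float], global_: List[float], mu: float) -> List[float]:
    """Gradient of the proximal penalty with respect to the local weights."""
    return [mu * (w - g) for w, g in zip(local, global_)]


# Made-up weights; mu = 0.1 is an arbitrary illustrative value.
local_w = [1.0, 2.0]
global_w = [0.0, 0.0]
penalty = proximal_penalty(local_w, global_w, mu=0.1)
grad = proximal_gradient(local_w, global_w, mu=0.1)
```

With `mu` set to zero the penalty vanishes and the local objective reduces to the one used with plain FedAvg.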

Domain winners vary, but larger instruct models often lead: Gemma2-9B-Instruct excels on finance and medical, while Qwen2.5-7B and Phi-4-mini are strong on GenNLP and code.

Numbers: Gemma2-9B finance avg 84.21% (Table 4); Gemma2-9B medical avg 62.25% (Table 5); Qwen2.5-7B GenNLP avg 64.50% (Table 3).

Adapter-based PEFT with DoRA/LoRA and 4-bit quantization keeps communication and memory manageable.

Numbers: Many models fit in <80 GB VRAM; SmolLM2-135M uses ~7–9 GB VRAM and very low communication (1–3 GB) (Tables 3–6).
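To see why adapter tuning keeps communication low, a back-of-the-envelope estimate helps. A LoRA adapter on a weight matrix of shape (d_out, d_in) with rank r adds r * (d_in + d_out) trainable parameters, and only those are exchanged each round. The sketch below is a rough calculator, not the leaderboard's accounting method: the hidden size, rank, layer count, precision, client count, and round count are all made-up illustrative values, and it assumes each sampled client downloads and uploads the adapter once per round.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_communication_gb(
    adapter_params: int,
    bytes_per_param: int,
    clients_per_round: int,
    rounds: int,
) -> float:
    """Total traffic in GB, assuming one download and one upload per sampled client per round."""
    bytes_per_transfer = adapter_params * bytes_per_param
    return 2 * bytes_per_transfer * clients_per_round * rounds / 1e9


# Illustrative configuration (all values are assumptions):
# 32 layers, 2 adapted square projections per layer, hidden size 4096, rank 16.
per_matrix = lora_param_count(d_in=4096, d_out=4096, rank=16)
adapter_params = per_matrix * 2 * 32

# 4 clients per round (20% of 20), 10 rounds, 4 bytes per parameter (fp32).
comm_gb = total_communication_gb(adapter_params, 4, clients_per_round=4, rounds=10)
```

The estimate scales linearly in rank, number of adapted matrices, sampled clients, and rounds, so doubling any one of them doubles the traffic.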

Smaller models can match or beat larger ones on simple classification tasks under FL.

Numbers: On finance, SmolLM2-1.7B avg 48.94% and some small models outperform larger ones on specific datasets (Table 4).

Results

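GenNLP average accuracy (Qwen2.5-7B-Instruct)

Value: 64.50%

Finance average accuracy (Gemma2-9B-Instruct)

Value: 84.21%

Medical average accuracy (Gemma2-9B-Instruct)

Value: 62.25%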

Coding average pass@1 (Gemma2-9B-Instruct)

Value: 53.29%

System memory (SmolLM2-135M-Instruct)

Value: ≈7–9 GB VRAM per client

Aggregation variation

Value: ≤ ~2 percentage points

What To Try In 7 Days

Run the FlowerTune baseline on your domain using the provided template to reproduce results.

Compare an instruct and non-instruct base model on a held-out domain sample to check gain from instruct tuning.

Test LoRA vs DoRA adapters with 4-bit quantization to measure communication and VRAM on your infrastructure (report both in GB).

Optimization Features

Infra Optimization

  • experiments target single A100/H100 GPU feasibility
  • DoRA rank/alpha tuned for very large models so they fit on a single GPU

Model Optimization

  • LoRA

System Optimization

  • selective client sampling (20% per round)
  • low-rank adapters to reduce transmitted bytes
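The client sampling listed above can be sketched as drawing a fixed fraction of clients uniformly at random each round. This is a minimal illustration assuming uniform sampling without replacement; `sample_clients` is a hypothetical helper, not a Flower API, and the actual selection logic in the benchmark may differ.

```python
import random
from typing import List


def sample_clients(client_ids: List[int], fraction: float, rng: random.Random) -> List[int]:
    """Pick a fraction of clients uniformly at random, always at least one."""
    if not client_ids:
        raise ValueError("no clients available")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(client_ids)))
    return rng.sample(client_ids, k)


# 20 clients with a 20% sampling rate gives 4 participants per round.
rng = random.Random(0)  # seeded only to make the illustration repeatable
clients = list(range(20))
selected = sample_clients(clients, 0.2, rng)
```

Sampling fewer clients per round lowers per-round traffic and compute, at the cost of each round seeing a smaller slice of the federated data.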

Training Optimization

  • parameter-efficient fine-tuning (fewer trainable params)
  • FlashAttention-2 for faster attention computation

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Focus is cross-silo FL; results may not transfer to extreme cross-device edge cases.
  • Unified hyperparameters used for fairness; per-model tuning could improve results.
  • Only models ≤14B evaluated in main experiments; 24B results are limited and need more study.

When Not To Use

  • If you require very large models (>14B) without adapter reductions.
  • When per-client on-device compute is extremely limited (microcontrollers) and latency matters more than VRAM footprint.
  • If you need per-model hyperparameter tuning for top accuracy out of the box.

Failure Modes

  • Heterogeneous client data can cause uneven generalization across sub-datasets (noted in medical and code domains).
  • Communication spikes if client count or adapter size increases.
  • Adapters may not capture domain nuances as well as full fine-tuning for some tasks.

Core Entities

Models

  • Gemma2-9B-Instruct
  • Qwen2.5-7B-Instruct
  • Phi-4-mini-Instruct
  • Mistral-7B-Instruct-v0.3
  • Llama3.1-8B-Instruct
  • SmolLM2-135M-Instruct
  • Mistral-24B-Instruct-2501

Metrics

  • Accuracy
  • Pass@1
  • Communication (GB)
  • VRAM (GB)
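For readers unfamiliar with Pass@1 from the metrics above: pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, 1 - C(n - c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The sketch below implements that estimator. Whether the leaderboard uses this exact procedure (for example, how many samples it draws per problem) is an assumption not confirmed by this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 10 samples for one problem, 3 of which pass the tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

For k = 1 the estimator reduces to c / n, and a benchmark-level score is the mean of the per-problem values.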

Datasets

  • alpaca-gpt4 (general NLP)
  • fingpt-sentiment-train (finance)
  • medical-flashcards (medical)
  • code-alpaca-20k (code)
  • MMLU
  • FPB
  • FIQA
  • TFNS
  • PubMedQA
  • MedMCQA
  • MedQA
  • CareQA
  • MBPP
  • HumanEval
  • MultiPL-E

Benchmarks

  • FlowerTune LLM Leaderboard

Context Entities

Models

  • Gemma-2-9B
  • Qwen2.5-3B
  • Phi-4 (14B)
  • Mistral-7B
  • SmolLM2 family
  • Mistral-Small-24B

Metrics

  • pass@1
  • Accuracy

Datasets

  • alpaca-gpt4
  • code-alpaca-20k
  • fingpt-sentiment-train
  • medical-flashcards

Benchmarks

  • FedLLM-Bench
  • FedScope-LLM
  • OpenFedLLM