FlowerTune: an open leaderboard to benchmark federated fine-tuning of LLMs across NLP, finance, medical and code

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and reproducible with public code and datasets; conclusions are supported by many model runs but use a single unified hyperparameter protocol without per-model tuning.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Yan Gao, Massimo Roberto Scamarcia, Javier Fernandez-Marques, Mohammad Naseri, Chong Shen Ng, Dimitris Stripelis, Zexi Li, Tao Shen, Jiamu Bai, Daoyuan Chen, Zikai Zhang, Rui Hu, InSeo Song, Lee KangYoon, Hong Jia, Ting Dang, Junyan Wang, Zheyuan Liu, Daniel Janes Beutel, Lingjuan Lyu, Nicholas D. Lane

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FlowerTune shows federated adapter tuning can build domain-specialized LLMs without sharing raw data while keeping communication and memory within practical limits.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead

Summary TLDR

FlowerTune builds an open, community-driven benchmark and leaderboard to evaluate federated fine-tuning of LLMs across four domains (general NLP, finance, medical, coding). It provides federated instruction datasets, unified pipelines, and baseline results for 26 models (135M–14B params) using PEFT adapters (LoRA/DoRA), 4-bit quantization, and standard FL algorithms. Key takeaways: model choice matters more than aggregation method; instruct-tuned models outperform non-instruct ones; small models can be viable in resource-constrained FL with adapter tuning.
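As a rough illustration of how adapter-only federated averaging works, the sketch below implements plain FedAvg over flattened adapter tensors in pure Python. The `Adapter` type, the tensor names, and the use of local example counts as weights are assumptions made for illustration; this is not the benchmark's own pipeline code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: adapter tensors flattened to lists of floats, keyed by name.
Adapter = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Adapter, int]]) -> Adapter:
    """Average client adapters, weighting each client by its local example count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("need at least one training example across clients")
    result: Adapter = {}
    for name, reference in updates[0][0].items():
        acc = [0.0] * len(reference)
        for adapter, n in updates:
            weight = n / total
            for i, value in enumerate(adapter[name]):
                acc[i] += weight * value
        result[name] = acc
    return result


# Two illustrative clients holding 1 and 3 examples respectively.
merged = fedavg([
    ({"lora_A": [1.0, 2.0]}, 1),
    ({"lora_A": [3.0, 4.0]}, 3),
])
print(merged)
```

Only the adapter tensors are exchanged and averaged in this sketch; the frozen, quantized base model never leaves a client.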

Problem Statement

Can pre-trained LLMs be fine-tuned in federated (data-private) settings across domains, and which models, adapter methods, and aggregation strategies give the best trade-off between accuracy, communication, and memory under realistic cross-silo constraints?

Main Contribution

FlowerTune LLM Leaderboard: open benchmark and pipelines for federated instruction fine-tuning across four domains (general NLP, finance, medical, code).

Federated datasets and standardized splits that simulate cross-institution data (20–50 clients per domain).

Key Findings

Instruct-tuned base models outperform non-instruct counterparts under the same federated adapter tuning.

Numbers: Qwen2.5-7B instruct avg (GenNLP) 64.50% vs non-instruct 42.79% (Table 3 vs Table 12).

Practical Use: Prefer instruct versions of base models when available; expect ~20–25 percentage-point lifts on evaluated tasks without extra hyperparameter work.

Evidence Ref: Tables 3 and 12

Model choice has larger impact than aggregation algorithm or adapter variant.

Numbers: Aggregation yields marginal changes (GenNLP FedAvg 42.79% vs FedProx 42.92%; Table 7).

Practical Use: Spend time selecting and validating base models for your domain before tuning aggregation or adapter tweaks.

Evidence Ref: Table 7
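For readers unfamiliar with how FedProx differs from FedAvg, the sketch below shows the standard proximal term FedProx adds to each client's local objective, (mu/2)·||w − w_global||², and the resulting SGD step. The function names and the values of `mu` and `lr` are illustrative assumptions, not settings taken from the benchmark.

```python
from typing import List


def proximal_penalty(local: List[float], global_: List[float], mu: float) -> float:
    """FedProx penalty: (mu / 2) * squared distance from the global weights."""
    return 0.5 * mu * sum((w - g) ** 2 for w, g in zip(local, global_))


def proximal_sgd_step(local: List[float], grads: List[float],
                      global_: List[float], mu: float, lr: float) -> List[float]:
    """One local SGD step on task loss plus the proximal penalty.

    The penalty's gradient is mu * (w - w_global), pulling the client back
    toward the global weights. With mu = 0 this is an ordinary SGD step,
    i.e. the local update used under FedAvg.
    """
    return [w - lr * (g + mu * (w - wg))
            for w, g, wg in zip(local, grads, global_)]


# Illustrative values only.
w_global = [0.0, 0.0]
w_local = [1.0, 2.0]
print(proximal_penalty(w_local, w_global, mu=0.1))
print(proximal_sgd_step(w_local, [0.5, 0.5], w_global, mu=0.1, lr=0.1))
```

Because the two methods differ only by this penalty, a small gap between them is plausible when local updates stay close to the global adapter.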

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	64.50%	—	—	MMLU (STEM, Social, Humanities)	Table 3 shows Qwen2.5-7B-Instruct avg 64.50%	Table 3
Accuracy	84.21%	—	—	FPB / FIQA / TFNS	Table 4 reports Gemma2-9B-Instruct avg 84.21%	Table 4
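The Delta column above is empty, but the figures quoted under Key Findings are enough to compute two deltas by hand. The short sketch below does that arithmetic; the helper name `pp_delta` is hypothetical, and the inputs are only the percentages stated in this summary.

```python
def pp_delta(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


# Qwen2.5-7B on GenNLP: instruct (Table 3) vs non-instruct (Table 12).
instruct_gain = pp_delta(64.50, 42.79)

# GenNLP aggregation comparison (Table 7): FedProx vs FedAvg.
aggregation_gain = pp_delta(42.92, 42.79)

print(f"instruct vs non-instruct: {instruct_gain:+.2f} pp")
print(f"FedProx vs FedAvg:        {aggregation_gain:+.2f} pp")
```

On these two comparisons the instruct gain is about 21.7 points while the aggregation gain is about 0.1 points, which is the basis for the claim that model choice outweighs aggregation method.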

What To Try In 7 Days

Run the FlowerTune baseline on your domain using the provided template to reproduce results.

Compare an instruct and non-instruct base model on a held-out domain sample to check gain from instruct tuning.

Test LoRA vs DoRA adapters with 4-bit quant to measure communication and VRAM on your infra (report GBs).
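Before measuring on real hardware, a back-of-the-envelope estimate of communication volume can help set expectations. The sketch below counts LoRA parameters as rank × (d_in + d_out) per adapted matrix and assumes every sampled client downloads and uploads the full adapter each round with no compression. The layer shapes, rank, precision, and round count are placeholders, and DoRA's extra magnitude vector is not modeled.

```python
from typing import List, Tuple


def lora_param_count(shapes: List[Tuple[int, int]], rank: int) -> int:
    """Trainable LoRA parameters for adapted matrices given as (d_out, d_in)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


def total_comm_gb(adapter_params: int, bytes_per_param: int,
                  clients_per_round: int, rounds: int) -> float:
    """Total traffic in GB (1e9 bytes), counting one download and one upload
    of the adapter per sampled client per round."""
    per_transfer_bytes = adapter_params * bytes_per_param
    return 2 * per_transfer_bytes * clients_per_round * rounds / 1e9


# Placeholder configuration: two 4096x4096 projections, rank 16, fp32 adapters.
params = lora_param_count([(4096, 4096), (4096, 4096)], rank=16)
print(params, "adapter parameters")
print(total_comm_gb(params, bytes_per_param=4, clients_per_round=4, rounds=10), "GB")
```

Comparing this estimate with the traffic actually observed on your infrastructure is a quick check that only adapter weights are being exchanged.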

Optimization Features

Infra Optimization

experiments target single A100/H100 GPU feasibility

DoRA rank/alpha tuned for very large models to fit single GPU

Model Optimization

LoRA

System Optimization

selective client sampling (20% per round)

low-rank adapters to reduce transmitted bytes
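Sampling a fixed fraction of clients each round can be sketched as follows. The function name, the rounding rule, and the floor of one client are assumptions for illustration; the benchmark's actual sampling logic may differ.

```python
import random
from typing import List, Optional


def sample_clients(client_ids: List[int], fraction: float = 0.2,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Pick a random subset of clients for one round, without replacement."""
    rng = rng or random.Random()
    # Assumed rule: round to the nearest integer, but always keep at least one client.
    k = max(1, round(fraction * len(client_ids)))
    return rng.sample(client_ids, k)


# With 20 clients and a 20% sampling rate, 4 clients train in each round.
clients = list(range(20))
selected = sample_clients(clients, fraction=0.2, rng=random.Random(0))
print(len(selected), "clients selected")
```

Passing a seeded `random.Random` keeps the selection reproducible across runs, which matters when comparing models under identical conditions.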

Training Optimization

parameter-efficient fine-tuning (fewer trainable params)

FlashAttention-2 for faster attention computation

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/yan-gao-GY/flowertune-benchmark

https://flower.ai/benchmarks/llm-leaderboard

Data URLs

https://huggingface.co/datasets/flwrlabs/alpaca-gpt4

https://huggingface.co/datasets/flwrlabs/fingpt-sentiment-train

https://huggingface.co/datasets/flwrlabs/medical-meadow-medical-flashcards

https://huggingface.co/datasets/flwrlabs/code-alpaca-20k

Risks & Boundaries

Limitations

Focus is cross-silo FL; results may not transfer to extreme cross-device edge cases.

Unified hyperparameters used for fairness; per-model tuning could improve results.

When Not To Use

If you require very large models (>14B) without adapter reductions.

When per-client on-device compute is extremely limited (microcontrollers) and latency matters more than VRAM footprint.

Failure Modes

Heterogeneous client data can cause uneven generalization across sub-datasets (noted in medical and code domains).

Communication spikes if client count or adapter size increases.

Core Entities

Models

Gemma2-9B-Instruct, Qwen2.5-7B-Instruct, Phi-4-mini-Instruct, Mistral-7B-Instruct-v0.3, Llama3.1-8B-Instruct, SmolLM2-135M-Instruct, Mistral-24B-Instruct-2501

Metrics

Accuracy, Pass@1, Communication (GB), VRAM (GB)

Datasets

alpaca-gpt4 (general NLP), fingpt-sentiment-train (finance), medical-flashcards (medical), code-alpaca-20k (code), MMLU, FPB, FIQA, TFNS, PubMedQA, MedMCQA, MedQA, CareQA, MBPP, HumanEval, MultiPL-E

Benchmarks

FlowerTune LLM Leaderboard

Context Entities

Models

Gemma-2-9B, Qwen2.5-3B, Phi-4 (14B), Mistral-7B, SmolLM2 family, Mistral-Small-24B

Metrics

pass@1, Accuracy

Datasets

alpaca-gpt4, code-alpaca-20k, fingpt-sentiment-train, medical-flashcards

Benchmarks

FedLLM-Bench, FedScope-LLM, OpenFedLLM

