FedLLM-Bench: first realistic, user-split benchmark for federated fine-tuning of LLMs

June 7, 2024 · 6 min read

Overview

Decision Snapshot: Needs Validation

The benchmark provides realistic, user-split datasets and baselines, and its experiments show federated training usually beating local-only training on the evaluated tasks. It is well suited for prototyping and comparative research, but model diversity is limited (Llama2-7B and Alpaca7B only) and production-scale safety testing is not covered.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 50%

Authors

Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FedLLM-Bench gives engineering teams ready, realistic user-split data and baselines so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.

Summary TLDR

This paper releases FedLLM-Bench: four real user-split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA), an integrated codebase with 8 federated training baselines, and 6 evaluation metrics. Experiments with Llama2-7B and Alpaca7B show that federated training usually outperforms local-only training on instruction tuning and preference alignment, but no single FL algorithm dominates every language or task. The repo and datasets are public for reproducible FedLLM work.

Problem Statement

Existing FedLLM work uses artificial dataset splits and inconsistent setups. That hides real-world client heterogeneity (language, size, quality, length, preferences) and blocks fair comparisons. The field needs realistic, user-split datasets and a standard benchmark.
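
To make the idea of a user split concrete, here is a minimal sketch that groups records into one client per user instead of partitioning them artificially. The field names (`user_id`, `lang`, `text`) and the toy records are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

def split_by_user(records, user_key="user_id"):
    """Group records into one client per user (a natural, user-level split)."""
    clients = defaultdict(list)
    for record in records:
        clients[record[user_key]].append(record)
    return dict(clients)

# Hypothetical toy records; field names are assumptions, not the benchmark schema.
records = [
    {"user_id": "u1", "lang": "en", "text": "Summarize this article."},
    {"user_id": "u2", "lang": "es", "text": "Explain photosynthesis."},
    {"user_id": "u1", "lang": "en", "text": "Write a haiku."},
    {"user_id": "u3", "lang": "fr", "text": "Translate this sentence."},
]

clients = split_by_user(records)
# Client sizes differ naturally, which is one source of heterogeneity.
sizes = {user: len(items) for user, items in clients.items()}
print(sizes)
```

With a split like this, client sizes, languages, and prompt lengths vary the way real users do rather than following a sampling scheme chosen by the experimenter.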

Main Contribution

A realistic FedLLM benchmark with four naturally user-split datasets spanning 38–747 clients and 6k–53k samples.

An integrated codebase implementing 8 federated baselines and 6 evaluation metrics, released publicly.

Key Findings

Federated training improves average instruction-following compared to local-only training.

Numbers: Fed-ChatbotIT average score: Local 5.00 → FedAvg 5.51 (+0.51) on open metrics

Practical Use: If you can pool models via FL, expect modest but consistent quality gains over isolated local fine-tuning.

Evidence Ref: Table 3 (Fed-ChatbotIT averages)
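
For readers unfamiliar with FedAvg, the following is a minimal sketch of its standard aggregation step: the server averages client parameters, weighting each client by its local sample count. It uses flat Python lists and toy values as stand-ins for model parameters and is not taken from the benchmark's codebase.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Toy values for illustration only: three clients, two parameters each.
client_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
client_sizes = [10, 30, 60]

global_weights = fedavg(client_weights, client_sizes)
print(global_weights)
```

Local-only training corresponds to skipping this step entirely, so each client keeps only what it learned from its own data.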

No single FL algorithm is uniformly best across languages and tasks.

Numbers: Fed-Aya averages: FedAvg 4.90, FedProx 4.92, SCAFFOLD 4.97 (per-language wins vary)

Practical Use: Run a small algorithm sweep (FedAvg, FedProx, SCAFFOLD) per deployment; pick the one that fits your language/task mix.

Evidence Ref: Table 2 (Fed-Aya per-language results)
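
One way to turn a sweep into a decision is to weight each algorithm's per-language score by your own traffic mix. The sketch below assumes you have already measured per-language scores. The numbers are placeholders for illustration and are not taken from the paper; `pick_algorithm` is a hypothetical helper.

```python
def pick_algorithm(scores, language_mix):
    """Return the algorithm with the best mix-weighted score, plus all weighted scores."""
    total = sum(language_mix.values())
    weighted = {
        algo: sum(per_lang[lang] * share for lang, share in language_mix.items()) / total
        for algo, per_lang in scores.items()
    }
    best = max(weighted, key=weighted.get)
    return best, weighted

# Placeholder scores for illustration only; NOT results from the paper.
scores = {
    "FedAvg": {"en": 5.0, "es": 4.0},
    "FedProx": {"en": 4.0, "es": 5.0},
    "SCAFFOLD": {"en": 4.5, "es": 4.5},
}
# Assumed deployment mix: 75% English traffic, 25% Spanish.
language_mix = {"en": 0.75, "es": 0.25}

best, weighted = pick_algorithm(scores, language_mix)
print(best, weighted)
```

Changing the mix can change the winner, which is the practical consequence of no algorithm dominating every language.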

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Fed-ChatbotIT average (open+closed metrics) | Local 5.00 → FedAvg 5.51 | Local training (no FL) | +0.51 | Fed-ChatbotIT | Table 3 average column | Table 3
Ref-GPT4 (single-turn) | Local 4.50 → FedAvg 5.88 | Local training | +1.38 | Fed-WildChat (single-turn) | Table 4 single-turn Ref-GPT4 | Table 4

What To Try In 7 Days

Run FedAvg on one dataset to reproduce the reported average gains over local training.

Evaluate 2–3 FL optimizers (FedAvg, FedProx, SCAFFOLD) on your language mix and pick the best.

Test user-level differential privacy (DP) at a mid-range epsilon (e.g., 0.1) to measure the utility drop against the privacy benefit.
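
As a rough sketch of what user-level DP involves during aggregation, the common recipe is to clip each user's update to a fixed L2 norm and add Gaussian noise to the average. The parameter names `clip_norm` and `noise_multiplier` are assumptions, and translating a noise multiplier into a target epsilon requires a privacy accountant, which is not shown here. This is not the benchmark's implementation.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale an update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Average clipped per-user updates and add Gaussian noise to the mean."""
    clipped = [clip_update(u, clip_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    # Noise std on the mean: the sum has sensitivity clip_norm, then divide by n.
    sigma = noise_multiplier * clip_norm / n
    return [
        sum(u[i] for u in clipped) / n + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]

# Toy per-user updates for illustration only.
updates = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
noisy = dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(noisy)
```

Comparing evaluation scores with and without the noise term gives a direct measure of the utility cost at a given privacy level.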

Optimization Features

Training Optimization
LoRA, 8-bit quantization on base models
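
For context, LoRA keeps the base weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A, where r is the rank. The sketch below shows that arithmetic with plain nested lists and toy values. It is an illustration of the general technique, not the benchmark's training code, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

# Toy values for illustration only: a 2x2 frozen weight and a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
b = [[1.0], [2.0]]             # trainable, shape (2, r)
a = [[0.5, 0.0]]               # trainable, shape (r, 2)

effective = lora_effective_weight(w, b, a, alpha=2.0)
print(effective)
```

In federated setups that use LoRA, typically only the small A and B matrices are trained and exchanged, which is what keeps communication and memory costs manageable for 7B-scale models.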

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use only Llama2-7B for instruction tuning and Alpaca7B for preference alignment; results may not generalize to other model families and sizes.

Safety and red-teaming are not comprehensively covered; raw chats may include unsafe content left intact for research.

When Not To Use

If you need benchmarks across many model families and sizes beyond Llama2/Alpaca.

If you require a fully curated, safety-filtered dataset for production deployment.

Failure Modes

Algorithm ranking may change with different base models or hyperparameters.

GPT-4 judging (Ref-GPT4) can encode judge bias and metric noise.

Core Entities

Models

Llama2-7B, Alpaca7B, text-embedding-ada-002, GPT-4

Metrics

MT-Bench, Vicuna, AdvBench, Ref-GPT4, MMLU, HumanEval

Datasets

Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA

Benchmarks

FedLLM-Bench, LEAF