FedLLM-Bench: first realistic, user-split benchmark for federated fine-tuning of LLMs

June 7, 2024 · 6 min

Overview

Production Readiness

0.65

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

3

Authors

Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen

Why It Matters For Business

FedLLM-Bench gives engineering teams ready, realistic user-split data and baselines so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.

Summary TLDR

This paper releases FedLLM-Bench: four real user-split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA), an integrated codebase with 8 federated training baselines, and 6 evaluation metrics. Experiments with Llama2-7B and Alpaca7B show that federated training usually outperforms local-only training on instruction tuning and preference alignment, but no single FL algorithm dominates every language or task. The repo and datasets are public for reproducible FedLLM work.

Problem Statement

Existing FedLLM work uses artificial dataset splits and inconsistent setups. That hides real-world client heterogeneity (language, size, quality, length, preferences) and blocks fair comparisons. The field needs realistic, user-split datasets and a standard benchmark.

Main Contribution

A realistic FedLLM benchmark with four naturally user-split datasets spanning 38–747 clients and 6k–53k samples.

An integrated codebase implementing 8 federated baselines and 6 evaluation metrics, released publicly.

Extensive experiments demonstrating federated gains, multilingual collaboration effects, and user-level differential privacy trade-offs.

Key Findings

Federated training improves average instruction-following compared to local-only training.

Numbers: Fed-ChatbotIT average score: Local 5.00 → FedAvg 5.51 (Δ +0.51) on open metrics

No single FL algorithm is uniformly best across languages and tasks.

Numbers: Fed-Aya averages: FedAvg 4.90, FedProx 4.92, SCAFFOLD 4.97 (per-language wins vary)

Federated training shows large absolute gains on some conversational benchmarks.

Numbers: Fed-WildChat Ref-GPT4 (single-turn): Local 4.50 → FedAvg 5.88 (Δ +1.38)

User-level differential privacy can be applied with modest cost at some privacy settings.

Numbers: Fed-WildChat MT-Bench: FedAvg 4.6875 vs FedDP-0.1 4.5375 and Local 4.0875
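
To make the user-level DP finding concrete, here is a minimal sketch of one common way to privatize a federated round: clip each client's update to a fixed L2 norm, average, then add Gaussian noise scaled to the clipping bound. This is the generic clip-and-noise recipe, not necessarily the exact procedure in the FedLLM-Bench codebase. The names `clip_update`, `dp_aggregate`, `clip_norm`, and `noise_multiplier` are illustrative assumptions, and mapping a noise level to a formal privacy budget requires a privacy accountant that is not shown.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_aggregate(client_updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates and add Gaussian noise.

    Hypothetical helper: each client contributes at most clip_norm to the
    sum, so the mean has sensitivity clip_norm / n and the noise standard
    deviation is noise_multiplier * clip_norm / n.
    """
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    n = len(clipped)
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    sigma = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, sigma) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    updates = [[3.0, 4.0], [0.0, 0.5], [-1.0, 0.2]]  # toy client updates
    print(dp_aggregate(updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

A larger `noise_multiplier` gives stronger privacy at the cost of noisier aggregates, which is the trade-off the MT-Bench numbers above illustrate.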

Results

Fed-ChatbotIT average (open+closed metrics)

Value: Local 5.00 → FedAvg 5.51

Baseline: Local training (no FL)

Ref-GPT4 (single-turn)

Value: Local 4.50 → FedAvg 5.88

Baseline: Local training

Multilingual average (Ref-GPT4)

Value: Local per-language avg ≈ 4.2–4.7 → FedAvg 4.90

Baseline: Local per-language training

MT-Bench under DP (WildChat)

Value: FedAvg 4.6875; FedDP-0.1 4.5375; Local 4.0875

Baseline: FedAvg without DP and local training
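
The gains and the DP cost implied by these results can be recomputed directly from the reported scores. The short sketch below does only that arithmetic. The dictionary layout and the `gain` helper are illustrative assumptions, and the scores are the ones listed on this page.

```python
# Scores copied from the results listed above.
RESULTS = {
    "Fed-ChatbotIT average": {"local": 5.00, "fedavg": 5.51},
    "Fed-WildChat Ref-GPT4 (single-turn)": {"local": 4.50, "fedavg": 5.88},
    "Fed-WildChat MT-Bench": {
        "local": 4.0875,
        "fedavg": 4.6875,
        "feddp_0.1": 4.5375,
    },
}


def gain(scores, method, baseline="local"):
    """Absolute score difference of `method` over `baseline` (hypothetical helper)."""
    return round(scores[method] - scores[baseline], 4)


if __name__ == "__main__":
    for name, scores in RESULTS.items():
        print(f"{name}: FedAvg vs local = {gain(scores, 'fedavg'):+.4f}")

    mt = RESULTS["Fed-WildChat MT-Bench"]
    # Utility given up by adding DP, and what remains over local training.
    print(f"DP cost vs FedAvg: {gain(mt, 'feddp_0.1', 'fedavg'):+.4f}")
    print(f"DP gain vs local:  {gain(mt, 'feddp_0.1'):+.4f}")
```

On the MT-Bench row, FedDP-0.1 gives up 0.15 points relative to FedAvg while staying 0.45 points above local training.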

Who Should Care

What To Try In 7 Days

Run FedAvg on one dataset to reproduce the reported average gains over local training.

Evaluate 2–3 FL optimizers (FedAvg, FedProx, SCAFFOLD) on your language mix and pick the best.

Test user-level DP at a mid-range epsilon (e.g., 0.1) to measure utility drop vs privacy benefit.
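
For orientation before running the benchmark code, the core of FedAvg is a sample-size-weighted average of client parameters. The sketch below shows that aggregation step on plain Python lists. It is a toy illustration, not the FedLLM-Bench implementation, and the function name `fedavg` and the flat-list parameter representation are assumptions. In a LoRA setup, the vectors would presumably be the flattened adapter weights rather than the full model.

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes:  number of local training samples per client, used as weights.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("need one size per client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two toy clients: the second has three times as much data,
    # so the global parameters land closer to its values.
    global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
    print(global_params)  # [2.5, 3.5]
```

FedProx and SCAFFOLD change the local training step or add correction terms, but a server-side aggregation of this kind remains the shared starting point when comparing them.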

Optimization Features

Training Optimization

  • LoRA
  • 8-bit quantization on base models
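
A rough calculation shows why these two choices matter for cost. LoRA trains two low-rank factors per adapted weight matrix instead of the full matrix, and 8-bit storage halves weight memory relative to 16-bit. The sketch below is back-of-the-envelope arithmetic only. The rank of 16 and the round figure of 7 billion parameters are illustrative assumptions, not settings taken from the paper. The 4096 hidden size is that of Llama2-7B.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_out * d_in


def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    hidden = 4096  # Llama2-7B hidden size
    rank = 16      # hypothetical LoRA rank, chosen for illustration
    lora = lora_trainable_params(hidden, hidden, rank)
    dense = full_params(hidden, hidden)
    print(f"LoRA params per 4096x4096 projection: {lora} ({lora / dense:.2%} of dense)")

    n = 7e9  # nominal parameter count for a "7B" model
    print(f"16-bit weights: {weight_memory_gb(n, 16):.1f} GB")
    print(f"8-bit weights:  {weight_memory_gb(n, 8):.1f} GB")
```

Smaller trainable adapters also mean smaller per-round uploads in a federated setting, since only the adapter weights need to be communicated.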

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use only Llama2-7B for instruction tuning and Alpaca7B for preference alignment; results may not generalize to other model families and sizes.
  • Safety and red-teaming are not comprehensively covered; raw chats may include unsafe content left intact for research.

When Not To Use

  • If you need benchmarks across many model families and sizes beyond Llama2/Alpaca.
  • If you require a fully curated, safety-filtered dataset for production deployment.

Failure Modes

  • Algorithm ranking may change with different base models or hyperparameters.
  • GPT-4 judging (Ref-GPT4) can encode judge bias and metric noise.
  • Client datasets still reflect source biases and may not represent all deployment user bases.

Core Entities

Models

  • Llama2-7B
  • Alpaca7B
  • text-embedding-ada-002
  • GPT-4

Metrics

  • MT-Bench
  • Vicuna
  • AdvBench
  • Ref-GPT4
  • MMLU
  • HumanEval

Datasets

  • Fed-Aya
  • Fed-ChatbotIT
  • Fed-WildChat
  • Fed-ChatbotPA

Benchmarks

  • FedLLM-Bench
  • LEAF