FedLLM-Bench: first realistic, user-split benchmark for federated fine-tuning of LLMs

Overview

Decision Snapshot: Needs Validation

The benchmark provides practical, user-split datasets and baselines that reliably show federated gains on evaluated tasks. It is well suited for prototyping and comparative research, but has limited model diversity (Llama2/Alpaca only) and does not cover full production-scale safety testing.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.90

Risk Signals: 7

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 50%

Authors

Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FedLLM-Bench gives engineering teams ready, realistic user-split data and baselines so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

This paper releases FedLLM-Bench: four real user-split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA), an integrated codebase with 8 federated training baselines, and 6 evaluation metrics. Experiments with Llama2-7B and Alpaca7B show that federated training usually outperforms local-only training on instruction tuning and preference alignment, but no single FL algorithm dominates every language or task. The repo and datasets are public for reproducible FedLLM work.

Problem Statement

Existing FedLLM work uses artificial dataset splits and inconsistent setups. That hides real-world client heterogeneity (language, size, quality, length, preferences) and blocks fair comparisons. The field needs realistic, user-split datasets and a standard benchmark.

Main Contribution

A realistic FedLLM benchmark with four naturally user-split datasets spanning 38–747 clients and 6k–53k samples.

An integrated codebase implementing 8 federated baselines and 6 evaluation metrics, released publicly.
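To make "naturally user-split" concrete, here is a minimal sketch of grouping records into one federated client per real user, rather than partitioning a pooled dataset synthetically. The record fields (`user_id`, `prompt`, `response`) and the helper name are assumptions made for illustration; they are not the benchmark's actual schema or API.

```python
from collections import defaultdict


def split_by_user(records, user_key="user_id"):
    """Group records so that each distinct user becomes one client."""
    clients = defaultdict(list)
    for rec in records:
        clients[rec[user_key]].append(rec)
    return dict(clients)


# Toy records with hypothetical field names.
records = [
    {"user_id": "u1", "prompt": "Hi", "response": "Hello"},
    {"user_id": "u2", "prompt": "Bonjour", "response": "Salut"},
    {"user_id": "u1", "prompt": "Bye", "response": "Goodbye"},
]

clients = split_by_user(records)
# Client sizes differ naturally, which is one source of heterogeneity.
sizes = {uid: len(recs) for uid, recs in clients.items()}
print(sizes)
```

Because each client keeps whatever its user actually wrote, differences in language, sample count, and length carry over into the split instead of being simulated.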

Key Findings

Federated training improves average instruction-following compared to local-only training.

Numbers: Fed-ChatbotIT average score: Local 5.00 → FedAvg 5.51 (Δ +0.51) on open metrics

Practical Use: If you can pool models via FL, expect modest but consistent quality gains over isolated local fine-tuning.

Evidence Ref: Table 3 (Fed-ChatbotIT averages)
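For readers new to FedAvg, the aggregation step behind this comparison can be sketched as a sample-weighted average of client parameters. This is the textbook FedAvg rule on plain Python lists, not code taken from the FedLLM-Bench repository; the function name and toy values are illustrative.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(num_params)
    ]


# Two toy clients: the second holds three times as much data,
# so it pulls the global parameters toward its own values.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)  # [2.5, 3.5]
```

Local-only training skips this step entirely: each client keeps its own parameters and never benefits from the other clients' data.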

No single FL algorithm is uniformly best across languages and tasks.

Numbers: Fed-Aya averages: FedAvg 4.90, FedProx 4.92, SCAFFOLD 4.97 (per-language wins vary)

Practical Use: Run a small algorithm sweep (FedAvg, FedProx, SCAFFOLD) per deployment; pick the one that fits your language/task mix.

Evidence Ref: Table 2 (Fed-Aya per-language results)
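One way to see why the algorithms can rank differently is to look at how they change the local update. The sketch below shows the standard FedProx modification: a proximal term `mu * (w - w_global)` is added to the local gradient so that client models do not drift far from the global model. Setting `mu=0` recovers the plain SGD step used in FedAvg local training. The function name, learning rate, and `mu` values are illustrative assumptions, not the benchmark's settings.

```python
def fedprox_step(w, w_global, grad, lr=0.1, mu=0.01):
    """One local SGD step with the FedProx proximal term added to the gradient."""
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, wgi, gi in zip(w, w_global, grad)
    ]


# With a zero task gradient, the proximal term alone pulls w toward w_global.
pulled = fedprox_step([1.0], [0.0], [0.0], lr=0.5, mu=1.0)

# With mu = 0 the step is ordinary SGD, as in FedAvg local training.
plain = fedprox_step([1.0], [0.0], [2.0], lr=0.5, mu=0.0)
print(pulled, plain)  # [0.5] [0.0]
```

How much this pull helps plausibly depends on how heterogeneous the clients are, which is one reason a per-deployment sweep is more informative than a single default.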

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Fed-ChatbotIT average (open+closed metrics)	Local 5.00 → FedAvg 5.51	Local training (no FL)	+0.51	Fed-ChatbotIT	Table 3 average column	Table 3
Ref-GPT4 (single-turn)	Local 4.50 → FedAvg 5.88	Local training	+1.38	Fed-WildChat (single-turn)	Table 4 single-turn Ref-GPT4	Table 4

What To Try In 7 Days

Run FedAvg on one dataset to reproduce the reported average gains over local training.

Evaluate 2–3 FL optimizers (FedAvg, FedProx, SCAFFOLD) on your language mix and pick the best.

Test user-level DP at a mid-range epsilon (e.g., 0.1) to measure utility drop vs privacy benefit.
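For the user-level DP experiment, a common recipe is to clip each client's update to a fixed L2 norm, average the clipped updates, and add Gaussian noise scaled to the clipping bound. The sketch below follows that generic recipe; it is not the paper's implementation. `clip_norm`, `noise_multiplier`, and the helper names are assumptions, and translating a target epsilon into a noise multiplier requires a privacy accountant that is not shown here.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """Average clipped client updates and add Gaussian noise to the mean."""
    rng = random.Random(seed)
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    n = len(clipped)
    dim = len(clipped[0])
    # Noise on the sum has std noise_multiplier * clip_norm; dividing by n
    # gives the std of the noise on the mean.
    sigma = noise_multiplier * clip_norm / n
    return [
        sum(u[i] for u in clipped) / n + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]


updates = [[3.0, 4.0], [0.0, 0.0]]
noiseless = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=0.0)
noisy = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0)
```

Comparing evaluation scores with and without the noise at a fixed clipping bound gives a first estimate of the utility cost of the privacy setting.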

Optimization Features

Training Optimization

LoRA; 8-bit quantization on base models
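As a rough illustration of why LoRA suits federated fine-tuning, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original `d_out x d_in` matrix and trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The hidden size 4096 matches Llama2-7B; the rank 16 is an assumed value chosen for the example, not a setting confirmed by the source.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # full fine-tuning updates every entry
    lora = rank * (d_out + d_in)   # LoRA trains only the two low-rank factors
    return full, lora


full, lora = lora_param_counts(4096, 4096, 16)
ratio = lora / full
print(full, lora, ratio)  # 16777216 131072 0.0078125
```

Under these assumptions each adapted matrix needs under 1% of the parameters of full fine-tuning, which shrinks what every client has to train and, if only adapters are exchanged, what it has to communicate each round.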

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/rui-ye/FedLLM-Bench

Data URLs

https://github.com/rui-ye/FedLLM-Bench

Risks & Boundaries

Limitations

Experiments use only Llama2-7B for instruction tuning and Alpaca7B for preference alignment; results may not generalize to other model families and sizes.

Safety and red-teaming are not comprehensively covered; raw chats may include unsafe content left intact for research.

When Not To Use

If you need benchmarks across many model families and sizes beyond Llama2/Alpaca.

If you require a fully curated, safety-filtered dataset for production deployment.

Failure Modes

Algorithm ranking may change with different base models or hyperparameters.

GPT-4 judging (Ref-GPT4) can encode judge bias and metric noise.

Core Entities

Models

Llama2-7B, Alpaca7B, text-embedding-ada-002, GPT-4

Metrics

MT-Bench, Vicuna, AdvBench, Ref-GPT4, MMLU, HumanEval

Datasets

Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA

Benchmarks

FedLLM-Bench, LEAF
