Overview
The benchmark provides realistic, user-split datasets and baselines under which federated training usually outperforms local-only training on the evaluated tasks. It is well suited to prototyping and comparative research, but model diversity is limited (Llama2-7B and Alpaca7B only) and production-scale safety testing is out of scope.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.90
Risk Signals: 7
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 65%
Novelty: 50%
Why It Matters For Business
FedLLM-Bench gives engineering teams ready, realistic user-split data and baselines so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.
Summary TLDR
This paper releases FedLLM-Bench: four real user-split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA), an integrated codebase with 8 federated training baselines, and 6 evaluation metrics. Experiments with Llama2-7B and Alpaca7B show that federated training usually outperforms local-only training on instruction tuning and preference alignment, but no single FL algorithm dominates every language or task. The repo and datasets are public for reproducible FedLLM work.
Problem Statement
Existing FedLLM work uses artificial dataset splits and inconsistent setups. That hides real-world client heterogeneity (language, size, quality, length, preferences) and blocks fair comparisons. The field needs realistic, user-split datasets and a standard benchmark.
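To make the difference between an artificial split and a natural user split concrete, here is a minimal sketch that groups samples by the user who produced them, so each user becomes one federated client. The field names (`user_id`, `lang`, `text`) and the toy records are illustrative assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict


def split_by_user(samples, user_key="user_id"):
    """Group samples by their originating user; each user is one client."""
    clients = defaultdict(list)
    for sample in samples:
        clients[sample[user_key]].append(sample)
    return dict(clients)


# Toy records with hypothetical field names (not the benchmark's schema).
samples = [
    {"user_id": "u1", "lang": "en", "text": "Summarize this article."},
    {"user_id": "u1", "lang": "en", "text": "Translate this to French."},
    {"user_id": "u2", "lang": "zh", "text": "写一首诗"},
    {"user_id": "u3", "lang": "en", "text": "Explain FedAvg."},
    {"user_id": "u1", "lang": "en", "text": "Fix this bug."},
]

clients = split_by_user(samples)

# Client sizes and languages fall out of the data instead of being imposed.
client_sizes = {user: len(items) for user, items in clients.items()}
client_langs = {user: {s["lang"] for s in items} for user, items in clients.items()}
```

Under this kind of split, client size and language mix are properties of the users themselves, which is the heterogeneity that artificial partitions tend to hide.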
Main Contribution
A realistic FedLLM benchmark with four naturally user-split datasets spanning 38–747 clients and 6k–53k samples.
An integrated codebase implementing 8 federated baselines and 6 evaluation metrics, released publicly.
Key Findings
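Federated training improves average instruction-following scores over local-only training (for example, 5.00 → 5.51 on the Fed-ChatbotIT average and 4.50 → 5.88 on single-turn Ref-GPT4 for Fed-WildChat).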
No single FL algorithm is uniformly best across languages and tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Fed-ChatbotIT average (open+closed metrics) | Local 5.00 → FedAvg 5.51 | Local training (no FL) | +0.51 | Fed-ChatbotIT | Table 3 average column | Table 3 |
| Ref-GPT4 (single-turn) | Local 4.50 → FedAvg 5.88 | Local training | +1.38 | Fed-WildChat (single-turn) | Table 4 single-turn Ref-GPT4 | Table 4 |
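
The deltas in the table can be checked, and turned into relative gains, with a few lines of arithmetic. This sketch only reuses the Local and FedAvg scores reported in the two rows above; the relative-gain percentage is a derived convenience figure, not a number taken from the paper.

```python
# (metric name, local-only score, FedAvg score) copied from the table rows.
results = [
    ("Fed-ChatbotIT average", 5.00, 5.51),
    ("Fed-WildChat Ref-GPT4 (single-turn)", 4.50, 5.88),
]


def gain(local, fed):
    """Return (absolute delta, relative gain in percent) of fed over local."""
    delta = fed - local
    return round(delta, 2), round(100 * delta / local, 1)


gains = {name: gain(local, fed) for name, local, fed in results}

for name, (delta, pct) in gains.items():
    print(f"{name}: +{delta:.2f} ({pct:.1f}% over local)")
```

The absolute deltas match the table (+0.51 and +1.38), which correspond to roughly 10.2% and 30.7% over the local-only scores.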
What To Try In 7 Days
Run FedAvg on one dataset to reproduce the reported average gains over local training.
Evaluate 2–3 FL optimizers (FedAvg, FedProx, SCAFFOLD) on your language mix and pick the best.
Test user-level differential privacy (DP) at a mid-range epsilon (e.g., 0.1) to measure the utility drop against the privacy benefit.
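
As orientation for the first and third experiments, the sketch below shows the core of FedAvg aggregation (a sample-size-weighted average of client parameters) and one common way to add user-level privacy (clip each client's update, average uniformly, add Gaussian noise). It is a minimal illustration on plain Python lists, not the benchmark's codebase. The function names, `max_norm`, and `noise_std` are assumptions, and mapping a noise level to a target epsilon needs a privacy accountant that is not shown here.

```python
import math
import random


def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * p[i] for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


def clip_l2(update, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(client_updates, max_norm, noise_std, rng):
    """Clip each client's update, average uniformly, add Gaussian noise.

    Hypothetical helper: noise_std is a free parameter here; converting it
    to an (epsilon, delta) guarantee is outside this sketch.
    """
    clipped = [clip_l2(u, max_norm) for u in client_updates]
    n = len(clipped)
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    return [m + rng.gauss(0.0, noise_std) for m in mean]


# Two toy clients: the second holds three times as much data.
avg = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])

# With noise_std=0.0 the result is just the mean of the clipped updates.
rng = random.Random(0)
private_avg = dp_average([[3.0, 4.0], [0.0, 0.0]], max_norm=1.0, noise_std=0.0, rng=rng)
```

Uniform averaging is used in the private variant because weighting by client size would let a single large client dominate the sensitivity bound; this is a design choice of the sketch, not a statement about how the benchmark implements DP.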
Risks & Boundaries
Limitations
Experiments use only Llama2-7B for instruction tuning and Alpaca7B for preference alignment; results may not generalize to other model families and sizes.
Safety and red-teaming are not comprehensively covered; the raw chats may contain unsafe content, which is left intact for research purposes.
When Not To Use
If you need benchmarks across many model families and sizes beyond Llama2/Alpaca.
If you require a fully curated, safety-filtered dataset for production deployment.
Failure Modes
Algorithm ranking may change with different base models or hyperparameters.
GPT-4 judging (Ref-GPT4) can encode judge bias and metric noise.

