Overview
Production Readiness
0.65
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
3
Why It Matters For Business
FedLLM-Bench gives engineering teams ready-made, realistic user-split data and baselines, so they can test federated fine-tuning, compare FL optimizers, and measure privacy/utility trade-offs without building custom datasets.
Summary TLDR
This paper releases FedLLM-Bench: four real user-split datasets (Fed-Aya, Fed-ChatbotIT, Fed-WildChat, Fed-ChatbotPA), an integrated codebase with 8 federated training baselines, and 6 evaluation metrics. Experiments with Llama2-7B and Alpaca7B show that federated training usually outperforms local-only training on instruction tuning and preference alignment, but no single FL algorithm dominates every language or task. The repo and datasets are public for reproducible FedLLM work.
Problem Statement
Existing FedLLM work uses artificial dataset splits and inconsistent experimental setups. This hides real-world client heterogeneity (language, dataset size, quality, sequence length, preferences) and prevents fair comparisons between methods. The field needs realistic, user-split datasets and a standard benchmark.
Main Contribution
A realistic FedLLM benchmark with four naturally user-split datasets spanning 38–747 clients and 6k–53k samples.
An integrated codebase implementing 8 federated baselines and 6 evaluation metrics, released publicly.
Extensive experiments demonstrating federated gains, multilingual collaboration effects, and user-level differential privacy trade-offs.
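To make "naturally user-split" concrete, here is a minimal sketch of grouping records into one client per real user, as opposed to an artificial random or label-skewed split. The field names (`user_id`, `prompt`, `lang`) and the `min_samples` filter are illustrative assumptions, not the schema or preprocessing used by FedLLM-Bench.

```python
from collections import defaultdict


def split_by_user(records, user_key="user_id", min_samples=1):
    """Group records into per-client datasets keyed by the user who wrote them.

    `user_key` and `min_samples` are hypothetical names for this sketch.
    """
    clients = defaultdict(list)
    for rec in records:
        clients[rec[user_key]].append(rec)
    # Drop users with too few samples to train on.
    return {u: recs for u, recs in clients.items() if len(recs) >= min_samples}


# Toy records standing in for real chat logs.
records = [
    {"user_id": "u1", "prompt": "Translate 'hello' to French", "lang": "en"},
    {"user_id": "u2", "prompt": "¿Qué es el aprendizaje federado?", "lang": "es"},
    {"user_id": "u1", "prompt": "Summarize this paragraph", "lang": "en"},
]

clients = split_by_user(records)
sizes = {u: len(recs) for u, recs in clients.items()}
print(sizes)  # {'u1': 2, 'u2': 1}
```

Because each client keeps whatever its user actually wrote, differences in language, sample count, and length between clients arise from the data itself rather than from a sampling scheme.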
Key Findings
Federated training improves average instruction-following performance over local-only training.
No single FL algorithm is uniformly best across languages and tasks.
Federated training shows large absolute gains on some conversational benchmarks.
User-level differential privacy can be applied at a modest utility cost for some privacy settings.
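The user-level DP finding can be illustrated with the common clip-then-add-Gaussian-noise recipe applied to whole client updates. This is a sketch under assumptions: this summary does not specify the paper's exact mechanism, and the names `clip_norm` and `noise_multiplier` are illustrative. Translating a target epsilon into a noise multiplier needs a privacy accountant, which is left out here.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a client's update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates and add Gaussian noise per coordinate.

    After clipping, one user changes the sum by at most clip_norm, so the
    noise on the mean has standard deviation noise_multiplier * clip_norm / n.
    """
    n = len(updates)
    clipped = [clip_update(u, clip_norm) for u in updates]
    dim = len(updates[0])
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    sigma = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Toy example: two clients, two parameters each.
updates = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy = dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
exact = dp_average(updates, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
print(noisy, exact)
```

A larger `noise_multiplier` gives stronger privacy and a noisier aggregate, which is the trade-off the finding above describes.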
Results
Fed-ChatbotIT average (open+closed metrics)
Ref-GPT4 (single-turn)
Multilingual average (Ref-GPT4)
MT-Bench under DP (WildChat)
Who Should Care
What To Try In 7 Days
Run FedAvg on one dataset to reproduce the reported average gains over local training.
Evaluate 2–3 FL optimizers (FedAvg, FedProx, SCAFFOLD) on your language mix and pick the best.
Test user-level DP at a mid-range epsilon (e.g., 0.1) to measure the utility drop against the privacy benefit.
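For orientation before running the steps above, here is a minimal sketch of the two simplest pieces being compared: FedAvg's sample-weighted parameter average and the proximal penalty that FedProx adds to each client's local loss. It uses plain Python lists in place of tensors; the parameter name `lora_A` and the toy values are illustrative assumptions, and the benchmark's own codebase will differ.

```python
def fedavg(client_params, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    first = client_params[0]
    return {
        name: [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(first[name]))
        ]
        for name in first
    }


def fedprox_penalty(local, global_, mu):
    """FedProx proximal term: (mu / 2) * ||local - global||^2."""
    return (mu / 2) * sum((l - g) ** 2 for l, g in zip(local, global_))


# Two toy clients holding 1 and 3 samples respectively.
client_params = [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 6.0]}]
client_sizes = [1, 3]

global_params = fedavg(client_params, client_sizes)
print(global_params)  # {'lora_A': [2.5, 5.0]}

# Penalty a client would add to its loss for drifting from the global model.
penalty = fedprox_penalty([1.0, 2.0], global_params["lora_A"], mu=0.1)
print(penalty)
```

With `mu=0` the FedProx penalty vanishes and local training matches FedAvg's, which makes the pair a convenient first comparison on a new language mix.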
Optimization Features
Training Optimization
- LoRA
- 8-bit quantization on base models
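LoRA matters in a federated setting because clients only train and exchange the low-rank adapter matrices, not the full weights. The arithmetic below sketches the saving for a single square projection of width 4096 (the hidden size of Llama2-7B). The rank of 16 is an illustrative assumption, not a setting taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full fine-tuning trains the d_out x d_in matrix. LoRA instead trains
    A (rank x d_in) and B (d_out x rank) and keeps the base weight frozen.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


full, lora = lora_param_counts(4096, 4096, rank=16)
print(full, lora, lora / full)  # 16777216 131072 0.0078125
```

At this rank, each adapted matrix contributes under 1% of the parameters that full fine-tuning would send per round.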
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use only Llama2-7B for instruction tuning and Alpaca7B for preference alignment; results may not generalize to other model families and sizes.
- Safety and red-teaming are not comprehensively covered; raw chats may include unsafe content left intact for research.
When Not To Use
- If you need benchmarks across many model families and sizes beyond Llama2/Alpaca.
- If you require a fully curated, safety-filtered dataset for production deployment.
Failure Modes
- Algorithm ranking may change with different base models or hyperparameters.
- GPT-4 judging (Ref-GPT4) can encode judge bias and metric noise.
- Client datasets still reflect source biases and may not represent all deployment user bases.
Core Entities
Models
- Llama2-7B
- Alpaca7B
- text-embedding-ada-002
- GPT-4
Metrics
- MT-Bench
- Vicuna
- AdvBench
- Ref-GPT4
- MMLU
- HumanEval
Datasets
- Fed-Aya
- Fed-ChatbotIT
- Fed-WildChat
- Fed-ChatbotPA
Benchmarks
- FedLLM-Bench
- LEAF

