Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
9
Why It Matters For Business
FS-LLM lets organizations jointly fine-tune LLMs on private data that never leaves each client. PEFT and resource-efficient operators cut bandwidth and memory needs, which lowers cost, and the FedOT workflow protects the model owner's IP when the full model must stay closed.
Summary TLDR
This paper releases FS-LLM, an open-source system for fine-tuning large language models (LLMs) in federated learning (FL). FS-LLM bundles (1) LLM-BENCHMARKS: curated federated datasets and evaluation tasks, (2) LLM-ALGZOO: implementations of parameter-efficient fine-tuning (PEFT) methods (LoRA, P-tuning, prompt tuning) and a closed-model workflow (FedOT), and (3) LLM-TRAINER: hookable training operators (mixed precision, ZeRO, quantization, compression, offload) to reduce GPU and communication costs. Experiments (LLaMA-7B, OPT-2.7B) show LoRA is the strongest PEFT in FL, federated training beats local training, PEFT cuts communication from ~28GB to MBs, and FedOT works at light compression (
Problem Statement
Fine-tuning LLMs improves performance on domain tasks but runs into privacy and cost constraints: clients cannot share data, full-parameter tuning is expensive in both bandwidth and memory, and some LLMs are closed-source, so clients cannot access the full model. Existing FL frameworks lack benchmarks, PEFT support, and resource-efficient operators tailored for federated LLM fine-tuning.
Main Contribution
FS-LLM: an open-source package that bundles benchmarking data, PEFT algorithms, and training operators for federated LLM fine-tuning.
LLM-BENCHMARKS: curated federated versions of common corpora (Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3) plus evaluation pipelines (HumanEval, HELM, GSM8K-test).
LLM-ALGZOO: federated implementations of PEFT methods (LoRA, P-tuning, prompt tuning) and a FedOT workflow for closed-source models.
LLM-TRAINER: hookable acceleration and resource operators (mixed precision, DeepSpeed/ZeRO, quantization, streaming, compression, CPU offload) and multi-mode runs (simulated, distributed, clustered).
Extensive reproducible experiments and public code: https://github.com/alibaba/FederatedScope/tree/llm.
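The idea behind federated PEFT, where clients exchange only adapter parameters rather than the full model, can be sketched as a sample-weighted average (FedAvg) over adapter weights. This is a minimal illustration and not FS-LLM's API: the function name `fedavg_adapters`, the dict-of-lists layout, and the parameter names `lora_A` and `lora_B` are assumptions made for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: parameter name -> flat list of weights.
Adapter = Dict[str, List[float]]


def fedavg_adapters(adapters: Sequence[Adapter], num_samples: Sequence[int]) -> Adapter:
    """Average client adapters, weighting each client by its local sample count."""
    if not adapters or len(adapters) != len(num_samples):
        raise ValueError("need one sample count per client adapter")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    merged: Adapter = {}
    for name in adapters[0]:
        acc = [0.0] * len(adapters[0][name])
        for adapter, n in zip(adapters, num_samples):
            weight = n / total
            for i, value in enumerate(adapter[name]):
                acc[i] += weight * value
        merged[name] = acc
    return merged


# Two toy clients; client_b holds three times as much data as client_a.
client_a = {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]}
client_b = {"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0]}
global_adapter = fedavg_adapters([client_a, client_b], num_samples=[1, 3])
```

Only the small adapter dictionaries cross the network in this scheme; the frozen base model stays on each client. Note that averaging the two LoRA factors separately is not the same as averaging their product, which is a known simplification of this kind of aggregation.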
Key Findings
LoRA gives the strongest PEFT results across domains in FL.
Federated (collaborative) fine-tuning outperforms local-only fine-tuning.
PEFT cuts per-round communication from tens of gigabytes (about 28 GB) to megabytes.
Closed-model fine-tuning (FedOT) can work but is sensitive to compression.
Compute heterogeneity affects wall-clock progress: an A100 completes the same PEFT step roughly 2x faster than a V100.
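The gap between gigabyte-scale and megabyte-scale messages can be checked with back-of-envelope arithmetic. The sketch below assumes a nominal 7 billion parameters stored in fp32 (4 bytes each) for the full model, and a LoRA configuration of rank 8 applied to two square projection matrices per layer. The hidden size of 4096 and the 32 layers match LLaMA-7B's published configuration; the rank and the choice of two matrices per layer are illustrative assumptions, not settings taken from the paper.

```python
def full_model_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Size of a full-parameter update when every weight is transmitted."""
    return num_params * bytes_per_param


def lora_bytes(hidden_size: int, num_layers: int, rank: int,
               matrices_per_layer: int, bytes_per_param: int = 4) -> int:
    """Size of a LoRA update for square (hidden x hidden) projections.

    Each adapted matrix adds two low-rank factors, A (rank x hidden) and
    B (hidden x rank), so it contributes 2 * rank * hidden_size parameters.
    """
    params_per_matrix = 2 * rank * hidden_size
    return params_per_matrix * matrices_per_layer * num_layers * bytes_per_param


# Assumed settings (see the paragraph above): nominal 7e9 params, fp32, rank 8.
full = full_model_bytes(7_000_000_000)
lora = lora_bytes(hidden_size=4096, num_layers=32, rank=8, matrices_per_layer=2)

print(f"full update: {full / 1e9:.1f} GB")
print(f"LoRA update: {lora / 1e6:.1f} MB")
print(f"reduction:   {full / lora:.0f}x")
```

Under these assumptions the full update is 28 GB and the adapter update is under 17 MB, a reduction of more than three orders of magnitude per round.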
Results
Pass@1 (HumanEval)
HELM average score
Adapter message size
FedOT effect of compression
Computation time per step
Who Should Care
What To Try In 7 Days
Clone the FS-LLM repo and run the Fed-CodeAlpaca benchmark on a single A100 to reproduce the paper's results.
Switch to LoRA adapters for your federated workflows and measure message size and wall-clock time.
Test FedOT with a lightly compressed emulator (≈20% layer drop) if you must protect a provider's closed-source LLM.
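To get a feel for what a "≈20% layer drop" emulator means, the sketch below picks which middle layers survive when layers are dropped at evenly spaced positions. It assumes an offsite-tuning-style split in which a few layers at each end of the network are kept as trainable adapters and only the middle block is compressed. The function name `emulator_layers`, the even-spacing rule, and the choice of two adapter layers per end are assumptions for illustration, not FS-LLM's implementation.

```python
from typing import List


def emulator_layers(num_layers: int, adapter_layers: int, drop_rate: float) -> List[int]:
    """Return indices of middle layers kept in the emulator after layer drop.

    The first and last `adapter_layers` layers are excluded (they act as the
    trainable adapters); the remaining middle layers are thinned to about
    (1 - drop_rate) of their original count at evenly spaced positions.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if adapter_layers < 0:
        raise ValueError("adapter_layers must be non-negative")

    middle = list(range(adapter_layers, num_layers - adapter_layers))
    if not middle:
        return []

    keep = max(1, round(len(middle) * (1.0 - drop_rate)))
    if keep == 1:
        return [middle[0]]

    # Evenly spaced picks that always include the first and last middle layer.
    step = (len(middle) - 1) / (keep - 1)
    return [middle[round(i * step)] for i in range(keep)]


# A 32-layer model with 2 adapter layers per end and a 20% drop rate.
kept = emulator_layers(num_layers=32, adapter_layers=2, drop_rate=0.2)
print(len(kept), "of 28 middle layers kept:", kept)
```

Raising `drop_rate` toward 0.5 in this sketch shows how quickly the emulator thins out, which is the regime where the paper reports degraded capabilities.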
Optimization Features
Infra Optimization
- CPU offloading
- multi-GPU parallelism
Model Optimization
- LoRA
- offsite-tuning (FedOT) emulator
System Optimization
- message streaming
- DEFLATE/Gzip compression
- quantization to 16/8-bit
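The message-level optimizations listed above can be combined: cast an update to 16-bit floats, then apply DEFLATE. The sketch below uses only the Python standard library (`struct` with the half-precision `e` format and `zlib`, which wraps a DEFLATE stream in a zlib container rather than a gzip one). The helper names `pack_update` and `unpack_update` are hypothetical and do not come from FS-LLM.

```python
import struct
import zlib
from typing import List


def pack_update(values: List[float], level: int = 6) -> bytes:
    """Quantize to little-endian float16, then DEFLATE-compress.

    struct raises OverflowError for magnitudes beyond the float16 range,
    so callers would need to clip or rescale such values first.
    """
    raw = struct.pack(f"<{len(values)}e", *values)
    return zlib.compress(raw, level)


def unpack_update(blob: bytes) -> List[float]:
    """Invert pack_update: decompress, then decode float16 values."""
    raw = zlib.decompress(blob)
    count = len(raw) // 2  # two bytes per float16 value
    return list(struct.unpack(f"<{count}e", raw))


# Toy update whose values are exactly representable in float16.
update = [0.5, -1.25, 0.0, 3.0] * 256
blob = pack_update(update)
restored = unpack_update(blob)

print("fp32 size:  ", 4 * len(update), "bytes")
print("packed size:", len(blob), "bytes")
```

The toy update is highly repetitive, so it compresses far better than real adapter weights would. The float16 cast is lossy for general values, which is the kind of precision loss the Failure Modes section warns about.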
Training Optimization
- mixed-precision
- gradient accumulation
- DeepSpeed ZeRO offload
- data parallelism
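Gradient accumulation is relevant here because the paper's experiments run at batch size 1: accumulating scaled gradients over several micro-batches before updating gives the same step as one larger batch, without the extra memory. The sketch below demonstrates this equivalence on a one-parameter least-squares model in plain Python. The model, data, and function names are illustrative assumptions and are unrelated to FS-LLM's trainer.

```python
from typing import List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def grad(w: float, x: float, y: float) -> float:
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def step_full_batch(w: float, batch: Batch, lr: float) -> float:
    """One SGD step using the mean gradient over the whole batch at once."""
    mean_grad = sum(grad(w, x, y) for x, y in batch) / len(batch)
    return w - lr * mean_grad


def step_accumulated(w: float, batch: Batch, lr: float) -> float:
    """One SGD step built from micro-batches of size 1.

    Each micro-batch gradient is scaled by 1 / accumulation_steps and summed;
    the weight is only updated after the last micro-batch.
    """
    accumulation_steps = len(batch)
    accumulated = 0.0
    for x, y in batch:
        accumulated += grad(w, x, y) / accumulation_steps
    return w - lr * accumulated


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_full = step_full_batch(0.0, data, lr=0.125)
w_accum = step_accumulated(0.0, data, lr=0.125)
print(w_full, w_accum)
```

The two results agree up to floating-point rounding. The equivalence holds for plain gradient averaging; layers whose behavior depends on batch statistics would not match exactly.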
Reproducibility
Data Urls
- Fed-CodeAlpaca (derived from CodeAlpaca)
- Fed-Dolly (derived from Databricks-dolly-15k)
- Fed-GSM8K-3 (derived from GSM8K)
- Alpaca, CleanedAlpaca
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use batch size 1 due to resource limits; results may change with larger batches.
- Prompt design and initialization affect results; the paper fixes prompts for fairness.
- FedOT trades model quality for compression: a high compression rate (50%) degrades capabilities.
When Not To Use
- When clients are extremely resource-constrained (no GPU), since PEFT still needs nontrivial compute.
- When you require full-model updates or bespoke architectures not supported by adapter hooks.
Failure Modes
- Over-compressing closed-model emulators causes large accuracy drops (FedOT at 50%).
- Validation loss may not predict final generalization, so hyperparameter optimization (HPO) driven by validation loss can fail.
- Low-precision accelerations can break some personalized FL algorithms (pFedMe) via precision loss.
Core Entities
Models
- LLaMA-7B
- OPT-2.7B
Metrics
- Pass@1
- HELM average (mixture of subtasks)
- Accuracy
- Message size (bytes)
- GPU memory (MB)
- Computation time per step (sec)
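For reference, Pass@1 is the k = 1 case of the pass@k metric used with HumanEval. The sketch below implements the standard unbiased estimator from the HumanEval paper, 1 - C(n - c, k) / C(n, k), where n completions are sampled per problem and c of them pass the unit tests. How many completions the FS-LLM experiments sample per problem is not stated in this summary, so the numbers in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that pass the tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: 10 samples for one problem, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of sampled completions that pass. A benchmark score is the mean of this value over all problems.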
Datasets
- Fed-CodeAlpaca
- Fed-Dolly
- Fed-GSM8K-3
- Fed-CodeSearchNet
- Alpaca
- CleanedAlpaca
Benchmarks
- HumanEval
- HELM
- GSM8K-test

