Overview
FS-LLM is a usable system with reproducible experiments and clear engineering choices; evidence is solid for the studied models and settings but limited by batch size, hardware homogeneity, and chosen datasets.
Citations: 9
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.
Summary TLDR
This paper releases FS-LLM, an open-source system for fine-tuning large language models (LLMs) in federated learning (FL). FS-LLM bundles (1) LLM-BENCHMARKS: curated federated datasets and evaluation tasks, (2) LLM-ALGZOO: implementations of parameter-efficient fine-tuning (PEFT) methods (LoRA, P-tuning, prompt tuning) and a closed-model workflow (FedOT), and (3) LLM-TRAINER: hookable training operators (mixed precision, ZeRO, quantization, compression, offload) to reduce GPU and communication costs. Experiments (LLaMA-7B, OPT-2.7B) show LoRA is the strongest PEFT in FL, federated training beats local training, PEFT cuts communication from ~28GB to MBs, and FedOT works at light compression (
Problem Statement
Fine-tuning LLMs helps domain tasks but clashes with privacy and cost: clients cannot share data, full-parameter tuning is huge in bandwidth and memory, and some LLMs are closed-source so clients cannot access full models. Existing FL frameworks lack benchmarks, PEFT support, and resource-efficient operators tailored for federated LLM fine-tuning.
Main Contribution
FS-LLM: an open-source package that bundles benchmarking data, PEFT algorithms, and training operators for federated LLM fine-tuning.
LLM-BENCHMARKS: curated federated versions of common corpora (Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3) plus evaluation pipelines (HumanEval, HELM, GSM8K-test).
Key Findings
LoRA gives the strongest PEFT results across domains in FL.
Federated (collaborative) fine-tuning outperforms local-only fine-tuning.
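A minimal sketch of what collaborative fine-tuning with adapters can look like on the server side. It assumes a plain sample-weighted average (FedAvg-style) over adapter weights only; this summary does not state which aggregation rule the paper uses. The function name `fedavg_adapters` and the parameter names `lora_A` and `lora_B` are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened adapter weights.
Adapter = Dict[str, List[float]]


def fedavg_adapters(client_adapters: List[Adapter], num_samples: List[int]) -> Adapter:
    """Sample-weighted average of client adapter weights (FedAvg-style assumption)."""
    if not client_adapters or len(client_adapters) != len(num_samples):
        raise ValueError("need one sample count per client adapter")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    merged: Adapter = {}
    for name in client_adapters[0]:
        size = len(client_adapters[0][name])
        # Average each coordinate, weighting every client by its sample count.
        merged[name] = [
            sum(adapter[name][i] * n for adapter, n in zip(client_adapters, num_samples)) / total
            for i in range(size)
        ]
    return merged


# Two toy clients; the second holds three times as much data as the first.
clients = [
    {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]},
    {"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0]},
]
global_adapter = fedavg_adapters(clients, num_samples=[1, 3])
print(global_adapter)
```

Only the small adapter tensors travel between clients and server; the frozen base model stays local. Note that averaging the two LoRA factors separately is not, in general, the same as averaging their product.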
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (HumanEval) | LoRA Fed: 13.29% ±0.10 | Local LoRA: 10.99% ±0.77 | +2.30 pp | Fed-CodeAlpaca | Federated LoRA outperforms local; Table 2 | Table 2 |
| HELM average score | LoRA Fed: 46.57% ±0.24 | Prompt tuning Fed: 40.72% ±0.64 | +5.85 pp | Fed-Dolly | LoRA yields higher generic language scores in FL; Table 2 | Table 2 |
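The Delta column can be re-derived from the Value and Baseline columns. The short check below uses only the numbers in the table; the helper name `delta_pp` is illustrative.

```python
# (metric, value %, baseline %) copied from the Results table.
rows = [
    ("Pass@1 (HumanEval)", 13.29, 10.99),
    ("HELM average score", 46.57, 40.72),
]


def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


deltas = {metric: delta_pp(value, baseline) for metric, value, baseline in rows}
for metric, delta in deltas.items():
    print(f"{metric}: +{delta:.2f} pp")
```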
What To Try In 7 Days
Clone the FS-LLM repo and run the Fed-CodeAlpaca benchmark on a single A100 to reproduce the reported results.
Switch to LoRA adapters for your federated workflows and measure message size and wall-clock time.
Test FedOT with a light emulator (≈20% layer drop) if you must protect a provider's closed LLM.
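Before measuring message size, a back-of-the-envelope estimate helps set expectations. The sketch below assumes fp32 weights, a nominal 7 billion parameters, and LoRA with rank 8 on two projection matrices per layer; the rank and the choice of matrices are assumptions, not settings taken from the paper. The hidden size (4096) and layer count (32) are those of LLaMA-7B.

```python
BYTES_PER_FP32 = 4


def full_model_bytes(n_params: int) -> int:
    """Payload when every parameter is sent as fp32."""
    return n_params * BYTES_PER_FP32


def lora_param_count(hidden: int, rank: int, n_layers: int, matrices_per_layer: int) -> int:
    """Each adapted square matrix adds A (rank x hidden) and B (hidden x rank)."""
    return n_layers * matrices_per_layer * 2 * hidden * rank


full_bytes = full_model_bytes(7_000_000_000)  # nominal 7B parameters
lora_bytes = lora_param_count(hidden=4096, rank=8, n_layers=32, matrices_per_layer=2) * BYTES_PER_FP32

print(f"full model: {full_bytes / 1e9:.1f} GB per message")
print(f"LoRA only:  {lora_bytes / 1e6:.1f} MB per message")
```

Under these assumptions the full-model payload matches the ~28GB figure quoted in the summary, while the adapter payload is in the tens of megabytes. Real messages will differ with rank, target modules, precision, and serialization overhead.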
Risks & Boundaries
Limitations
Experiments use batch size 1 due to resource limits; results may change with larger batches.
Prompt design and initialization affect results; the paper fixes prompts for fairness.
When Not To Use
When clients are extremely resource-constrained (no GPU): PEFT still needs nontrivial compute.
When you require full-model updates or bespoke architectures not supported by adapter hooks.
Failure Modes
Over-compressing closed-model emulators causes large accuracy drops (FedOT at a 50% layer-drop rate).
Validation loss may not predict final generalization, so hyperparameter optimization (HPO) driven by validation loss can fail.
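To make the FedOT compression trade-off concrete, the sketch below picks which transformer layers an emulator would keep at a given drop rate. It assumes evenly spaced layer dropping that always keeps the first and last layers; the paper's actual emulator construction may differ, and the function name `kept_layers` is hypothetical.

```python
from typing import List


def kept_layers(n_layers: int, drop_rate: float) -> List[int]:
    """Indices of layers kept under evenly spaced layer dropping (an assumption)."""
    if n_layers < 2:
        raise ValueError("need at least two layers")
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    # Always keep at least the first and last layer.
    n_keep = max(2, round(n_layers * (1.0 - drop_rate)))
    step = (n_layers - 1) / (n_keep - 1)
    # Evenly spaced positions from layer 0 to layer n_layers - 1.
    return sorted({round(i * step) for i in range(n_keep)})


light = kept_layers(32, 0.2)  # light emulator
heavy = kept_layers(32, 0.5)  # heavily compressed emulator
print(len(light), "layers kept at 20% drop")
print(len(heavy), "layers kept at 50% drop")
```

For a 32-layer model, a 50% drop rate leaves half of the layers to approximate the full model, which is consistent with the large accuracy drop listed above.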

