FS-LLM: an open toolbox to run, benchmark and speed up federated fine‑tuning of LLMs

September 1, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

FS-LLM is a usable system with reproducible experiments and clear engineering choices; evidence is solid for the studied models and settings but limited by batch size, hardware homogeneity, and chosen datasets.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.

Who Should Care

Summary TLDR

This paper releases FS-LLM, an open-source system for fine-tuning large language models (LLMs) in federated learning (FL). FS-LLM bundles (1) LLM-BENCHMARKS: curated federated datasets and evaluation tasks, (2) LLM-ALGZOO: implementations of parameter-efficient fine-tuning (PEFT) methods (LoRA, P-tuning, prompt tuning) and a closed-model workflow (FedOT), and (3) LLM-TRAINER: hookable training operators (mixed precision, ZeRO, quantization, compression, offload) to reduce GPU and communication costs. Experiments (LLaMA-7B, OPT-2.7B) show LoRA is the strongest PEFT in FL, federated training beats local training, PEFT cuts communication from ~28GB to MBs, and FedOT works at light compression (

Problem Statement

Fine-tuning LLMs helps domain tasks but clashes with privacy and cost: clients cannot share data, full-parameter tuning is huge in bandwidth and memory, and some LLMs are closed-source so clients cannot access full models. Existing FL frameworks lack benchmarks, PEFT support, and resource-efficient operators tailored for federated LLM fine-tuning.
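The bandwidth gap between full-parameter tuning and PEFT can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes roughly 7 billion parameters stored in fp32, and a hypothetical LoRA setup (rank 8 on two projection matrices per layer) that is not taken from the paper. LLaMA-7B's hidden size of 4096 and its 32 layers are the only model facts used.

```python
# Back-of-envelope message sizes per communication round.
# The LoRA settings (rank, adapted matrices) are assumptions, not the paper's config.
BYTES_FP32 = 4


def full_model_bytes(num_params, bytes_per_param=BYTES_FP32):
    """Size of one full-parameter update."""
    return num_params * bytes_per_param


def lora_param_count(hidden_size, rank, num_layers, matrices_per_layer):
    """Each adapted d x d matrix gets two low-rank factors: A (r x d) and B (d x r)."""
    return 2 * hidden_size * rank * matrices_per_layer * num_layers


full = full_model_bytes(7_000_000_000)  # ~7B parameters, approximated
adapter = lora_param_count(
    hidden_size=4096, rank=8, num_layers=32, matrices_per_layer=2
) * BYTES_FP32

print(f"full model update: {full / 1e9:.1f} GB")
print(f"LoRA adapter update: {adapter / 1e6:.1f} MB")
```

Under these assumptions the full update is about 28 GB while the adapter is a few tens of MB, which is consistent with the "~28GB to MBs" reduction quoted in the summary.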

Main Contribution

FS-LLM: an open-source package that bundles benchmarking data, PEFT algorithms, and training operators for federated LLM fine-tuning.

LLM-BENCHMARKS: curated federated versions of common corpora (Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3) plus evaluation pipelines (HumanEval, HELM, GSM8K-test).

Key Findings

LoRA gives the strongest PEFT results across domains in FL.

Numbers: Fed LLaMA-7B: LoRA 13.29% vs P-tuning 9.71% and prompt tuning 9.63% on Fed-CodeAlpaca (Pass@1)

Practical Use: If you fine-tune LLMs in a federated setting, make LoRA the default PEFT method; it gives better accuracy on code, language and CoT tasks.

Evidence Ref: Table 2

Federated (collaborative) fine-tuning outperforms local-only fine-tuning.

Numbers: LLaMA-7B LoRA: Fed 13.29% vs Local 10.99% on Fed-CodeAlpaca (Pass@1)

Practical Use: Encourage clients to join FL: collaborative PEFT offers measurable gains over isolated local fine-tuning.

Evidence Ref: Table 2
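To make the federated-versus-local comparison concrete, here is a minimal sketch of how a server might merge client adapters each round. It assumes plain FedAvg-style averaging weighted by each client's sample count; the function name, the parameter names (`lora_A`, `lora_B`) and the toy values are hypothetical and are not FS-LLM's actual API.

```python
def fedavg(client_updates):
    """Weighted average of adapter parameters.

    client_updates: list of (num_samples, {param_name: [floats]}) tuples.
    Returns one merged {param_name: [floats]} dict.
    """
    total = sum(num_samples for num_samples, _ in client_updates)
    reference = client_updates[0][1]
    merged = {}
    for name, values in reference.items():
        merged[name] = [
            sum(n * params[name][i] for n, params in client_updates) / total
            for i in range(len(values))
        ]
    return merged


# Two toy clients; the client with more data pulls the average toward its values.
updates = [
    (100, {"lora_A": [1.0, 2.0], "lora_B": [0.0, 0.0]}),
    (300, {"lora_A": [3.0, 6.0], "lora_B": [4.0, 8.0]}),
]
global_adapter = fedavg(updates)
print(global_adapter)
```

Only the small adapter tensors are exchanged and averaged here, while the frozen base model stays on each client. Note that averaging the two LoRA factors separately is a simplification and is not the same as averaging their product.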

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (HumanEval) | LoRA Fed: 13.29% ±0.10 | Local LoRA: 10.99% ±0.77 | +2.30 pp | Fed-CodeAlpaca | Federated LoRA outperforms local; Table 2 | Table 2
HELM average score | LoRA Fed: 46.57% ±0.24 | Prompt tuning Fed: 40.72% ±0.64 | +5.85 pp | Fed-Dolly | LoRA yields higher generic language scores in FL; Table 2 | Table 2
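The Delta column is an absolute difference in percentage points (pp), not a relative change. A short sketch of both calculations, using only the values from the rows above (the helper names are illustrative):

```python
def delta_pp(value, baseline):
    """Absolute difference in percentage points."""
    return round(value - baseline, 2)


def relative_gain(value, baseline):
    """Relative improvement over the baseline, as a fraction."""
    return (value - baseline) / baseline


rows = {
    "Pass@1 (HumanEval), Fed-CodeAlpaca": (13.29, 10.99),
    "HELM average, Fed-Dolly": (46.57, 40.72),
}
for name, (value, baseline) in rows.items():
    print(f"{name}: +{delta_pp(value, baseline):.2f} pp, "
          f"{relative_gain(value, baseline):.1%} relative")
```

Because the two baselines differ in scale, the relative gain is a useful companion when comparing the rows.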

What To Try In 7 Days

Clone the FS-LLM repo and run the Fed-CodeAlpaca benchmark on a single A100 to reproduce the results.

Switch to LoRA adapters for your federated workflows and measure message size and wall-clock time.

Test FedOT with a light emulator (≈20% layer drop) if you must protect a closed LLM provider model.
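For the FedOT item, the following sketch shows one way a layer-drop emulator could be planned. It assumes the offsite-tuning idea of keeping a few bottom and top layers as trainable adapters and uniformly dropping a fraction of the frozen middle layers. The function name, the choice of 2 adapter layers per end and the uniform spacing rule are assumptions for illustration, not FS-LLM's implementation.

```python
def split_layers(num_layers, num_adapter_layers, drop_rate):
    """Return (adapter_layer_indices, emulator_layer_indices).

    Adapter layers: the first and last `num_adapter_layers` layers (trainable).
    Emulator layers: middle layers kept after dropping ~`drop_rate` of them,
    chosen at evenly spaced positions so the first and last middle layers survive.
    """
    bottom = list(range(num_adapter_layers))
    top = list(range(num_layers - num_adapter_layers, num_layers))
    middle = list(range(num_adapter_layers, num_layers - num_adapter_layers))

    keep = max(1, round(len(middle) * (1 - drop_rate)))
    if keep <= 1 or len(middle) <= 1:
        kept = middle[:1]
    else:
        # Integer arithmetic keeps the selected positions strictly increasing.
        kept = [middle[i * (len(middle) - 1) // (keep - 1)] for i in range(keep)]
    return bottom + top, kept


adapters, emulator = split_layers(num_layers=32, num_adapter_layers=2, drop_rate=0.2)
print(adapters)       # [0, 1, 30, 31]
print(len(emulator))  # 22 of the 28 middle layers are kept
```

Raising `drop_rate` toward 0.5 shrinks the emulator further, which is the regime where this summary reports large accuracy drops.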

Optimization Features

Infra Optimization
CPU offloading, multi-GPU parallelism
Model Optimization
LoRA, offsite-tuning (FedOT) emulator
System Optimization
message streaming, DEFLATE/Gzip compression, quantization to 16/8-bit
Training Optimization
mixed-precision, gradient accumulation, DeepSpeed ZeRO offload, data parallelism
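The message-side operators listed above (quantization plus DEFLATE compression) can be sketched with the Python standard library. This is an illustrative example under stated assumptions, not FS-LLM's operator code: it uses simple uniform 8-bit quantization with a per-message offset and scale, and `zlib`, which wraps DEFLATE in the zlib container rather than the Gzip one. All function names are hypothetical.

```python
import struct
import zlib


def quantize_uint8(values):
    """Uniformly map floats onto 0..255; returns (codes, offset, scale)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for constant inputs
    codes = bytes(min(255, max(0, round((v - lo) / scale))) for v in values)
    return codes, lo, scale


def dequantize_uint8(codes, lo, scale):
    """Approximate inverse of quantize_uint8."""
    return [lo + c * scale for c in codes]


def pack_message(values):
    """Quantize to 8 bits, prepend offset and scale, then DEFLATE-compress."""
    codes, lo, scale = quantize_uint8(values)
    return zlib.compress(struct.pack("<dd", lo, scale) + codes)


def unpack_message(blob):
    """Decompress and dequantize a message produced by pack_message."""
    raw = zlib.decompress(blob)
    lo, scale = struct.unpack("<dd", raw[:16])
    return dequantize_uint8(raw[16:], lo, scale)


weights = [i / 100 - 0.5 for i in range(100)]          # toy adapter weights
raw_fp32 = struct.pack(f"<{len(weights)}f", *weights)  # uncompressed fp32 payload
blob = pack_message(weights)
restored = unpack_message(blob)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(raw_fp32), len(blob), max_err)
```

The trade-off is visible in the output: the message shrinks, and the reconstruction error is bounded by half a quantization step.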

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Fed-CodeAlpaca (derived from CodeAlpaca), Fed-Dolly (derived from Databricks-dolly-15k), Fed-GSM8K-3 (derived from GSM8K), Alpaca, CleanedAlpaca

Risks & Boundaries

Limitations

Experiments use batch size 1 due to resource limits; results may change with larger batches.

Prompt design and initialization affect results; the paper fixes prompts for fairness.

When Not To Use

When clients are extremely resource-constrained (no GPU) — PEFT still needs nontrivial compute.

When you require full-model updates or bespoke architectures not supported by adapter hooks.

Failure Modes

Over-compressing closed-model emulators causes large accuracy drops (FedOT at 50%).

Validation loss may not predict final generalization — HPO driven by validation loss can fail.

Core Entities

Models

LLaMA-7B, OPT-2.7B

Metrics

Pass@1, HELM average (mixture of subtasks), Accuracy, Message size (bytes), GPU memory (MB), Computation time per step (sec)
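For readers unfamiliar with Pass@1, the sketch below computes the commonly used unbiased pass@k estimator introduced alongside HumanEval: given n generated samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper's pipeline uses exactly this estimator is an assumption here; for k = 1 it reduces to the fraction of passing samples.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy example: three problems with 10 samples each.
print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```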

Datasets

Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3, Fed-CodeSearchNet, Alpaca, CleanedAlpaca

Benchmarks

HumanEval, HELM, GSM8K-test