FS-LLM: an open toolbox to run, benchmark and speed up federated fine‑tuning of LLMs

September 1, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

9

Authors

Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou

Why It Matters For Business

FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.

Summary TLDR

This paper releases FS-LLM, an open-source system for fine-tuning large language models (LLMs) in federated learning (FL). FS-LLM bundles (1) LLM-BENCHMARKS: curated federated datasets and evaluation tasks, (2) LLM-ALGZOO: implementations of parameter-efficient fine-tuning (PEFT) methods (LoRA, P-tuning, prompt tuning) and a closed-model workflow (FedOT), and (3) LLM-TRAINER: hookable training operators (mixed precision, ZeRO, quantization, compression, offload) to reduce GPU and communication costs. Experiments (LLaMA-7B, OPT-2.7B) show LoRA is the strongest PEFT in FL, federated training beats local training, PEFT cuts communication from ~28GB to MBs, and FedOT works at light compression (

Problem Statement

Fine-tuning LLMs helps domain tasks but clashes with privacy and cost: clients cannot share data, full-parameter tuning is huge in bandwidth and memory, and some LLMs are closed-source so clients cannot access full models. Existing FL frameworks lack benchmarks, PEFT support, and resource-efficient operators tailored for federated LLM fine-tuning.

Main Contribution

FS-LLM: an open-source package that bundles benchmarking data, PEFT algorithms, and training operators for federated LLM fine-tuning.

LLM-BENCHMARKS: curated federated versions of common corpora (Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3) plus evaluation pipelines (HumanEval, HELM, GSM8K-test).
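This summary does not say how the federated versions of the corpora are built. The sketch below shows two common ways to turn one corpus into per-client shards: grouping by a metadata field, or dealing records out uniformly. Both helpers, the `lang` field, and the choice of three clients are hypothetical illustrations, not FS-LLM's actual splitters.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_key(records: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group records into one client per distinct value of `key` (hypothetical helper)."""
    clients: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        clients[record[key]].append(record)
    return dict(clients)


def split_uniform(records: List[dict], num_clients: int) -> List[List[dict]]:
    """Deal records round-robin so every client gets a near-equal share."""
    return [records[i::num_clients] for i in range(num_clients)]


# Toy corpus; the "lang" field is an assumed metadata attribute.
corpus = [
    {"id": 0, "lang": "python"},
    {"id": 1, "lang": "java"},
    {"id": 2, "lang": "python"},
    {"id": 3, "lang": "go"},
    {"id": 4, "lang": "java"},
    {"id": 5, "lang": "python"},
]

by_lang = split_by_key(corpus, "lang")
uniform = split_uniform(corpus, 3)
```

Grouping by an attribute produces clients with different data distributions, while the uniform split produces near-identical ones; which of the two a benchmark uses changes how hard the federated task is.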

LLM-ALGZOO: federated implementations of PEFT methods (LoRA, P-tuning, prompt tuning) and a FedOT workflow for closed-source models.

LLM-TRAINER: hookable acceleration and resource operators (mixed precision, DeepSpeed/ZeRO, quantization, streaming, compression, CPU offload) and multi-mode runs (simulated, distributed, clustered).
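To make "hookable operators" concrete, here is a minimal sketch of a trainer that fires user-registered callbacks at named points in the training loop. The class, the trigger names, and the two example hooks are hypothetical and only illustrate the pattern; they are not the FS-LLM API.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class HookableTrainer:
    """Toy trainer extended by registering hooks at named triggers (hypothetical API)."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def register_hook(self, trigger: str, hook: Callable[[dict], None]) -> None:
        self._hooks[trigger].append(hook)

    def _fire(self, trigger: str, ctx: dict) -> None:
        for hook in self._hooks[trigger]:
            hook(ctx)

    def train(self, batches: List[List[float]]) -> dict:
        ctx = {"log": [], "dtype": "fp32", "steps": 0}
        self._fire("on_fit_start", ctx)
        for batch in batches:
            ctx["batch"] = batch
            self._fire("on_batch_start", ctx)
            ctx["steps"] += 1  # stand-in for forward, backward and optimizer step
            self._fire("on_batch_end", ctx)
        self._fire("on_fit_end", ctx)
        return ctx


def enable_mixed_precision(ctx: dict) -> None:
    # Stand-in for an operator that switches the run to half precision.
    ctx["dtype"] = "fp16"
    ctx["log"].append("mixed precision on")


def log_step(ctx: dict) -> None:
    ctx["log"].append(f"step {ctx['steps']} done")


trainer = HookableTrainer()
trainer.register_hook("on_fit_start", enable_mixed_precision)
trainer.register_hook("on_batch_end", log_step)
result = trainer.train([[1.0, 2.0], [3.0]])
```

The appeal of this design is that an operator such as offloading or compression can be switched on by registering one more hook, without touching the training loop itself.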

Extensive reproducible experiments and public code: https://github.com/alibaba/FederatedScope/tree/llm.

Key Findings

LoRA gives the strongest PEFT results across domains in FL.

Numbers: Fed LLaMA-7B, LoRA 13.29% vs. P-tuning 9.71% and prompt tuning 9.63% on Fed-CodeAlpaca (Pass@1)

Federated (collaborative) fine-tuning outperforms local-only fine-tuning.

Numbers: LLaMA-7B LoRA, Fed 13.29% vs. Local 10.99% on Fed-CodeAlpaca (Pass@1)
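The collaborative step behind a result like this can be sketched as a sample-weighted average of the clients' adapter weights, in the style of FedAvg. That FS-LLM aggregates LoRA adapters exactly this way is an assumption; the parameter names and sample counts below are made up for illustration.

```python
from typing import Dict, List, Tuple

Adapter = Dict[str, List[float]]  # parameter name -> flattened values


def fedavg_adapters(updates: List[Tuple[int, Adapter]]) -> Adapter:
    """Average client adapters, weighting each client by its local sample count."""
    total = sum(n for n, _ in updates)
    reference = updates[0][1]
    averaged: Adapter = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(n * adapter[name][i] for n, adapter in updates) / total
            for i in range(len(values))
        ]
    return averaged


# Two hypothetical clients holding 100 and 300 local samples.
client_a = (100, {"layer0.lora_A": [1.0, 2.0], "layer0.lora_B": [0.0, 0.0]})
client_b = (300, {"layer0.lora_A": [3.0, 6.0], "layer0.lora_B": [4.0, 8.0]})

global_adapter = fedavg_adapters([client_a, client_b])
```

Only the adapter tensors travel between client and server in this scheme; the frozen base model never leaves the client. Note that averaging the two LoRA factors separately is not the same as averaging their product, which is a known approximation in this style of aggregation.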

PEFT cuts communication from tens of GB to MB per round.

Numbers: Full-parameter LLaMA-7B ~28 GB/round vs. LoRA adapter 21.4 MB (message size)
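A quick back-of-the-envelope check of these sizes: the ~28 GB figure is consistent with roughly 7 billion parameters stored at 4 bytes each, which is an assumption about how the number arises. The 21.4 MB adapter size is taken from the text as reported, and the 100 Mbit/s uplink is a hypothetical value chosen only to make the difference tangible.

```python
def message_size_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Size of a dense parameter upload, ignoring serialization overhead."""
    return num_params * bytes_per_param


def upload_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Ideal transfer time at a fixed bandwidth, with no latency or protocol overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Assumption: nominal 7e9 parameters in fp32 (4 bytes each).
full_bytes = message_size_bytes(7_000_000_000)

# Reported LoRA message size, read as 21.4 * 10^6 bytes.
lora_bytes = 21.4e6

reduction = full_bytes / lora_bytes

# Hypothetical 100 Mbit/s client uplink.
full_upload_s = upload_seconds(full_bytes, 100e6)
lora_upload_s = upload_seconds(lora_bytes, 100e6)

print(f"full model: {full_bytes / 1e9:.1f} GB, about {full_upload_s / 60:.0f} min per upload")
print(f"LoRA adapter: {lora_bytes / 1e6:.1f} MB, about {lora_upload_s:.1f} s per upload")
print(f"reduction factor: about {reduction:.0f}x")
```

Under these assumptions the adapter is more than a thousand times smaller than the full model, which is what turns a per-round upload from tens of minutes into seconds.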

Closed-model fine-tuning (FedOT) can work but is sensitive to compression.

Numbers: FedOT (20% drop) Fed-Dolly 44.88% vs. LocalOT 38.45%; FedOT (50% drop) Fed-Dolly 37.01%

Compute heterogeneity affects wall-clock progress: an A100 completes the same PEFT step roughly 2x faster than a V100.

Numbers: LLaMA-7B LoRA compute time, A100 0.16s vs. V100 0.33s per step
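To see why this matters, consider a synchronous round in which the server waits for every client. The sketch below uses the reported per-step times; the 30 local steps per round and the decision to ignore communication time are assumptions made for illustration.

```python
from typing import Dict


def round_time(step_times: Dict[str, float], local_steps: int) -> float:
    """A synchronous round finishes when the slowest client finishes."""
    return max(t * local_steps for t in step_times.values())


def idle_time(step_times: Dict[str, float], local_steps: int) -> Dict[str, float]:
    """Seconds each client spends waiting for the slowest one."""
    total = round_time(step_times, local_steps)
    return {name: total - t * local_steps for name, t in step_times.items()}


# Per-step compute times reported in the text (seconds).
step_times = {"client_a100": 0.16, "client_v100": 0.33}

LOCAL_STEPS = 30  # hypothetical number of local steps per round

sync_round = round_time(step_times, LOCAL_STEPS)
waiting = idle_time(step_times, LOCAL_STEPS)
```

With these numbers the A100 client sits idle for about half of every round, so mixed fleets progress at the pace of the slowest GPU unless local steps are balanced or aggregation is made asynchronous.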

Results

Pass@1 (HumanEval)

Value: LoRA Fed 13.29% ±0.10

Baseline: Local LoRA 10.99% ±0.77
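For readers unfamiliar with the metric, Pass@1 is the k = 1 case of the pass@k estimator commonly used with HumanEval: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. How many samples per problem the FS-LLM evaluation draws is not stated here, so the counts below are made up.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes: (samples drawn, samples passing the tests).
toy_results = [(10, 3), (10, 0), (10, 10), (10, 1)]

score = benchmark_pass_at_k(toy_results, k=1)
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so a Pass@1 of 13.29% means that about one generated solution in eight passes its tests.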

HELM average score

Value: LoRA Fed 46.57% ±0.24

Baseline: Prompt tuning Fed 40.72% ±0.64

Adapter message size

Value: LoRA 21.40 MB

Baseline: Full-parameter LLaMA upload ~28 GB

FedOT effect of compression

Value: FedOT (20% drop) Fed-Dolly 44.88% ±0.75; FedOT (50% drop) Fed-Dolly 37.01% ±2.34

Baseline: LocalOT (20% drop) 38.45% ±9.57

Computation time per step

Value: A100 0.16s ±0.02; V100 0.33s ±0.07

Baseline: same PEFT algorithm

Who Should Care

What To Try In 7 Days

Clone the FS-LLM repo and run the Fed-CodeAlpaca benchmark on a single A100 to reproduce the results.

Switch to LoRA adapters for your federated workflows and measure message size and wall-clock time.

Test FedOT with a light emulator (≈20% layer drop) if you must protect a provider's closed LLM.

Optimization Features

Infra Optimization

  • CPU offloading
  • multi-GPU parallelism

Model Optimization

  • LoRA
  • offsite-tuning (FedOT) emulator
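The emulator idea can be sketched as follows: keep a few trainable layers at each end of the model as the adapter, and replace the frozen middle with a thinned-out copy by dropping a fraction of its layers. The choice of two adapter layers per side, the even spacing of kept layers, and the function name are assumptions for illustration; the text only gives the drop rates (about 20% and 50%).

```python
from typing import List


def emulator_layers(num_layers: int, adapter_each_side: int, drop_rate: float) -> List[int]:
    """Indices of middle layers kept in the emulator after uniform layer drop."""
    middle = list(range(adapter_each_side, num_layers - adapter_each_side))
    keep = round(len(middle) * (1 - drop_rate))
    if keep <= 0:
        return []
    if keep == 1:
        return [middle[0]]
    # Evenly spaced picks that always retain the first and last middle layer.
    step = (len(middle) - 1) / (keep - 1)
    return [middle[int(i * step + 0.5)] for i in range(keep)]


# Hypothetical 32-layer model with 2 trainable adapter layers at each end.
light = emulator_layers(num_layers=32, adapter_each_side=2, drop_rate=0.2)
heavy = emulator_layers(num_layers=32, adapter_each_side=2, drop_rate=0.5)

print(f"20% drop keeps {len(light)} of 28 middle layers")
print(f"50% drop keeps {len(heavy)} of 28 middle layers")
```

The client then trains only the adapter layers against this smaller emulator, so the provider never ships the full middle of the model. The trade-off reported above follows directly: the more layers are dropped, the less faithfully the emulator stands in for the original.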

System Optimization

  • message streaming
  • DEFLATE/Gzip compression
  • quantization to 16/8-bit
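The effect of these message-level operators can be tried with the Python standard library alone. The sketch below serializes a made-up update in 32-bit floats, in 16-bit floats, and with DEFLATE applied via `zlib`; the random values stand in for real adapter weights, and none of this is FS-LLM's own serialization code.

```python
import random
import struct
import zlib

random.seed(0)

# Hypothetical update: 10,000 small weights, as an adapter delta might contain.
values = [random.gauss(0.0, 0.02) for _ in range(10_000)]
count = len(values)

# Baseline message: little-endian 32-bit floats.
fp32_msg = struct.pack(f"<{count}f", *values)

# Quantized message: 16-bit half-precision floats (struct format "e").
fp16_msg = struct.pack(f"<{count}e", *values)

# Lossless DEFLATE compression on top of the fp32 message.
deflated_msg = zlib.compress(fp32_msg, 6)

# Receiver side: undo each transformation.
restored_fp16 = struct.unpack(f"<{count}e", fp16_msg)
restored_fp32_bytes = zlib.decompress(deflated_msg)

max_quant_error = max(abs(a - b) for a, b in zip(values, restored_fp16))

print(f"fp32: {len(fp32_msg)} bytes")
print(f"fp16: {len(fp16_msg)} bytes, max abs error {max_quant_error:.2e}")
print(f"fp32 + DEFLATE: {len(deflated_msg)} bytes")
```

Quantization halves the message at the cost of a small rounding error, while DEFLATE is lossless but its gain depends heavily on how repetitive the bytes are; dense floating-point weights usually compress far less than text does.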

Training Optimization

  • mixed-precision
  • gradient accumulation
  • DeepSpeed ZeRO offload
  • data parallelism
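Gradient accumulation is the item here that most directly addresses small per-step batches. As a minimal sketch on a one-parameter least-squares model (chosen for illustration, not taken from the paper), accumulating scaled gradients from batches of size 1 reproduces the gradient of the full batch:

```python
from typing import List


def grad_mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One large batch of 4 examples.
full_batch_grad = grad_mse(w, xs, ys)

# Four micro-batches of size 1, each scaled by 1/num_micro_batches and summed.
num_micro_batches = len(xs)
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_mse(w, [x], [y]) / num_micro_batches

# A single optimizer step is taken only after all micro-batches are processed.
learning_rate = 0.01
w_after_step = w - learning_rate * accumulated_grad
```

Only one micro-batch has to fit in GPU memory at a time, so the effective batch size can grow without more memory, at the price of more forward and backward passes per update.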

Reproducibility

Data Urls

  • Fed-CodeAlpaca (derived from CodeAlpaca)
  • Fed-Dolly (derived from Databricks-dolly-15k)
  • Fed-GSM8K-3 (derived from GSM8K)
  • Alpaca, CleanedAlpaca

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use batch size 1 due to resource limits; results may change with larger batches.
  • Prompt design and initialization affect results; the paper fixes prompts for fairness.
  • FedOT trades compression for performance; high compression (50%) degrades capabilities.

When Not To Use

  • When clients are extremely resource-constrained (no GPU), since PEFT still needs nontrivial compute.
  • When you require full-model updates or bespoke architectures not supported by adapter hooks.

Failure Modes

  • Over-compressing closed-model emulators causes large accuracy drops (FedOT at 50%).
  • Validation loss may not predict final generalization, so HPO driven by validation loss can fail.
  • Low-precision accelerations can break some personalized FL algorithms (pFedMe) via precision loss.

Core Entities

Models

  • LLaMA-7B
  • OPT-2.7B

Metrics

  • Pass@1
  • HELM average (mixture of subtasks)
  • Accuracy
  • Message size (bytes)
  • GPU memory (MB)
  • Computation time per step (sec)

Datasets

  • Fed-CodeAlpaca
  • Fed-Dolly
  • Fed-GSM8K-3
  • Fed-CodeSearchNet
  • Alpaca
  • CleanedAlpaca

Benchmarks

  • HumanEval
  • HELM
  • GSM8K-test