FS-LLM: an open toolbox to run, benchmark and speed up federated fine‑tuning of LLMs

Overview

Decision Snapshot: Needs Validation

FS-LLM is a usable system with reproducible experiments and clear engineering choices; evidence is solid for the studied models and settings but limited by batch size, hardware homogeneity, and chosen datasets.

Citations: 9

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Weirui Kuang, Bingchen Qian, Zitao Li, Daoyuan Chen, Dawei Gao, Xuchen Pan, Yuexiang Xie, Yaliang Li, Bolin Ding, Jingren Zhou

Links

Abstract / PDF / Code / Data

Why It Matters For Business

FS-LLM lets organizations co‑train LLMs across private data while cutting bandwidth and memory needs using PEFT and resource operators; this reduces cost and preserves IP when the full model must stay closed.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

This paper releases FS-LLM, an open-source system for fine-tuning large language models (LLMs) in federated learning (FL). FS-LLM bundles (1) LLM-BENCHMARKS: curated federated datasets and evaluation tasks, (2) LLM-ALGZOO: implementations of parameter-efficient fine-tuning (PEFT) methods (LoRA, P-tuning, prompt tuning) and a closed-model workflow (FedOT), and (3) LLM-TRAINER: hookable training operators (mixed precision, ZeRO, quantization, compression, offload) to reduce GPU and communication costs. Experiments (LLaMA-7B, OPT-2.7B) show LoRA is the strongest PEFT in FL, federated training beats local training, PEFT cuts communication from ~28GB to MBs, and FedOT works at light compression (

Problem Statement

Fine-tuning LLMs helps domain tasks but clashes with privacy and cost: clients cannot share data, full-parameter tuning is huge in bandwidth and memory, and some LLMs are closed-source so clients cannot access full models. Existing FL frameworks lack benchmarks, PEFT support, and resource-efficient operators tailored for federated LLM fine-tuning.
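The bandwidth side of this problem can be made concrete with back-of-the-envelope arithmetic. The sketch below compares a message carrying every parameter with one carrying only LoRA matrices. LLaMA-7B's hidden size (4096) and depth (32 layers) are public model facts; the nominal 7 billion parameter count, 32-bit storage, LoRA rank 8, and two adapted projections per layer are assumptions chosen for illustration, not settings reported in this summary.

```python
# Back-of-the-envelope message sizes. Rank, adapted projections per layer,
# and 4-byte storage are illustrative assumptions, not values from the paper.

BYTES_FP32 = 4


def full_model_bytes(num_params: int, bytes_per_param: int = BYTES_FP32) -> int:
    """Size of a message that carries every model parameter."""
    return num_params * bytes_per_param


def lora_bytes(hidden: int, layers: int, rank: int, targets_per_layer: int,
               bytes_per_param: int = BYTES_FP32) -> int:
    """Size of a message that carries only the LoRA matrices.

    Each adapted hidden x hidden projection gets A (rank x hidden) and
    B (hidden x rank), i.e. 2 * hidden * rank extra parameters.
    """
    params_per_target = 2 * hidden * rank
    return layers * targets_per_layer * params_per_target * bytes_per_param


full = full_model_bytes(7_000_000_000)  # nominal 7B parameters (assumption)
lora = lora_bytes(hidden=4096, layers=32, rank=8, targets_per_layer=2)

print(f"full model: {full / 1e9:.1f} GB")
print(f"LoRA only:  {lora / 1e6:.1f} MB")
print(f"reduction:  {full / lora:.0f}x")
```

Under these assumptions the full-parameter message is in the tens of gigabytes while the adapter message is in the tens of megabytes; different ranks or target modules shift the second number proportionally.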

Main Contribution

FS-LLM: an open-source package that bundles benchmarking data, PEFT algorithms, and training operators for federated LLM fine-tuning.

LLM-BENCHMARKS: curated federated versions of common corpora (Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3) plus evaluation pipelines (HumanEval, HELM, GSM8K-test).

Key Findings

LoRA gives the strongest PEFT results across domains in FL.

Numbers: Fed LLaMA-7B: LoRA 13.29% vs P-tuning 9.71% and prompt tuning 9.63% on Fed-CodeAlpaca (Pass@1)

Practical Use: When fine-tuning LLMs in a federated setting, make LoRA the default PEFT method for better accuracy on code, general-language, and chain-of-thought (CoT) tasks.

Evidence Ref: Table 2

Federated (collaborative) fine-tuning outperforms local-only fine-tuning.

Numbers: LLaMA-7B LoRA: Fed 13.29% vs Local 10.99% on Fed-CodeAlpaca (Pass@1)

Practical Use: Encourage clients to join FL; collaborative PEFT offers measurable gains over isolated local fine-tuning.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (HumanEval)	LoRA Fed: 13.29% ±0.10	Local LoRA: 10.99% ±0.77	+2.30 pp	Fed-CodeAlpaca	Federated LoRA outperforms local; Table 2	Table 2
HELM average score	LoRA Fed: 46.57% ±0.24	Prompt tuning Fed: 40.72% ±0.64	+5.85 pp	Fed-Dolly	LoRA yields higher generic language scores in FL; Table 2	Table 2
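The Delta column can be checked directly from the Value and Baseline columns. The short sketch below recomputes the absolute difference in percentage points and, as an extra that the table does not report, the relative gain over the baseline; the helper names are illustrative.

```python
# Mean scores copied from the Results table above (percent).
rows = [
    ("Pass@1 (HumanEval)", 13.29, 10.99),
    ("HELM average score", 46.57, 40.72),
]


def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(value - baseline, 2)


def relative_gain(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, in percent (not in the table)."""
    return round(100 * (value - baseline) / baseline, 1)


for name, value, baseline in rows:
    print(f"{name}: +{delta_pp(value, baseline):.2f} pp "
          f"({relative_gain(value, baseline):.1f}% relative)")
```

A gap measured in percentage points is easy to misread as a relative change, so it is worth keeping the two figures separate when quoting these results.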

What To Try In 7 Days

Clone the FS-LLM repo and run the Fed-CodeAlpaca benchmark on a single A100 to reproduce the reported results.

Switch to LoRA adapters for your federated workflows and measure message size and wall-clock time.

Test FedOT with a light emulator (≈20% layer drop) if you must protect a closed LLM provider model.
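The "≈20% layer drop" in the last item can be pictured as an index-selection problem. The sketch below assumes the usual offsite-tuning layout: a few layers at each end of the model stay trainable as adapters, and the frozen middle block is turned into an emulator by keeping layers at an even stride. The function name, the two-layer adapter size, and the rounding rule are hypothetical choices for illustration and are not FS-LLM's API.

```python
def split_for_offsite_tuning(num_layers: int, adapter_layers: int,
                             drop_ratio: float):
    """Return (bottom adapter, kept emulator layers, top adapter) as indices.

    Hypothetical helper: the middle block is thinned by keeping layers that
    are spread evenly across it, so roughly `drop_ratio` of them are dropped.
    """
    if not 0 <= drop_ratio < 1:
        raise ValueError("drop_ratio must be in [0, 1)")
    if 2 * adapter_layers >= num_layers:
        raise ValueError("adapters leave no layers for the emulator")

    bottom = list(range(adapter_layers))
    top = list(range(num_layers - adapter_layers, num_layers))
    middle = list(range(adapter_layers, num_layers - adapter_layers))

    keep = max(1, round(len(middle) * (1 - drop_ratio)))
    if keep == 1:
        kept = [middle[0]]
    else:
        # Evenly spaced positions from the first to the last middle layer.
        step = (len(middle) - 1) / (keep - 1)
        kept = [middle[round(i * step)] for i in range(keep)]
    return bottom, kept, top


# 32 decoder layers (LLaMA-7B depth); adapter size of 2 is an assumption.
for ratio in (0.2, 0.5):
    bottom, kept, top = split_for_offsite_tuning(32, 2, ratio)
    print(f"drop {ratio:.0%}: emulator keeps {len(kept)} of 28 middle layers")
```

Comparing the 20% and 50% settings this way shows how quickly the emulator thins out, which is the regime where the summary reports large accuracy drops.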

Optimization Features

Infra Optimization

CPU offloading, multi-GPU parallelism

Model Optimization

LoRA, offsite-tuning (FedOT) emulator

System Optimization

message streaming, DEFLATE/Gzip compression, quantization to 16/8-bit
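To illustrate how 8-bit quantization and DEFLATE compression combine to shrink a message, here is a minimal standard-library sketch. It applies uniform affine quantization to a list of floats and then compresses the resulting bytes with `zlib`. The synthetic Gaussian "update", the function names, and the quantization scheme are assumptions for illustration; this is not how FS-LLM's operators are implemented.

```python
import random
import struct
import zlib


def quantize_8bit(values):
    """Uniform affine quantization of floats to uint8 codes.

    Returns (codes as bytes, minimum value, step size).
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    codes = bytes(round((v - lo) / scale) for v in values)
    return codes, lo, scale


def dequantize_8bit(codes, lo, scale):
    """Map uint8 codes back to approximate float values."""
    return [lo + c * scale for c in codes]


random.seed(0)
# Stand-in for an adapter update; real updates have other distributions.
update = [random.gauss(0.0, 0.02) for _ in range(10_000)]

raw = struct.pack(f"{len(update)}f", *update)  # 32-bit floats
codes, lo, scale = quantize_8bit(update)       # 8 bits per value
packed = zlib.compress(codes, level=6)         # DEFLATE stream

restored = dequantize_8bit(zlib.decompress(packed), lo, scale)
max_err = max(abs(a - b) for a, b in zip(update, restored))

print(f"float32: {len(raw)} B, uint8: {len(codes)} B, deflated: {len(packed)} B")
print(f"max reconstruction error: {max_err:.2e} (step size {scale:.2e})")
```

Quantization is the lossy step and bounds the error at half a step; DEFLATE is lossless, and its gain depends on how repetitive the quantized codes are.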

Training Optimization

mixed-precision, gradient accumulation, DeepSpeed ZeRO offload, data parallelism
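Gradient accumulation is the listed operator most relevant when memory only allows a batch size of 1. The toy sketch below uses a one-parameter linear model with squared error and checks that summing scaled per-example gradients gives the same value as one full-batch gradient. The model, data, and function names are illustrative assumptions and have nothing to do with FS-LLM's trainer code.

```python
def grad_sample(w: float, x: float, y: float) -> float:
    """d/dw of the squared error (w * x - y) ** 2 for one example."""
    return 2 * (w * x - y) * x


def full_batch_grad(w: float, data) -> float:
    """Gradient of the mean loss computed over the whole batch at once."""
    return sum(grad_sample(w, x, y) for x, y in data) / len(data)


def accumulated_grad(w: float, data, accumulation_steps: int) -> float:
    """Process one example at a time (batch size 1), scaling each gradient
    so the running sum equals the mean over `accumulation_steps` examples."""
    acc = 0.0
    for x, y in data[:accumulation_steps]:
        acc += grad_sample(w, x, y) / accumulation_steps
    return acc


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

g_full = full_batch_grad(w, data)
g_acc = accumulated_grad(w, data, accumulation_steps=len(data))
print(abs(g_full - g_acc) < 1e-12)
```

The optimizer step is taken only after the last micro-batch, so peak memory stays at the batch-size-1 level while the update matches a larger effective batch.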

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/alibaba/FederatedScope/tree/llm

Data URLs

Fed-CodeAlpaca (derived from CodeAlpaca)

Fed-Dolly (derived from Databricks-dolly-15k)

Fed-GSM8K-3 (derived from GSM8K)

Alpaca, CleanedAlpaca

Risks & Boundaries

Limitations

Experiments use batch size 1 due to resource limits; results may change with larger batches.

Prompt design and initialization affect results; the paper fixes prompts for fairness.

When Not To Use

When clients are extremely resource-constrained (no GPU) — PEFT still needs nontrivial compute.

When you require full-model updates or bespoke architectures not supported by adapter hooks.

Failure Modes

Over-compressing closed-model emulators causes large accuracy drops (FedOT at 50%).

Validation loss may not predict final generalization — HPO driven by validation loss can fail.

Core Entities

Models

LLaMA-7B, OPT-2.7B

Metrics

Pass@1, HELM average (mixture of subtasks), Accuracy, Message size (bytes), GPU memory (MB), Computation time per step (sec)

Datasets

Fed-CodeAlpaca, Fed-Dolly, Fed-GSM8K-3, Fed-CodeSearchNet, Alpaca, CleanedAlpaca

Benchmarks

HumanEval, HELM, GSM8K-test
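Pass@1, listed among the metrics above, is usually computed with the unbiased pass@k estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below implements that formula; whether FS-LLM's evaluation pipeline uses exactly this estimator and which n it samples are assumptions not confirmed by this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: generations sampled, c: generations that pass the tests,
    k: budget of attempts. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Toy per-problem counts (made up for illustration).
toy = [(10, 2), (10, 0), (10, 10), (10, 5)]
print(f"pass@1 = {mean_pass_at_k(toy, k=1):.3f}")
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.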

