Train LLMs on private data with federated learning; OpenFedLLM shows FL beats local training and can beat GPT‑4 in finance

February 10, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

6

Authors

Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen

Why It Matters For Business

Companies with private domain data can jointly fine-tune LLMs without sharing raw data and get measurable gains over training alone. Finance firms, hospitals, and other holders of sensitive data can build strong domain models this way.

Summary TLDR

This paper builds OpenFedLLM, a research-friendly framework for fine-tuning large language models (LLMs) with federated learning (FL) on private, distributed data. It implements federated instruction tuning and federated value alignment (via DPO), and ships seven FL algorithms, LoRA-based parameter-efficient fine-tuning (PEFT), int8 quantization, eight training datasets, and 30+ evaluation metrics. Key empirical results: FL consistently beats single-client (local) training across domains; in the general-domain setting the authors report a ≥12% relative improvement on MT-Bench; on finance tasks, FL models fine-tuned from Llama2-7B beat GPT-4 on the evaluated benchmarks. Training is practical: with LoRA + int8 they run FL on one RTX 3090 and report ~1–2 hours per client for 100 rounds.

Problem Statement

Public high-quality data for LLMs is becoming scarce, while useful private datasets sit siloed across organizations. Small parties cannot fine-tune strong LLMs on their own. We need a privacy-preserving, practical way to pool private instruction and preference data to improve LLMs without sharing raw data.

Main Contribution

OpenFedLLM framework: integrates federated instruction tuning (FedIT), federated value alignment (FedVA via DPO), 7 FL algorithms, 8 datasets and 30+ metrics.

Comprehensive empirical study across domains (general, finance, medical, code, math) showing FL consistently improves over local training and can exceed GPT-4 in a finance benchmark.

Practical recipe: LoRA PEFT + int8 quantization + memory-saving tricks to run federated LLM fine-tuning on a single consumer GPU.
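To make the aggregation step behind federated instruction tuning concrete, here is a minimal sketch of FedAvg applied to adapter parameters. It assumes each client returns its trainable parameters as a dict of flattened float lists; the function name `fedavg` and the tensor names `lora_A` and `lora_B` are hypothetical and not taken from the OpenFedLLM codebase. Weighting by local sample count is the standard FedAvg rule.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values


def fedavg(client_params: List[Params], num_samples: List[int]) -> Params:
    """Sample-weighted average of client parameters (standard FedAvg)."""
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need at least one client and one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Params = {}
    for name in client_params[0]:
        avg = [0.0] * len(client_params[0][name])
        for params, n in zip(client_params, num_samples):
            weight = n / total  # clients with more data count more
            for i, value in enumerate(params[name]):
                avg[i] += weight * value
        aggregated[name] = avg
    return aggregated


# Hypothetical adapter tensors from two clients, flattened for illustration.
clients = [
    {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]},
    {"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0]},
]
global_params = fedavg(clients, num_samples=[100, 300])
print(global_params)  # {'lora_A': [2.5, 5.0], 'lora_B': [1.5, 1.0]}
```

Because only the adapter parameters travel between clients and server, the per-round communication stays small relative to the 7B base model. Averaging the two LoRA factors separately is not the same as averaging their product; the sketch ignores that subtlety.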

Key Findings

Federated learning consistently improves over single-client local fine-tuning across tasks.

Numbers: multiple tables, e.g., Table 4 MT-Avg: FedAvg 3.346 vs Local 2.844 (open-ended)

On a general instruction-tuning evaluation (MT-Bench) they report at least a 12% relative improvement.

Numbers: paper claim: “≥ 12% improvement on MT-Bench on general dataset”

On finance benchmarks, FL-trained Llama2-7B models outperform GPT-4 on evaluated metrics.

Numbers: Table 5: FedAvg Avg:4 Acc 0.791 vs GPT-4 Avg:3 Acc ~0.731 (evaluated set)

Training is feasible on consumer hardware using PEFT and quantization.

Numbers: int8 + LoRA; reported 1–2 hours per client for 100 rounds on an RTX 3090

No single FL algorithm dominates across all domains and metrics.

Numbers: Different winners: FedYogi/SCAFFOLD in general, SCAFFOLD in finance, FedAdagrad in code, FedAdam sometimes best in med (

Results

MT-Avg (open-ended)

Value: FedAvg 3.346 vs Local 2.844

Baseline: Local training

Accuracy

Value: FedAvg Avg:4 Acc 0.791 vs Local 0.699

Baseline: Local training

Finance vs GPT-4

Value: FedAvg outperforms GPT-4 on evaluated finance benchmarks

Baseline: GPT-4

Compute per client

Value: 1–2 hours per client for 100 rounds on RTX 3090

Baseline: N/A
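The results are quoted as absolute scores, so it can help to turn them into relative gains. The short calculation below uses only the figures reported in this summary; the helper name `relative_improvement` is hypothetical. The GPT-4 figure is approximate and labeled Avg:3 while the FedAvg figure is labeled Avg:4, so the two averages may not cover the same benchmark set and that comparison is only indicative.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, returned as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Figures as quoted in this summary.
mt_avg_gain = relative_improvement(3.346, 2.844)     # FedAvg vs Local, MT-Avg
acc_gain_local = relative_improvement(0.791, 0.699)  # FedAvg vs Local, accuracy
acc_gain_gpt4 = relative_improvement(0.791, 0.731)   # FedAvg vs GPT-4 (~0.731)

print(f"MT-Avg, FedAvg vs Local:   {mt_avg_gain:.1%}")     # 17.7%
print(f"Accuracy, FedAvg vs Local: {acc_gain_local:.1%}")  # 13.2%
print(f"Accuracy, FedAvg vs GPT-4: {acc_gain_gpt4:.1%}")   # 8.2%
```

The MT-Avg gain of roughly 17.7% is consistent with the paper's "≥ 12%" claim for the general setting.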

Who Should Care

What To Try In 7 Days

Run OpenFedLLM with FedAvg + LoRA on a small private dataset (1–5k samples) using int8 on a 3090 to validate gains vs local fine-tuning.

Compare 2–3 FL algorithms (e.g., FedAvg, SCAFFOLD, FedAdagrad) on your domain benchmark to pick the best aggregator.

If alignment matters, try FedDPO on a small preference set to improve harmlessness/helpfulness before deployment.
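For the alignment step, it may help to see what each client would optimize locally. The following is a minimal sketch of the standard DPO loss for a single preference pair, assuming the sequence log-probabilities under the trainable policy and a frozen reference model have already been computed. The function name, argument names, and `beta=0.1` are illustrative choices, not values from OpenFedLLM.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative temperature, not taken from the paper
) -> float:
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
improved_loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline_loss, improved_loss)
```

In a federated setup, each client would minimize this loss on its own preference pairs and send back only the updated adapter parameters for aggregation.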

Optimization Features

Token Efficiency

  • max sequence length 512
  • LoRA

Infra Optimization

  • single-GPU training feasible for prototyping (3090), A100 used for larger runs

Model Optimization

  • int8 quantization

System Optimization

  • memory-saving techniques to fit 7B model on one 3090
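A back-of-the-envelope calculation shows why int8 matters for fitting a 7B model on a 24 GB card. The sketch below counts weight storage only and assumes a nominal 7,000,000,000 parameters; activations, LoRA adapters, optimizer state, and framework overhead are ignored, so real usage will be higher.

```python
GIB = 1024 ** 3

NUM_PARAMS = 7_000_000_000  # nominal "7B"; assumption, the exact count differs
GPU_MEMORY_GIB = 24         # RTX 3090 memory, treated as GiB for simplicity


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    needed = weight_memory_gib(NUM_PARAMS, nbytes)
    headroom = GPU_MEMORY_GIB - needed
    print(f"{label}: {needed:5.1f} GiB weights, {headroom:6.1f} GiB left")
# fp32 weights (~26.1 GiB) exceed the card; int8 weights (~6.5 GiB) leave
# most of the memory for activations and the LoRA training state.
```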

Training Optimization

  • LoRA
  • cosine LR schedule
  • server-side adaptive optimizers (FedAdam/FedAdagrad/FedYogi)
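Server-side adaptive optimizers treat the averaged client update as a pseudo-gradient and apply an Adam-style step on the server. The following is a minimal sketch of a FedAdam-style update on a flat parameter list; the class name, hyperparameter values, zero initialization of the moments, and unweighted client averaging are all assumptions of this sketch rather than OpenFedLLM's settings.

```python
import math
from typing import List


class FedAdamServer:
    """Sketch of a FedAdam-style server update on a flat parameter vector."""

    def __init__(self, init_params: List[float], lr: float = 0.1,
                 beta1: float = 0.9, beta2: float = 0.99, tau: float = 1e-3):
        self.params = list(init_params)
        self.lr, self.beta1, self.beta2, self.tau = lr, beta1, beta2, tau
        self.m = [0.0] * len(self.params)  # first moment of the pseudo-gradient
        self.v = [0.0] * len(self.params)  # second moment of the pseudo-gradient

    def step(self, client_params: List[List[float]]) -> List[float]:
        num_clients = len(client_params)
        for i, current in enumerate(self.params):
            # Pseudo-gradient: mean client movement away from the global model.
            delta = sum(p[i] - current for p in client_params) / num_clients
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * delta
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * delta ** 2
            self.params[i] = current + self.lr * self.m[i] / (math.sqrt(self.v[i]) + self.tau)
        return self.params


server = FedAdamServer(init_params=[0.0])
updated = server.step([[1.0], [3.0]])  # both clients moved in the positive direction
print(updated)
```

FedAdagrad and FedYogi follow the same pattern and differ in how the second-moment term `v` is accumulated, which is one reason the algorithms can rank differently across domains.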

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus on LoRA-style PEFT and 7B models; conclusions may not hold for full-parameter training or much larger models.
  • Client sampling and IID splits appear in many experiments; performance under extreme non-IID, cross-device, or massive-client settings is less explored.
  • Security, privacy leakage, and stealthy malicious clients are discussed but not fully solved; defense effectiveness in FedLLM remains open.
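To clarify what "IID splits" and "client sampling" mean in practice, here is a minimal sketch of both steps. The function names and the numbers used (1,000 examples, 20 clients, 2 clients per round) are hypothetical and chosen only for illustration. A non-IID experiment would replace the uniform shuffle with a split that skews each client toward particular topics or labels.

```python
import random
from typing import List


def iid_partition(num_examples: int, num_clients: int, seed: int = 0) -> List[List[int]]:
    """Shuffle example indices and deal them out evenly (an IID split)."""
    rng = random.Random(seed)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return [indices[i::num_clients] for i in range(num_clients)]


def sample_clients(num_clients: int, clients_per_round: int,
                   round_idx: int, seed: int = 0) -> List[int]:
    """Pick a reproducible random subset of clients for one round."""
    rng = random.Random(seed * 100_003 + round_idx)
    return sorted(rng.sample(range(num_clients), clients_per_round))


shards = iid_partition(num_examples=1000, num_clients=20)
for round_idx in range(3):
    chosen = sample_clients(num_clients=20, clients_per_round=2, round_idx=round_idx)
    sizes = [len(shards[c]) for c in chosen]
    print(f"round {round_idx}: clients {chosen}, shard sizes {sizes}")
```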

When Not To Use

  • If you need full-model fine-tuning or plan to pretrain from scratch (this framework targets instruction tuning and value alignment).
  • When clients cannot run the required LoRA/quantized stacks or lack stable compute/communication for repeated rounds.

Failure Modes

  • Heterogeneous client preferences can degrade global model usefulness for individual clients (need personalization).
  • Malicious clients with logically correct but harmful examples can poison alignment unless robust defenses are applied.
  • Differential privacy or stricter privacy settings may reduce utility; unintended memorization remains a risk.

Core Entities

Models

  • Llama2-7B
  • GPT-4
  • GPT-3.5
  • Vicuna
  • Wizard-Vicuna

Metrics

  • Accuracy
  • F1
  • Pass@1
  • BLEU
  • MT-Avg
  • Vicuna
  • MMLU score

Datasets

  • Alpaca
  • Alpaca-GPT4
  • FinGPT (sentiment)
  • MedAlpaca
  • Code-Alpaca
  • MathInstruct
  • UltraFeedback
  • HH-RLHF

Benchmarks

  • MT-Bench
  • Vicuna-Bench
  • MMLU
  • BBH
  • HumanEval
  • MBPP
  • GSM8K
  • FPB
  • FiQA-SA
  • TFNS
  • NWGI