Freeze pretrained MoE experts, aggregate only shared layers, and graft one personalized expert per client for efficient federated LLM tuning

June 1, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

Why It Matters For Business

FLEx lowers federated communication and avoids corrupting pretrained knowledge, enabling client-specific LLM behavior with smaller bandwidth and safer global models.

Summary TLDR

FLEx is a federated fine-tuning recipe for Mixture-of-Experts (MoE) LLMs that freezes pretrained experts, aggregates only the dense shared layers, and creates a single lightweight per-client expert by pruning ("grafting"). Clients train their grafted expert and a small sigmoid gate locally while the server averages only non-expert parameters. On instruction-following and knowledge benchmarks, FLEx improves average ROUGE-L (43.13 vs 42.37 for the best federated baseline) and yields the highest MMLU among compared methods (49.74). The method cuts communication costs and avoids corrupting pretrained expert knowledge, but it limits personalization to one grafted expert per layer and assumes cross-s

Problem Statement

Federated tuning of MoE LLMs is hard because (1) experts are sparsely activated and different clients use different experts, so naive aggregation forces sending all expert weights (huge communication cost) and (2) averaging expert weights across heterogeneous clients can corrupt pretrained world knowledge (catastrophic forgetting).
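To make the communication argument concrete, the sketch below compares the per-round upload when a client sends every parameter against sending only the non-expert ones. The parameter counts, the group names, and the 2-byte (fp16/bf16) width are illustrative assumptions, not figures from the paper.

```python
def upload_bytes(param_counts, bytes_per_param=2, include_experts=True):
    """Bytes one client uploads per round.

    param_counts maps a parameter group name to its number of parameters.
    Groups whose name starts with "expert" are treated as MoE expert weights
    (a naming convention assumed for this sketch only).
    """
    total = 0
    for group, count in param_counts.items():
        if group.startswith("expert") and not include_experts:
            continue  # expert weights stay on the client
        total += count * bytes_per_param
    return total


# Hypothetical layer budget in which experts dominate the parameter count.
counts = {"attention": 1_000, "embeddings": 10, "experts": 9_000}
full = upload_bytes(counts)
shared_only = upload_bytes(counts, include_experts=False)
print(full, shared_only, shared_only / full)
```

The ratio printed at the end depends entirely on how much of the model sits in expert weights, so the saving should be re-estimated for each architecture.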

Main Contribution

Selective aggregation: only aggregate dense non-expert parameters and keep pretrained experts frozen to cut communication and protect world knowledge.

Expert grafting: build one lightweight personalized expert per client by pruning components from frozen pretrained experts using a reconstruction loss.

Adaptive integration: train a small sigmoid gate plus the grafted expert and shared non-expert layers locally so the model learns when to use shared vs personalized expertise.
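The adaptive integration step can be sketched as a scalar sigmoid gate that scales the grafted expert's output before it is added to the frozen MoE output. The additive form, the linear gate over the hidden state, and all function and argument names are assumptions made for illustration; the summary does not give the exact formula.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_layer_output(hidden, frozen_out, grafted_out, gate_weights, gate_bias=0.0):
    """Blend the frozen MoE output with the grafted expert output.

    Returns (output, gate), where output = frozen_out + gate * grafted_out
    and gate = sigmoid(gate_weights . hidden + gate_bias).
    """
    logit = sum(w * h for w, h in zip(gate_weights, hidden)) + gate_bias
    gate = sigmoid(logit)
    output = [f + gate * g for f, g in zip(frozen_out, grafted_out)]
    return output, gate


# Toy token: a zero-initialized gate gives the grafted expert half weight.
out, gate = gated_layer_output(
    hidden=[1.0, -1.0],
    frozen_out=[1.0, 1.0],
    grafted_out=[2.0, 4.0],
    gate_weights=[0.0, 0.0],
)
print(out, gate)
```

With a strongly negative gate logit the layer falls back to the frozen experts, which is one plausible reading of why a learned gate lets the model decide when personalized expertise is useful.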

Key Findings

FLEx improves average instruction-following quality over federated baselines.

Numbers: ROUGE-L avg 43.13 (FLEx) vs 42.37 (best federated baseline in Table 1)

FLEx preserves general knowledge while personalizing.

Numbers: MMLU 49.74 (FLEx), the highest among compared methods (Table 1)

The adaptive gate is essential; inserting a grafted expert without it collapses performance.

Numbers: ROUGE-L drops to 5.16 without the gate vs 43.14 with full FLEx (Table 3)

Personalizing more experts brings tiny gains but costs explode.

Numbers: pruning time of 58 s for 1 expert, 3,246 s for 2 experts, and 87,091 s for 3 experts (Table 7)

FLEx achieves better expert load balance.

Numbers: expert activation std 1259.88 (FLEx) vs 1859.80 (original) and 1906.85 (FedAvg) on C4 (Section C.6)
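For readers who want to reproduce a load-balance number like the one above, a minimal sketch follows: count how many tokens the router sends to each expert, then take the standard deviation of those counts. Whether the paper uses the population or the sample standard deviation is not stated in this summary; the population form is assumed here, and the token counts are made up.

```python
from statistics import pstdev


def activation_std(token_counts):
    """Population standard deviation of per-expert token counts.

    Lower values mean tokens are spread more evenly across experts.
    """
    return pstdev(token_counts)


# Hypothetical routing counts for four experts over the same 1,000 tokens.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]
print(activation_std(balanced), activation_std(skewed))
```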

Results

ROUGE-L (avg, pathological non-IID)

Value: 43.13 (FLEx)

Baseline: 42.37 (MoE+FedAvgM)

MMLU (knowledge retention)

Value: 49.74 (FLEx)

Baseline: 47.06 (MoE+FedAdagrad)

Vicuna Helpfulness

Value: 6.360 (FLEx)

Baseline: 5.145 (Base model)

Vicuna Harmlessness

Value: 7.993 (FLEx)

Baseline: 6.319 (Base model)

Ablation: graft w/o gate

Value: ROUGE-L 5.16

Baseline: FedAvg 41.17

Who Should Care

What To Try In 7 Days

Pick an open-source MoE LLM (e.g., Qwen1.5-MoE) and freeze its pretrained experts.

Modify training loop to aggregate only non-expert layers and keep experts local.

Implement one-shot expert selection by minimizing reconstruction loss on local validation data, then prune to form a grafted expert for each client layer, with a sigmoid gate f
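The selection step above can be prototyped in a few lines. This sketch scores each frozen expert by mean squared error against a reconstruction target and keeps the best one. It is a simplification under stated assumptions: the target is taken to be the frozen layer's output on local validation inputs, whole experts are compared rather than pruned components, and every name is hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equally shaped lists of vectors."""
    squared = [
        (p - t) ** 2
        for pred_vec, target_vec in zip(pred, target)
        for p, t in zip(pred_vec, target_vec)
    ]
    return sum(squared) / len(squared)


def select_expert(expert_outputs, target):
    """Return (expert_id, loss) for the expert that best reconstructs target.

    expert_outputs maps an expert id to its outputs on the validation batch.
    """
    losses = {eid: mse(out, target) for eid, out in expert_outputs.items()}
    best = min(losses, key=losses.get)
    return best, losses[best]


# Toy validation batch of two tokens with 2-dimensional outputs.
target = [[1.0, 0.0], [0.0, 1.0]]
expert_outputs = {
    "e0": [[0.0, 0.0], [0.0, 0.0]],
    "e1": [[1.0, 0.0], [0.0, 0.5]],
}
best_id, best_loss = select_expert(expert_outputs, target)
print(best_id, best_loss)
```

Because the scoring runs once over cached expert outputs, it needs no gradient steps, which is the sense in which the selection is one-shot.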

Optimization Features

Infra Optimization

  • lower bandwidth and GPU compute for federated rounds vs full-expert aggregation

Model Optimization

  • freeze pretrained experts to protect knowledge
  • prune/graft expert components based on local data

System Optimization

  • reduced communication by sending only non-expert parameters
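A server-side sketch of this selective aggregation, using plain Python lists in place of tensors. The parameter naming scheme, the marker substrings, the choice to keep the gate local, and the sample-count weighting are assumptions for illustration; only the idea of averaging non-expert parameters comes from the source.

```python
# Substrings marking parameters that never leave the client (hypothetical names).
LOCAL_MARKERS = (".experts.", ".grafted_expert.", ".graft_gate.")


def is_shared(name):
    """True for parameters the server is allowed to average."""
    return not any(marker in name for marker in LOCAL_MARKERS)


def aggregate_shared(client_states, client_sizes):
    """Sample-weighted average of shared parameters only."""
    total = sum(client_sizes)
    merged = {}
    for name in client_states[0]:
        if not is_shared(name):
            continue  # frozen and personalized weights are never uploaded
        dim = len(client_states[0][name])
        merged[name] = [
            sum(state[name][i] * size
                for state, size in zip(client_states, client_sizes)) / total
            for i in range(dim)
        ]
    return merged


client_a = {
    "layers.0.attn.weight": [0.0, 4.0],
    "layers.0.mlp.experts.3.weight": [9.0],
    "layers.0.mlp.grafted_expert.weight": [1.0],
}
client_b = {
    "layers.0.attn.weight": [4.0, 0.0],
    "layers.0.mlp.experts.3.weight": [9.0],
    "layers.0.mlp.grafted_expert.weight": [2.0],
}
global_update = aggregate_shared([client_a, client_b], client_sizes=[1, 3])
print(global_update)
```

In a real client the same filter would be applied before upload, so expert and grafted weights are never serialized in the first place.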

Training Optimization

  • train only non-expert layers plus one grafted expert and gate locally
  • LoRA

Inference Optimization

  • leverage MoE sparsity; grafted expert runs in parallel with frozen experts

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Only one personalized expert per layer is grafted; authors note scaling to multiple experts is costly.
  • Experiments use cross-silo (server-like) clients and public datasets; cross-device heterogeneity and privacy trade-offs are not fully explored.
  • Grafting depends on good pruning/selection; poor choices can underperform.
  • Communication-cost numbers are relative in paper tables and depend on model architecture and activation sparsity.

When Not To Use

  • If clients can centrally fine-tune and share experts (no bandwidth constraint), simpler full-finetuning may be preferable.
  • If you need richer personalization that requires multiple personalized experts per layer and clients have abundant compute.
  • When strict real-time on-device latency forbids any additional gating or parallel expert execution.

Failure Modes

  • Inserting a grafted expert without the adaptive sigmoid gate can collapse model performance (see ablation).
  • Poor expert selection can waste local compute and offer no personalization gains.
  • If the pretrained experts are mismatched to client tasks, freezing them may limit achievable personalization.

Core Entities

Models

  • Qwen1.5-MoE-A2.7B
  • DeepSeek-MoE-16B-Base

Metrics

  • ROUGE-L
  • MMLU
  • Helpfulness
  • Harmlessness
  • Expert activation std

Datasets

  • Databricks-dolly-15k
  • Alpaca-gpt4
  • Finance-Alpaca
  • MedAlpaca
  • C4

Benchmarks

  • MMLU
  • Vicuna