Split FFNs into sparse experts + a teacher-guided router to cut FLOPs and adapt LLMs with tiny data

August 15, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple and practical for small LLMs and shows clear FLOPs gains. Experiments cover only 1–1.4B models on limited hardware, so expect engineering effort to scale and validate it on larger production models.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Zhongyu Zhao, Menghang Dong, Rongyu Zhang, Wenzhao Zheng, Yunpeng Zhang, Huanrui Yang, Dalong Du, Kurt Keutzer, Shanghang Zhang

Links

Abstract / PDF / Code

Why It Matters For Business

FactorLLM can cut FFN compute and significantly lower inference costs while supporting fast domain adaptation with tiny datasets, which makes task-specific LLMs cheaper and faster to deploy.

Who Should Care

Summary TLDR

FactorLLM splits a pretrained transformer feed-forward layer (FFN) into equal-size sparse subnetworks that are treated as a Mixture-of-Experts (MoE). A small injected router is trained with a teacher-student Prior-Approximate Router (PAR) loss so that only a few experts activate per token. Configurations are named like 1R4E1K, read here as 1 router, 4 experts, and K=1 active expert per token. On TinyLlama/MobileLlama, FactorLLM reduces FFN FLOPs by up to ~75% (1R4E1K), lowers total compute by ~30–50% in some settings, and retains around 85% of the original accuracy after fine-tuning on very small amounts of data (0.03–0.04%). Code is available.

Problem Statement

Monolithic FFNs in transformers hold redundant, mixed knowledge and waste compute. We need a low-overhead way to split that knowledge so only task-relevant parts run at inference and the model can adapt with very little data.

Main Contribution

A simple factorization that permutes and partitions pretrained FFN weights into N equal subnetworks (experts) without changing weight values.

Prior-Approximate Router (PAR): a teacher-student routing loss that creates pseudo-labels from the original FFN to train a small injected router quickly.
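The factorization step can be illustrated with a minimal NumPy sketch. It assumes a gated (SwiGLU-style) FFN and a random neuron permutation; the function names `dense_ffn` and `split_ffn` are hypothetical, and the summary does not say how the paper chooses its permutation. The point it shows is only that reindexing and slicing hidden neurons leaves weight values untouched, so the sum of all expert outputs reproduces the dense FFN.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def dense_ffn(x, w_gate, w_up, w_down):
    """Gated FFN (assumed form): (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def split_ffn(w_gate, w_up, w_down, num_experts, perm=None):
    """Permute hidden neurons, then cut them into equal-size groups.

    Weights are only reindexed, never modified.
    """
    d_hidden = w_gate.shape[1]
    if d_hidden % num_experts != 0:
        raise ValueError("hidden size must be divisible by num_experts")
    if perm is None:
        perm = np.arange(d_hidden)
    groups = np.split(perm, num_experts)
    return [(w_gate[:, g], w_up[:, g], w_down[g, :]) for g in groups]

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4
w_gate = rng.normal(size=(d_model, d_hidden))
w_up = rng.normal(size=(d_model, d_hidden))
w_down = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=(5, d_model))

# Random permutation used purely for illustration.
experts = split_ffn(w_gate, w_up, w_down, n_experts,
                    perm=rng.permutation(d_hidden))

full = dense_ffn(x, w_gate, w_up, w_down)
summed = sum(dense_ffn(x, *e) for e in experts)
print(np.allclose(full, summed))  # True: all experts together equal the dense FFN
```

Because the experts partition the hidden dimension, running every expert is equivalent to the original layer; savings come only from skipping experts at inference.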

Key Findings

Large FFN FLOPs can be cut heavily by activating fewer experts.

Numbers: FFN GFLOPs reduced ~75% for 1R4E1K

Practical Use: If you run only 1 of 4 experts per token you can slash FFN compute and lower inference costs; expect attention to become the new compute bottleneck.

Evidence Ref: Section 4.3; Figure 3

Tradeoff: big FLOPs savings with modest accuracy loss.

Numbers: Total compute reduced ~50% and accuracy retained >85% (1R4E2K)

Practical Use: Use a 1R4E2K setup when you want ~half the compute with ~85% of original accuracy on standard NLU tasks.

Evidence Ref: Abstract; Section 4.3; Table 2
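The arithmetic behind these findings can be sketched as a back-of-the-envelope estimate. It assumes equal-size experts, ignores router overhead, and takes the FFN share of total FLOPs as a hypothetical input (`ffn_share`), so it is an idealized model and does not attempt to reproduce the measured profile in Figure 3.

```python
def ffn_flops_reduction(num_experts: int, top_k: int) -> float:
    """Idealized FFN FLOPs saved when only top_k of num_experts equal experts run."""
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return 1.0 - top_k / num_experts

def total_flops_reduction(num_experts: int, top_k: int, ffn_share: float) -> float:
    """Total FLOPs saved, given the fraction of dense FLOPs spent in FFNs.

    ffn_share is a hypothetical input; measure it on your own model.
    """
    return ffn_share * ffn_flops_reduction(num_experts, top_k)

def non_ffn_share_after(num_experts: int, top_k: int, ffn_share: float) -> float:
    """Share of the remaining compute taken by everything outside the FFNs."""
    remaining = 1.0 - total_flops_reduction(num_experts, top_k, ffn_share)
    return (1.0 - ffn_share) / remaining

print(ffn_flops_reduction(4, 1))  # 0.75, matching the ~75% figure for 1R4E1K
print(ffn_flops_reduction(4, 2))  # 0.5 for 1R4E2K (FFN only)

# Illustrative only: with an assumed 60% FFN share, non-FFN work (mostly
# attention) grows from 40% of dense compute to a larger share afterwards.
print(round(non_ffn_share_after(4, 1, ffn_share=0.6), 3))
```

The last line shows why attention becomes the new bottleneck: the less FFN compute remains, the larger the fraction of total work that FFN factorization cannot touch.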

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
FFN GFLOPs reduction | ~75% (FFN) for 1R4E1K | dense FFN (K=N) | -75% GFLOPs | compute profile (Figure 3) | Section 4.3 reports ~75% FFN GFLOPs reduction for 1R4E1K. | Section 4.3; Figure 3
Total compute reduction | ~50% (1R4E2K) | original model compute | ~-50% total FLOPs | compute profile (Figure 3) | Section 4.3 states nearly 50% reduction under 1R4E2K while retaining >85% accuracy. | Section 4.3

What To Try In 7 Days

Take a small LLM (TinyLlama/MobileLlama), permute and split FFN into 4 experts and implement a TopK router.

Train the injected router via PAR using a frozen teacher FFN and a small in-domain dataset (tens of millions of tokens).

Benchmark 1R4E2K and 1R4E1K: measure GFLOPs, latency, and accuracy to pick tradeoff for production.
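A minimal sketch of the TopK routing step from the list above, in NumPy. It assumes a linear router, gated (SwiGLU-style) experts, and an unweighted sum of the selected experts' outputs; the summary does not state how FactorLLM combines expert outputs, so treat that choice and the names `topk_route` and `sparse_ffn` as hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def gated_ffn(x, w_gate, w_up, w_down):
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def topk_route(x, w_router, k):
    """Indices of the k highest-scoring experts per token, shape (tokens, k)."""
    logits = x @ w_router                      # (tokens, num_experts)
    return np.argsort(-logits, axis=1)[:, :k]

def sparse_ffn(x, experts, w_router, k):
    """Run only the routed experts for each token and sum their outputs.

    The unweighted sum is an assumption made for this sketch.
    """
    chosen = topk_route(x, w_router, k)
    out = np.zeros((x.shape[0], experts[0][2].shape[1]))
    for e, weights in enumerate(experts):
        mask = (chosen == e).any(axis=1)       # tokens routed to expert e
        if mask.any():
            out[mask] += gated_ffn(x[mask], *weights)
    return out

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4
w_gate = rng.normal(size=(d_model, d_hidden))
w_up = rng.normal(size=(d_model, d_hidden))
w_down = rng.normal(size=(d_hidden, d_model))
experts = [(w_gate[:, g], w_up[:, g], w_down[g, :])
           for g in np.split(np.arange(d_hidden), n_experts)]
w_router = rng.normal(size=(d_model, n_experts))  # untrained, random router
x = rng.normal(size=(6, d_model))

dense = gated_ffn(x, w_gate, w_up, w_down)
all_on = sparse_ffn(x, experts, w_router, k=n_experts)
print(np.allclose(dense, all_on))  # True: K=N recovers the dense FFN

top1 = sparse_ffn(x, experts, w_router, k=1)      # 1R4E1K-style forward pass
top2 = sparse_ffn(x, experts, w_router, k=2)      # 1R4E2K-style forward pass
```

Checking that K=N reproduces the dense output is a cheap sanity test before benchmarking the sparse settings.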

Optimization Features

Token Efficiency: converges with ~30M–50M tokens vs pretraining scale
Model Optimization: FFN factorization into equal-size experts; sparse expert activation (MoE-style)
Training Optimization: Prior-Approximate Router (PAR) teacher-student loss; freeze teacher FFNs and fine-tune only routers+experts; few-step fine-tuning on small datasets (0.03–0.04% data)
Inference Optimization: TopK router to activate K experts per token; reduced FFN compute via sparse activation
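One way to picture the teacher-student router training is sketched below. This is not the paper's exact PAR loss, which the summary does not spell out. The sketch assumes the pseudo-label for each token is the single expert whose output is closest (L2) to the frozen teacher FFN output, and that a linear router is trained with cross-entropy on those labels; only the router is updated here, and all names are hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def gated_ffn(x, w_gate, w_up, w_down):
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def pseudo_labels(teacher_out, expert_outs):
    """Assumed labeling rule: pick the expert closest to the teacher output.

    teacher_out: (tokens, d_model); expert_outs: (num_experts, tokens, d_model).
    """
    err = np.linalg.norm(expert_outs - teacher_out[None], axis=-1)
    return err.argmin(axis=0)

def router_loss_and_grad(x, w_router, labels):
    """Mean cross-entropy of a linear router against pseudo-labels, plus gradient."""
    logits = x @ w_router
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    n = x.shape[0]
    loss = -np.log(probs[np.arange(n), labels]).mean()
    probs[np.arange(n), labels] -= 1.0                   # softmax minus one-hot
    return loss, x.T @ probs / n

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4
w_gate = rng.normal(size=(d_model, d_hidden))
w_up = rng.normal(size=(d_model, d_hidden))
w_down = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=(64, d_model))

# Frozen teacher: the original dense FFN. Experts: slices of the same weights.
teacher_out = gated_ffn(x, w_gate, w_up, w_down)
expert_outs = np.stack([
    gated_ffn(x, w_gate[:, g], w_up[:, g], w_down[g, :])
    for g in np.split(np.arange(d_hidden), n_experts)
])
labels = pseudo_labels(teacher_out, expert_outs)

w_router = np.zeros((d_model, n_experts))
losses = []
for _ in range(50):                                      # plain gradient descent
    loss, grad = router_loss_and_grad(x, w_router, labels)
    losses.append(loss)
    w_router -= 0.1 * grad
print(losses[0] > losses[-1])
```

The design point is that the labels come for free from the pretrained FFN, so the router gets a direct supervised signal from its first step rather than having to discover a useful allocation by trial.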

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments limited to small LLMs (TinyLlama, MobileLlama); results may differ on larger models.

When FFN compute is reduced, attention becomes the new bottleneck and accuracy can drop (noted in Section 4.3).

When Not To Use

When you require full original accuracy for every task (FactorLLM incurs up to ~15% relative drop in some configs).

If attention layers dominate your compute and you cannot change them, FFN factorization yields limited overall speedup.

Failure Modes

Experts collapse into similar modules if no router training is used, reducing diversity and benefit (Ex0 vs Ex3).

Router allocation instability early in training; needs PAR pseudo-labels to stabilize (Section 4.4).
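A generic diagnostic for watching router allocation during training is sketched below. It is not taken from the paper; `expert_usage` and `usage_entropy` are hypothetical helper names. It computes the fraction of routed slots each expert receives and a normalized entropy, where values near 1 mean balanced use and values near 0 mean the router sends almost everything to one expert.

```python
import numpy as np

def expert_usage(chosen, num_experts):
    """Fraction of routed slots per expert. chosen: int array of shape (tokens, k)."""
    counts = np.bincount(np.asarray(chosen).ravel(), minlength=num_experts)
    return counts / counts.sum()

def usage_entropy(usage):
    """Entropy of the usage distribution, normalized by log(num_experts)."""
    p = usage[usage > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(usage)))

balanced = np.array([[0], [1], [2], [3]])   # each expert gets one token
collapsed = np.array([[0], [0], [0], [0]])  # one expert takes every token

print(expert_usage(balanced, 4))
print(expert_usage(collapsed, 4))
print(usage_entropy(expert_usage(balanced, 4)) >
      usage_entropy(expert_usage(collapsed, 4)))  # True
```

Logging these two numbers every few hundred steps makes early allocation swings visible, though it says nothing about whether the experts themselves have become similar.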

Core Entities

Models

TinyLlama; MobileLlama; FactorLLM (1R4E2K, 1R4E1K, 1R4E3K variants)

Metrics

Accuracy; GFLOPs (attention, FFN); Relative maintenance (%)

Datasets

Pajama (subset used for training)

Benchmarks

HellaSwag, OpenBookQA, Winogrande, ARC-Easy, ARC-Challenge, BoolQ, PIQA, MMLU