Split FFNs into sparse experts + a teacher-guided router to cut FLOPs and adapt LLMs with tiny data

Overview

Decision Snapshot: Needs Validation

The method is simple and practical for small LLMs and shows clear FLOPs gains, but the experiments cover only 1–1.4B-parameter models on limited hardware, so expect engineering effort to scale it and validate it on larger production models.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Zhongyu Zhao, Menghang Dong, Rongyu Zhang, Wenzhao Zheng, Yunpeng Zhang, Huanrui Yang, Dalong Du, Kurt Keutzer, Shanghang Zhang


Why It Matters For Business

FactorLLM can cut FFN compute and significantly lower inference costs while supporting fast domain adaptation with tiny datasets, which makes task-specific LLMs cheaper and faster to deploy.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

FactorLLM splits a pretrained transformer feed-forward layer (FFN) into equal-size sparse subnetworks that are treated as Mixture-of-Experts (MoE) experts. A small injected router is trained with a teacher-student Prior-Approximate Router (PAR) loss so that only a few experts activate per token. On TinyLlama and MobileLlama, FactorLLM cuts FFN FLOPs by up to ~75% in the 1R4E1K configuration (one router, four experts, one expert active per token), lowers total compute by ~30–50% in some settings, and retains around 85% of the original accuracy after fine-tuning on only 0.03–0.04% of the training data. Code is available.

Problem Statement

Monolithic FFNs in transformers hold redundant, mixed knowledge and waste compute. We need a low-overhead way to split that knowledge so only task-relevant parts run at inference and the model can adapt with very little data.

Main Contribution

A simple factorization that permutes and partitions pretrained FFN weights into N equal subnetworks (experts) without changing weight values.

Prior-Approximate Router (PAR): a teacher-student routing loss that creates pseudo-labels from the original FFN to train a small injected router quickly.
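The factorization idea can be illustrated with a minimal sketch. It assumes a plain two-matrix ReLU FFN and a random permutation of hidden neurons; both are simplifications for illustration and are not taken from the paper (Llama-style gated FFNs can be split along the hidden dimension in the same way). The helper names `ffn`, `factorize`, and `moe_ffn` are hypothetical. The point it shows: because weight values are only regrouped, running all experts and summing their outputs reproduces the dense FFN.

```python
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vec)) for row in matrix]

def ffn(w_in, w_out, x):
    """Assumed FFN form: y = W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

def factorize(w_in, w_out, n_experts, seed=0):
    """Permute hidden neurons, then cut them into equal-size experts.

    Each expert keeps the matching rows of W_in and columns of W_out,
    so no weight value is changed, only regrouped.
    """
    n_hidden = len(w_in)
    if n_hidden % n_experts:
        raise ValueError("hidden size must divide evenly into experts")
    perm = list(range(n_hidden))
    random.Random(seed).shuffle(perm)  # assumption: random permutation
    size = n_hidden // n_experts
    experts = []
    for e in range(n_experts):
        idx = perm[e * size:(e + 1) * size]
        experts.append(([w_in[j] for j in idx],
                        [[row[j] for j in idx] for row in w_out]))
    return experts

def moe_ffn(experts, x, active):
    """Sum the outputs of the active experts only."""
    out = [0.0] * len(experts[0][1])
    for e in active:
        out = [a + b for a, b in zip(out, ffn(*experts[e], x))]
    return out

# Tiny demo with made-up sizes and random weights.
rng = random.Random(1)
D_MODEL, D_HIDDEN = 4, 8
W_IN = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_HIDDEN)]
W_OUT = [[rng.uniform(-1, 1) for _ in range(D_HIDDEN)] for _ in range(D_MODEL)]
X = [rng.uniform(-1, 1) for _ in range(D_MODEL)]

EXPERTS = factorize(W_IN, W_OUT, n_experts=4)
DENSE = ffn(W_IN, W_OUT, X)
ALL_ON = moe_ffn(EXPERTS, X, active=range(4))  # matches DENSE up to rounding
```

Passing a shorter `active` list (for example `[2]`) runs only that expert, which is where the compute savings come from; choosing which experts to run is the router's job.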

Key Findings

Large FFN FLOPs can be cut heavily by activating fewer experts.

Numbers: FFN GFLOPs reduced ~75% for 1R4E1K

Practical Use: If you run only 1 of 4 experts per token you can slash FFN compute and lower inference costs; expect attention to become the new compute bottleneck.

Evidence Ref: Section 4.3; Figure 3

Tradeoff: big FLOPs savings with modest accuracy loss.

Numbers: Total compute reduced ~50% and accuracy retained >85% (1R4E2K)

Practical Use: Use a 1R4E2K setup when you want ~half the compute with ~85% of original accuracy on standard NLU tasks.

Evidence Ref: Abstract; Section 4.3; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
FFN GFLOPs reduction	~75% (1R4E1K)	dense FFN (K=N)	about -75% FFN GFLOPs	compute profile (Figure 3)	Section 4.3 reports ~75% FFN GFLOPs reduction for 1R4E1K.	Section 4.3; Figure 3
Total compute reduction	~50% (1R4E2K)	original model compute	about -50% total FLOPs	compute profile (Figure 3)	Section 4.3 states nearly 50% reduction under 1R4E2K while retaining >85% accuracy.	Section 4.3
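As a rough back-of-the-envelope check on the FFN row, the sketch below assumes equal-size experts and ignores router overhead, so the FFN saving is simply 1 - K/N. The mapping to total compute depends on how much of the model's FLOPs sit in the FFNs, which this summary does not state; `ffn_share` is a hypothetical parameter and the value used in the example is made up.

```python
def ffn_flops_reduction(k_active, n_experts):
    """Fraction of FFN FLOPs saved when k of n equal-size experts run per token.

    Assumes equal-size experts and ignores the (small) router cost.
    """
    if not 0 < k_active <= n_experts:
        raise ValueError("need 0 < k_active <= n_experts")
    return 1.0 - k_active / n_experts

def total_flops_reduction(k_active, n_experts, ffn_share):
    """Scale the FFN saving by the FFN's share of total FLOPs.

    ffn_share is a hypothetical input in [0, 1]; attention and other
    layers are assumed unchanged.
    """
    return ffn_share * ffn_flops_reduction(k_active, n_experts)

print(ffn_flops_reduction(1, 4))  # 0.75, the 1R4E1K FFN figure
print(ffn_flops_reduction(2, 4))  # 0.5, FFN-only saving for 1R4E2K
print(total_flops_reduction(1, 4, ffn_share=0.5))  # 0.375 under a made-up 50% share
```

Under this simple model the overall saving is always smaller than the FFN-only saving, which is consistent with the note that attention becomes the next bottleneck.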

What To Try In 7 Days

Take a small LLM (TinyLlama/MobileLlama), permute and split FFN into 4 experts and implement a TopK router.

Train the injected router via PAR using a frozen teacher FFN and a small in-domain dataset (tens of millions of tokens).

Benchmark 1R4E2K and 1R4E1K: measure GFLOPs, latency, and accuracy to pick the tradeoff for production.
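The router-training step can be sketched as follows. This summary does not give the exact PAR loss, so the sketch makes a loud assumption: the pseudo-label for a token is the expert whose output is closest (in squared error) to the frozen teacher FFN output, and a linear router is trained against that label with cross-entropy. All names and numbers here are hypothetical, and the paper's actual formulation may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(teacher_out, expert_outs):
    """Assumed labeling rule: pick the expert closest to the teacher output."""
    errs = [sum((t - o) ** 2 for t, o in zip(teacher_out, out))
            for out in expert_outs]
    return errs.index(min(errs))

def router_logits(router_w, x):
    """Linear router: one weight row per expert."""
    return [sum(w * a for w, a in zip(row, x)) for row in router_w]

def router_loss(router_w, x, label):
    """Cross-entropy between router probabilities and the pseudo-label."""
    return -math.log(softmax(router_logits(router_w, x))[label])

def router_sgd_step(router_w, x, label, lr=0.1):
    """One gradient step; d(loss)/d(logit_e) = p_e - 1[e == label]."""
    probs = softmax(router_logits(router_w, x))
    updated = []
    for e, row in enumerate(router_w):
        grad = probs[e] - (1.0 if e == label else 0.0)
        updated.append([w - lr * grad * a for w, a in zip(row, x)])
    return updated

# Made-up token, teacher output, and per-expert outputs for illustration.
X = [1.0, -0.5, 0.25]
TEACHER_OUT = [0.9, 0.1]
EXPERT_OUTS = [[0.1, 0.0], [0.8, 0.2], [0.0, 0.5], [-0.3, 0.1]]

LABEL = pseudo_label(TEACHER_OUT, EXPERT_OUTS)
ROUTER = [[0.0] * len(X) for _ in EXPERT_OUTS]  # untrained router
LOSS_BEFORE = router_loss(ROUTER, X, LABEL)
ROUTER = router_sgd_step(ROUTER, X, LABEL)
LOSS_AFTER = router_loss(ROUTER, X, LABEL)  # lower than LOSS_BEFORE
```

In a real run the teacher and expert outputs would come from the frozen original FFN and the factorized experts on in-domain tokens, and the update would be batched in a deep-learning framework rather than written by hand.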

Optimization Features

Token Efficiency

Converges with ~30M–50M training tokens, far below pretraining-scale token budgets

Model Optimization

FFN factorization into equal-size experts

Sparse expert activation (MoE-style)

Training Optimization

Prior-Approximate Router (PAR) teacher-student loss

Freeze teacher FFNs and fine-tune only routers and experts

Few-step fine-tuning on small datasets (0.03–0.04% of the data)

Inference Optimization

TopK router to activate K experts per token

Reduced FFN compute via sparse activation
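A TopK routing step can be sketched in a few lines. The renormalized softmax over the selected experts is a common MoE convention and is an assumption here; the summary does not say how FactorLLM weights the selected experts. The function name `top_k_route` is hypothetical.

```python
import math

def top_k_route(logits, k):
    """Return {expert_index: gate_weight} for the k highest-scoring experts.

    Gate weights are a softmax over the selected logits only, so they
    sum to 1 (an assumed convention, not confirmed for FactorLLM).
    """
    if not 0 < k <= len(logits):
        raise ValueError("need 0 < k <= number of experts")
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

# Made-up router scores for one token over four experts.
ROUTE_K2 = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)  # selects experts 1 and 3
ROUTE_K1 = top_k_route([0.1, 2.0, -1.0, 1.5], k=1)  # selects expert 1 only
```

Only the experts returned by the router are evaluated for that token, so K directly sets the FFN compute per token.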

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/zhenwuweihe/FactorLLM

Risks & Boundaries

Limitations

Experiments limited to small LLMs (TinyLlama, MobileLlama); results may differ on larger models.

When FFN compute is reduced, attention becomes the new bottleneck and accuracy can drop (noted in Section 4.3).

When Not To Use

When you require full original accuracy for every task (FactorLLM incurs a relative accuracy drop of up to ~15% in some configs).

If attention layers dominate your compute and you cannot change them, FFN factorization yields limited overall speedup.

Failure Modes

Experts collapse into similar modules if no router training is used, reducing diversity and benefit (Ex0 vs Ex3).

Router allocation instability early in training; needs PAR pseudo-labels to stabilize (Section 4.4).

Core Entities

Models

TinyLlama, MobileLlama, FactorLLM (1R4E2K, 1R4E1K, 1R4E3K variants)

Metrics

Accuracy, GFLOPs (attention, FFN), Relative maintenance (%)

Datasets

Pajama (subset used for training)

Benchmarks

HellaSwag, OpenBookQA, Winogrande, ARC-Easy, ARC-Challenge, BoolQ, PIQA, MMLU

