Add frequency-aware experts to a Mixture-of-Experts Transformer and pretrain to cut forecasting error on public and commercial series

July 9, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.6

Citation Count

0

Authors

Yiwen Liu, Chenyu Zhang, Junjie Song, Siqi Chen, Sun Yin, Zihan Wang, Lingming Zeng, Yuji Cao, Junming Jiao

Why It Matters For Business

MoFE-Time improves short- to long-horizon forecasting on series with periodic patterns. It lowers error relative to a leading MoE baseline (Time‑MoE) while keeping inference cost similar, so operational forecasts (store traffic, sales, energy) can become more accurate without added latency.

Summary TLDR

MoFE-Time adds a Frequency-Time Cell (FTC) inside Mixture-of-Experts (MoE) blocks and uses pretraining → fine-tuning to learn both periodic (frequency) and temporal features. On six public benchmarks and a proprietary NEV-sales dataset it reduces MSE/MAE vs. Time‑MoE (the main baseline) and keeps inference speed comparable.

Problem Statement

Current large time-series models either ignore intrinsic frequency (periodic) structure or convert signals to frequency space outside the model. That leads to suboptimal forecasts and poor cross-dataset transfer when series have varying periodicity and non-stationarity.

Main Contribution

MoFE-Time: integrate a Frequency-Time Cell (FTC) inside each MoE expert to learn frequency and time features jointly.

Adopt a pretraining → fine-tuning workflow (use Time-300B for pretraining) to transfer prior pattern knowledge across datasets.

Introduce RevIN (reversible instance normalization) and temporal aggregation to handle non-stationarity and variable-length series.

Collect NEV-sales, a proprietary daily store-traffic dataset (~330k points across 498 series), to test commercial performance.

Key Findings

MoFE-Time improves average forecast error on six public benchmarks compared to Time‑MoE.

Numbers: Average MSE 0.2755, MAE 0.3226; MSE ↓6.95%, MAE ↓6.02% vs. Time‑MoE

On the proprietary NEV-sales commercial dataset, MoFE-Time outperforms Time‑MoE.

Numbers: MoFE-Time MSE 0.1956 / MAE 0.3284 vs. Time‑MoE MSE 0.2405 / MAE 0.3628

Pretraining provides the largest single ablation benefit; FTC and RevIN also help.

FTC experts learn to concentrate spectral energy at the true harmonics; replacing FTC with a plain feedforward layer weakens both the spectral focus and the prediction accuracy.

MoFE-Time has similar or faster inference time than Time‑MoE with comparable parameter count.
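
The "concentrated energy at true harmonics" finding can be made concrete with a small spectral check. The sketch below is not the paper's analysis code: it is a minimal, dependency-free illustration that measures what fraction of a signal's non-DC spectral energy sits on a fundamental frequency and its first few harmonics. The function names (`dft_power`, `harmonic_energy_share`) and the two test signals are hypothetical, chosen only for illustration.

```python
import cmath
import math


def dft_power(x):
    """Return power |X[k]|^2 for bins k = 0..N//2 using a direct O(N^2) DFT."""
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(coeff) ** 2)
    return power


def harmonic_energy_share(x, fundamental_bin, n_harmonics=3):
    """Fraction of non-DC spectral energy located on the first harmonics."""
    power = dft_power(x)
    total = sum(power[1:])  # skip the DC bin
    bins = [fundamental_bin * h for h in range(1, n_harmonics + 1)]
    on_harmonics = sum(power[b] for b in bins if b < len(power))
    return on_harmonics / total if total > 0 else 0.0


# Hypothetical signal: a period-24 cycle plus its second harmonic, 96 samples.
n = 96
periodic = [math.sin(2 * math.pi * t / 24) + 0.5 * math.sin(2 * math.pi * t / 12)
            for t in range(n)]
# Hypothetical aperiodic contrast: a linear ramp spreads energy over many bins.
ramp = [t / n for t in range(n)]

share_periodic = harmonic_energy_share(periodic, fundamental_bin=n // 24)
share_ramp = harmonic_energy_share(ramp, fundamental_bin=n // 24)
print(f"periodic: {share_periodic:.3f}, ramp: {share_ramp:.3f}")
```

For the periodic signal nearly all energy lands on the harmonic bins, while the ramp leaves only a small share there. A similar check on an expert's learned representation is one way to probe spectral focus, assuming the relevant period is known.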

Results

Average MSE (public benchmarks)

Value: 0.2755 (MoFE-Time)

Baseline: 0.2961 (Time‑MoE, reported as next best)

Average MAE (public benchmarks)

Value: 0.3226 (MoFE-Time)

Baseline: 0.3433 (Time‑MoE)

NEV-sales MSE (commercial dataset)

Value: 0.1956 (MoFE-Time)

Baseline: 0.2405 (Time‑MoE)

NEV-sales MAE (commercial dataset)

Value: 0.3284 (MoFE-Time)

Baseline: 0.3628 (Time‑MoE)

Inference speed

Value: Comparable or faster

Baseline: Time‑MoE (same parameter scale)
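
The relative reductions quoted under Key Findings can be checked directly from the figures in this section. The short script below recomputes them from the rounded values shown here, so the public-benchmark percentages agree with the quoted 6.95% and 6.02% only to within rounding. The NEV-sales percentages are not stated in the summary; they are derived here from the reported MSE and MAE values.

```python
def relative_reduction(baseline, value):
    """Percent reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# (Time-MoE baseline, MoFE-Time value) pairs as reported in this section.
reported = {
    "public MSE": (0.2961, 0.2755),
    "public MAE": (0.3433, 0.3226),
    "NEV-sales MSE": (0.2405, 0.1956),
    "NEV-sales MAE": (0.3628, 0.3284),
}

reductions = {name: relative_reduction(b, v) for name, (b, v) in reported.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower than Time-MoE")
```

On these numbers the commercial dataset shows the larger gap: roughly 18.7% lower MSE and 9.5% lower MAE, against roughly 7% and 6% on the public benchmarks.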

Who Should Care

What To Try In 7 Days

If you use a transformer MoE baseline: add a frequency-aware expert block (FTC) and test on a holdout set.

Apply RevIN at input to reduce non-stationarity before training or fine-tuning.

If you have cross-domain series, pretrain on diverse time-series data or fine-tune a large pretrained checkpoint to capture prior patterns.
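
As a starting point for the RevIN step, the sketch below shows the core idea in plain Python: normalize each input window by its own mean and standard deviation, run the model in the normalized space, then map the forecast back to the original scale. It is a simplified illustration, not the paper's implementation. The published RevIN method also includes learnable affine parameters, which are omitted here, and the function names and the stand-in "model" are hypothetical.

```python
import math


def revin_normalize(window, eps=1e-5):
    """Normalize one input window by its own mean and standard deviation."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)
    normalized = [(v - mean) / std for v in window]
    return normalized, (mean, std)


def revin_denormalize(forecast, stats):
    """Map model outputs back to the original scale of the input window."""
    mean, std = stats
    return [v * std + mean for v in forecast]


# Hypothetical usage with a stand-in "model" that repeats the last value.
window = [100.0, 104.0, 98.0, 110.0, 107.0]
normalized, stats = revin_normalize(window)
forecast_normalized = [normalized[-1]] * 3  # placeholder for a real forecaster
forecast = revin_denormalize(forecast_normalized, stats)
print(forecast)
```

Because the statistics are computed per window, a level shift between training and deployment data changes `stats` but not the normalized shape the model sees, which is the intended effect on non-stationary series.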

Agent Features

Memory

  • pretrained weights (transfer of prior pattern knowledge)

Architectures

  • MoE
  • Transformer-style attention
  • Frequency-Time Cell (FTC)

Optimization Features

Model Optimization

  • sparse MoE experts to focus compute
  • frequency-aware expert structure (FTC) to concentrate spectral energy
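
To illustrate how sparse experts focus compute, the sketch below implements generic top-k routing: a router scores every expert, only the k best are evaluated, and their outputs are mixed with renormalized weights. This is a common MoE pattern and an assumption here; the summary does not specify MoFE-Time's gating rule, the value of k, or whether weights are renormalized. The scalar experts and router logits are hypothetical placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, experts, x, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    output = 0.0
    for i in top:
        output += (probs[i] / norm) * experts[i](x)
    return output, top


# Hypothetical scalar experts; real experts would be FTC or feedforward blocks.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v, lambda v: v * v]
router_logits = [0.1, 2.0, -1.0, 2.0]  # would come from a learned gating layer
y, chosen = route_top_k(router_logits, experts, x=3.0, k=2)
print(y, chosen)
```

Only two of the four experts run for this input, which is why per-token inference cost depends on k rather than on the total number of experts.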

Training Optimization

  • pretraining on Time-300B then one-epoch fine-tuning on target sets
  • use of RevIN to stabilize training across non-stationary windows

Inference Optimization

  • sparse routing keeps inference cost comparable to Time‑MoE
  • model with ~118M parameters, evaluated on an A100 GPU

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • NEV-sales is proprietary; gains on that dataset may not generalize to other commercial series.
  • The paper reports no public code or hyperparameters for full replication beyond optimizer basics.
  • Pretraining is a major driver of the gains; if you cannot pretrain on large, diverse corpora, improvements may be smaller.

When Not To Use

  • When series have no clear periodic components or when FFT-style spectral features are irrelevant.
  • When you cannot pretrain or lack diverse time-series data for transfer learning.

Failure Modes

  • Spectral leakage or poor harmonic separation on short or highly irregular sequences.
  • Reduced benefit on datasets with unstable, non-periodic signals (authors note Exchange-rate is unstable).
  • Proprietary-data overfitting: FTC may latch onto dataset-specific harmonics that don’t generalize.

Core Entities

Models

  • MoFE-Time
  • Time-MoE
  • TimeMixer
  • TimeXer
  • TimesNet
  • Autoformer
  • PatchTST

Metrics

  • MSE
  • MAE
  • inference time
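
For reference, the two error metrics can be computed as follows. This is the standard definition of mean squared error and mean absolute error, shown on a small made-up example. The summary does not say whether the reported values are computed on raw or standardized series, so absolute values may not be comparable across datasets without that detail.

```python
def mse(y_true, y_pred):
    """Mean squared error: penalizes large misses quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average miss in the units of the series."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]
print(mse(y_true, y_pred), mae(y_true, y_pred))
```

Reporting both is useful because MSE is dominated by occasional large errors, while MAE reflects the typical miss.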

Datasets

  • Time-300B
  • ETTh1
  • ETTh2
  • ETTm1
  • ETTm2
  • Weather
  • Exchange
  • NEV-sales

Benchmarks

  • ETTh1
  • ETTh2
  • ETTm1
  • ETTm2
  • Weather
  • Exchange

Context Entities

Models

  • Moment
  • Chronos
  • TimesFM
  • Lag-Llama

Datasets

  • public time series corpora (as used in Time-300B)