Make large Mixture-of-Experts models run faster on edge GPUs by prefetching experts using adjacent-layer gate inputs

February 17, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Why It Matters For Business

Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers. This enables richer, privacy-friendly on-device LLM features with small accuracy trade-offs.

Summary TLDR

Fate is an offloading system for Mixture-of-Experts (MoE) language models that predicts which experts will be needed next by using the gate input from the previous layer (cross-layer prefetch). Combined with a shallow-favoring cache and a popularity-aware INT2/INT4 hybrid quantization, Fate raises prefetch accuracy to ~97%, increases expert hit rates to ~99%, and speeds up end-to-end inference on edge-class GPUs (e.g., up to 4.5× prefill and 4.1× decoding vs naive offload) while keeping accuracy loss small.

Problem Statement

MoE models activate only a few experts per token but have a very large total parameter count, so edge GPUs cannot hold all experts in memory. Offloading experts to CPU memory reduces GPU memory needs but causes I/O stalls when experts are loaded on demand. Existing prefetch methods either have low accuracy (activation-path based) or require extra training (a small predictor model). What is needed is a low-cost, high-accuracy combination of prefetching, caching, and quantization that overlaps I/O with computation and accounts for the differences between prefill and decoding.

Main Contribution

Cross-layer expert prefetch: use gate input from previous layer to predict next-layer experts with high accuracy, without extra training.
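
A minimal sketch of the cross-layer idea, assuming a standard linear gate followed by top-k selection. The function name, the toy weights, and the plain-list matrix layout are hypothetical and chosen for illustration; this summary does not show Fate's actual implementation.

```python
def predict_next_layer_experts(gate_input, next_gate_weights, top_k):
    """Score the NEXT layer's experts using the CURRENT layer's gate input.

    gate_input: list of floats fed to the current layer's gate.
    next_gate_weights: one weight row per expert of the next layer.
    Returns (predicted expert ids, raw scores).
    """
    scores = [sum(w * x for w, x in zip(row, gate_input))
              for row in next_gate_weights]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k], scores


# Toy example: 4 experts in the next layer, hidden size 3.
gate_input = [1.0, 0.0, -1.0]
next_gate_weights = [
    [0.5, 0.1, 0.0],    # expert 0
    [0.0, 0.3, -0.9],   # expert 1
    [-0.2, 0.0, 0.4],   # expert 2
    [0.3, 0.3, -0.1],   # expert 3
]
predicted, scores = predict_next_layer_experts(gate_input, next_gate_weights, top_k=2)
```

Because adjacent gate inputs are reported to be highly similar, the experts ranked this way tend to match what the next layer's gate will select once its real input is available.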

Shallow-favoring expert cache: allocate cache preferentially to early layers (where predictions are weaker) and use ARC eviction to reach very high hit rates.
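
One simple way to express a shallow-favoring allocation is to hand out cache slots layer by layer, starting from layer 0, until the budget runs out. This greedy policy, the function name, and the example sizes are assumptions for illustration; the paper's exact allocation rule is not given in this summary.

```python
def allocate_cache_slots(num_layers, experts_per_layer, budget):
    """Greedily give expert cache slots to the shallowest layers first.

    budget: total number of experts that fit in the GPU-side cache.
    Returns a list with the number of cached experts per layer.
    """
    slots = [0] * num_layers
    remaining = budget
    for layer in range(num_layers):
        if remaining <= 0:
            break
        slots[layer] = min(experts_per_layer, remaining)
        remaining -= slots[layer]
    return slots


# Hypothetical sizes: 6 layers, 60 experts each, room for 200 experts.
plan = allocate_cache_slots(num_layers=6, experts_per_layer=60, budget=200)
```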

Popularity-aware hybrid quantization: store INT4/INT2 variants and transfer INT2 for low-popularity experts during prefill to reduce I/O while bounding accuracy loss.
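
The per-phase precision choice can be sketched as follows, assuming popularity is measured as the number of prompt tokens routed to each expert. The `int2_fraction` knob and the function name are hypothetical; the summary does not state how Fate sets the INT2/INT4 boundary.

```python
def choose_transfer_precision(token_counts, phase, int2_fraction=0.5):
    """Map each expert id to the precision used for its CPU-to-GPU transfer.

    token_counts: dict of expert id -> number of tokens routed to it.
    phase: "prefill" or "decoding".
    int2_fraction: share of least-popular experts sent as INT2 (assumed knob).
    """
    if phase == "decoding":
        # Decoding transfers only INT4 experts.
        return {e: "int4" for e in token_counts}
    ranked = sorted(token_counts, key=lambda e: token_counts[e])  # least popular first
    low_popularity = set(ranked[:int(len(ranked) * int2_fraction)])
    return {e: ("int2" if e in low_popularity else "int4") for e in token_counts}


counts = {0: 120, 1: 3, 2: 45, 3: 8}
prefill_plan = choose_transfer_precision(counts, "prefill")
decode_plan = choose_transfer_precision(counts, "decoding")
```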

Key Findings

Cross-layer prefetch achieves very high prediction accuracy without retraining.

Numbers: 97.15% prefetch accuracy (by transferring experts above 75th-confidence percentile)
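
One possible reading of the percentile rule is sketched below: compute the 75th percentile of the predicted gate scores within a layer and prefetch every expert scoring above it, rather than only the top-k. This interpretation, the function name, and the toy scores are assumptions; the summary does not define exactly how the confidence percentile is computed.

```python
import statistics


def experts_above_p75(scores):
    """Return ids of experts whose predicted score exceeds the 75th percentile."""
    # quantiles(n=4) returns the three quartile cut points; the last is P75.
    _, _, p75 = statistics.quantiles(scores, n=4, method="inclusive")
    return [e for e, s in enumerate(scores) if s > p75]


# Toy predicted scores for 8 experts in one layer.
toy_scores = [0.02, 0.05, 0.30, 0.08, 0.25, 0.10, 0.15, 0.05]
to_prefetch = experts_above_p75(toy_scores)
```

Transferring a slightly larger set than the strict top-k trades extra I/O for a lower chance of a miss.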

Caching shallow layers yields almost complete cache hits.

Numbers: 99.08% expert hit rate when caching layers 0–3

End-to-end speedups are large versus naive offloading and prior prefetching.

Numbers: up to 4.5× prefill and 4.1× decoding vs Load on Demand; up to 1.9× prefill and 2.2× decoding vs EAP

Accuracy impact is small on well-provisioned hardware.

Numbers: average accuracy loss <1% on high-end PC; up to ~3% average loss on low-end PC

Prefill and decoding require different optimizations.

Numbers: prefill activates ~55.9 of 60 experts per layer for Qwen1.5-MoE; decoding activates only the per-token top-k experts
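
A back-of-the-envelope estimate shows why prefill touches nearly every expert. Assume, purely for illustration, that each of N prompt tokens independently selects k of the E experts in a layer uniformly at random. Real routers are not uniform, and N, k, and E are symbols introduced here rather than values from the paper. The expected number of distinct experts activated in the layer is then:

```latex
\mathbb{E}[\text{distinct experts}] = E \left( 1 - \left( 1 - \frac{k}{E} \right)^{N} \right)
```

The term (1 - k/E) is the probability that a single token skips a given expert, so the expression approaches E quickly as N grows. In decoding, N = 1 per step and exactly k experts are activated.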

Results

prefill speed (tokens/s)

Value: up to 804 tokens/s (DeepseekMoE, prompt 512, high-end PC)

Baseline: Load on Demand

decoding speed (tokens/s)

Value: up to 14 tokens/s (Qwen1.5-MoE, high-end PC)

Baseline: Load on Demand

expert prefetch accuracy

Value: 78.79% using previous-layer gate for decoding; 97.15% after transferring experts above 75th-confidence percentile

Baseline: activation path-based methods (lower)

expert cache hit rate

Value: 99.08% when caching shallow layers (layers 0–3)

Baseline: LRU or naive caching

model accuracy loss

Value: avg <1% loss on high-end PC; avg ~3% loss on low-end PC (scores drop <2 points)

Baseline: original BF16 model

Who Should Care

What To Try In 7 Days

Measure cosine similarity of adjacent gate inputs in your MoE to validate cross-layer prefetch potential.
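
A minimal sketch of this measurement, assuming you have already captured the gate input vector of every MoE layer for one token (for example via forward hooks in your framework). The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_layer_similarity(gate_inputs):
    """gate_inputs[i] is the gate input of layer i for one token."""
    return [cosine_similarity(gate_inputs[i], gate_inputs[i + 1])
            for i in range(len(gate_inputs) - 1)]


# Toy data: layers 0 and 1 agree, layer 2 is orthogonal to layer 1.
sims = adjacent_layer_similarity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

Values close to 1.0 across most layer pairs suggest that the previous layer's gate input is a usable proxy for the next layer's routing decision.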

Implement a CPU-side prefetch that clones gate input and predicts next-layer experts to overlap I/O with GPU compute.
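
The overlap pattern can be prototyped with a single background worker before touching any GPU code. The sketch below assumes caller-supplied `predict`, `load_expert`, and `compute_layer` callables, all hypothetical stand-ins for your gate evaluation, host-to-device copy, and layer forward pass.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch_task(layer, gate_input, predict, load_expert):
    """Runs on the worker: predict experts for `layer`, then load them."""
    return [load_expert(layer, e) for e in predict(layer, gate_input)]


def run_layers(gate_inputs, predict, load_expert, compute_layer):
    """Overlap loading of layer i+1's predicted experts with layer i's compute."""
    prefetched = {}
    pending = None
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        for layer, gate_input in enumerate(gate_inputs):
            if pending is not None:
                prefetched[layer] = pending.result()  # wait for the prefetch
                pending = None
            if layer + 1 < len(gate_inputs):
                snapshot = list(gate_input)  # clone so the worker sees a stable copy
                pending = io_pool.submit(prefetch_task, layer + 1, snapshot,
                                         predict, load_expert)
            compute_layer(layer, prefetched.get(layer, []))
    return prefetched


# Toy stand-ins so the control flow can be exercised without a model.
computed = []
result = run_layers(
    gate_inputs=[[1.0], [2.0], [3.0]],
    predict=lambda layer, x: [int(x[0]) % 4, (int(x[0]) + 1) % 4],
    load_expert=lambda layer, e: (layer, e),
    compute_layer=lambda layer, experts: computed.append(layer),
)
```

Layer 0 has nothing prefetched in this sketch, which is one reason a cache that favors shallow layers is a natural complement.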

Prototype popularity-based prefill ordering and store INT2/INT4 variants in CPU to test hybrid quantized transfers.
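
For the ordering part of this experiment, popularity can be counted directly from the router's per-token selections. The sketch assumes most-popular-first ordering, which is a guess exposed as a flag; the summary does not state the direction Fate uses, and the function names are hypothetical.

```python
from collections import Counter


def expert_popularity(routing):
    """routing: one list of selected expert ids per prompt token (single layer)."""
    return Counter(e for token_experts in routing for e in token_experts)


def transfer_order(routing, most_popular_first=True):
    """Order experts for transfer by popularity, breaking ties by expert id."""
    counts = expert_popularity(routing)
    sign = -1 if most_popular_first else 1
    return sorted(counts, key=lambda e: (sign * counts[e], e))


# Four prompt tokens, top-2 routing.
routing = [[0, 2], [2, 3], [2, 0], [1, 2]]
order = transfer_order(routing)
```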

Optimization Features

Token Efficiency

  • increases tokens/sec (prefill/decoding) via caching and overlap
  • reorders computation to maximize I/O-compute overlap

Infra Optimization

  • offload experts to CPU and transfer to GPU on demand
  • design works across PCIe 3.0 and older PCIe setups

Model Optimization

  • store INT4 and INT2 variants of experts for flexible transfers
  • use HQQ quantization for fast quant/dequant

System Optimization

  • clone intermediate state to CPU for parallel prediction
  • offline compute of per-layer timing constraints to bound prefetch
  • memory-budget aware cache sizing
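
The timing constraint can be illustrated with simple arithmetic: the number of experts that can be prefetched "for free" is bounded by how many transfers fit inside one layer's compute time. The function name and all numbers below are illustrative assumptions, not measurements from the paper.

```python
def max_prefetch_experts(layer_compute_us, expert_bytes, bandwidth_bytes_per_us):
    """Upper bound on experts transferable while one layer computes.

    Uses integer microseconds and bytes to avoid floating-point rounding.
    """
    transferable_bytes = layer_compute_us * bandwidth_bytes_per_us
    return transferable_bytes // expert_bytes


# Hypothetical: 4 ms of compute per layer, 16 GB/s link (16,000 bytes per microsecond).
int4_bound = max_prefetch_experts(4000, expert_bytes=8_000_000,
                                  bandwidth_bytes_per_us=16_000)
int2_bound = max_prefetch_experts(4000, expert_bytes=4_500_000,
                                  bandwidth_bytes_per_us=16_000)
```

Smaller quantized experts raise the bound, which is the motivation for sending lower-precision variants when bandwidth is the bottleneck.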

Inference Optimization

  • cross-layer expert prefetch using previous-layer gate inputs
  • popularity-aware ordering of expert transfers in prefill
  • per-phase strategy: INT4 transfers in decoding, mixed INT2/INT4 in prefill
  • shallow-favoring cache allocation (prioritize early layers)
  • ARC eviction to balance recency and frequency
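
ARC (Adaptive Replacement Cache) keeps two LRU lists, one for entries seen once and one for entries seen repeatedly, plus "ghost" lists of recently evicted keys that steer an adaptive split between the two. The sketch below is a compact, simplified version of the published algorithm (integer adaptation step, keys only, no payloads). The class name and the (layer, expert) key format are hypothetical, and this is not Fate's code.

```python
from collections import OrderedDict


class ARCCache:
    def __init__(self, capacity):
        self.c = capacity        # number of resident entries, must be >= 1
        self.p = 0               # adaptive target size of t1
        self.t1 = OrderedDict()  # resident, seen once recently
        self.t2 = OrderedDict()  # resident, seen at least twice
        self.b1 = OrderedDict()  # ghosts evicted from t1
        self.b2 = OrderedDict()  # ghosts evicted from t2

    def _replace(self, hit_in_b2):
        """Evict one resident entry into the matching ghost list."""
        if not self.t1 and not self.t2:
            return
        if self.t1 and (len(self.t1) > self.p
                        or (hit_in_b2 and len(self.t1) == self.p)
                        or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def access(self, key):
        """Touch `key`; return True on a cache hit, False on a miss."""
        if key in self.t1:                      # second touch: promote
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:                      # refresh recency
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                      # ghost hit: favor recency
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                      # ghost hit: favor frequency
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        l1 = len(self.t1) + len(self.b1)        # brand-new key
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[key] = None
        return False


cache = ARCCache(capacity=2)
trace = [(0, 1), (0, 2), (0, 1), (0, 3), (0, 2), (0, 3)]  # (layer, expert) keys
hits = [cache.access(k) for k in trace]
```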

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on high cosine similarity between adjacent gate inputs; may be weaker for other architectures.
  • Evaluations run on two MoE models and two PC setups; generalization to other models/hardware is untested.
  • INT2 transfers can cause modest accuracy drops on low-bandwidth or low-memory machines.
  • System requires storing multiple quantized copies of experts in CPU memory.

When Not To Use

  • If your deployment can fit all experts on-device (no offloading needed).
  • For dense (non-MoE) models where expert routing does not apply.
  • If PCIe or bus bandwidth is extremely poor and overlap gains vanish.

Failure Modes

  • Misprefetch leads to on-demand I/O stalls that negate speed gains.
  • Aggressive INT2 quantization for popular experts causes visible accuracy drops.
  • EAP-style prediction fallback can be slower than naive loading when prediction accuracy is low.

Core Entities

Models

  • Qwen1.5-MoE
  • DeepseekMoE

Metrics

  • prefill speed (tokens/s)
  • decoding speed (tokens/s)
  • Accuracy
  • expert hit rate

Datasets

  • MMLU
  • GSM8K
  • HumanEval
  • ChatGPT-prompts

Benchmarks

  • MMLU
  • GSM8K
  • HumanEval

Context Entities

Models

  • GPT-4 (example)
  • Mixtral
  • DeepSpeed-MoE