Make large Mixture-of-Experts models run faster on edge GPUs by prefetching experts using adjacent-layer gate inputs

February 17, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Experiments on Qwen1.5-MoE and DeepseekMoE across two hardware setups show consistent speedups and small accuracy loss; results are reproducible given similar models and PCIe characteristics.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Why It Matters For Business

Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers—enabling richer, privacy-friendly on-device LLM features with small accuracy trade-offs.

Summary TLDR

Fate is an offloading system for Mixture-of-Experts (MoE) language models that predicts which experts will be needed next by using the gate input from the previous layer (cross-layer prefetch). Combined with a shallow-favoring cache and a popularity-aware INT2/INT4 hybrid quantization, Fate raises prefetch accuracy to ~97%, increases expert hit rates to ~99%, and speeds up end-to-end inference on edge-class GPUs (e.g., up to 4.5× prefill and 4.1× decoding vs naive offload) while keeping accuracy loss small.

Problem Statement

MoE models activate only a few experts per token but have a very large total parameter count, so edge GPUs cannot hold all experts. Offloading experts to CPU memory reduces GPU memory needs but causes I/O stalls when experts are loaded on demand. Existing prefetch methods either have low accuracy (activation-path prediction) or require extra training (a small learned predictor). What is needed is a low-cost, high-accuracy combination of prefetching, caching, and quantization that overlaps I/O with computation and accounts for the differences between the prefill and decoding phases.
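
The stall described above comes down to simple arithmetic: a transfer is hidden only if it finishes within the GPU compute time it overlaps. The following is a minimal sketch of that estimate. The function name `stall_ms` and every number in it are hypothetical illustrations, not measurements from the paper.

```python
def stall_ms(expert_bytes: float, bandwidth_gb_s: float, overlap_compute_ms: float) -> float:
    """Return the time (ms) the GPU waits for one expert transfer.

    expert_bytes: size of one expert's weights as transferred.
    bandwidth_gb_s: effective CPU-to-GPU bandwidth in GB/s (1 GB = 1e9 bytes).
    overlap_compute_ms: GPU compute time available to hide the transfer
        (0 for on-demand loading, because the layer cannot start without the expert).
    """
    transfer_ms = expert_bytes / (bandwidth_gb_s * 1e9) * 1e3
    return max(0.0, transfer_ms - overlap_compute_ms)


# Hypothetical numbers for illustration only (assumed, not from the paper).
on_demand = stall_ms(32e6, 16.0, 0.0)   # nothing overlaps: the full transfer is a stall
prefetched = stall_ms(32e6, 16.0, 3.0)  # transfer issued one layer early, hidden by compute
```

With these assumed values the on-demand load stalls for the full transfer time, while the prefetched load is fully hidden. Smaller transfers (for example, lower-precision experts) shrink `transfer_ms` and make hiding easier.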

Main Contribution

Cross-layer expert prefetch: use gate input from previous layer to predict next-layer experts with high accuracy, without extra training.

Shallow-favoring expert cache: allocate cache preferentially to early layers (where predictions are weaker) and use ARC eviction to reach very high hit rates.
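
The cross-layer prefetch idea can be sketched in a few lines. This sketch assumes the prediction is made by applying the next layer's gate (router) weights to the current layer's gate input and taking the top-k experts; the function names and toy values are hypothetical, and the real system works on GPU tensors rather than Python lists.

```python
def gate_scores(gate_weights, hidden):
    """Router logits: one dot product per expert (one row of gate_weights each)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]


def top_k(scores, k):
    """Indices of the k highest-scoring experts, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def predict_next_layer_experts(next_gate_weights, current_gate_input, k):
    """Cross-layer prediction: feed layer i's gate input to layer i+1's gate."""
    return top_k(gate_scores(next_gate_weights, current_gate_input), k)


# Toy example: 4 experts, hidden size 3 (all values made up for illustration).
next_gate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
gate_in_layer_i = [0.9, 0.2, 0.1]     # gate input observed at layer i
gate_in_layer_next = [0.8, 0.3, 0.1]  # the (similar) gate input later seen at layer i+1

predicted = predict_next_layer_experts(next_gate, gate_in_layer_i, k=2)
actual = top_k(gate_scores(next_gate, gate_in_layer_next), 2)
```

Because the two gate inputs are close, the predicted and actual expert sets agree here. The prediction needs no trained predictor, only the gate weights the model already has.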

Key Findings

Cross-layer prefetch achieves very high prediction accuracy without retraining.

Numbers: 97.15% prefetch accuracy (by transferring experts above 75th-confidence percentile)

Practical Use: You can predict next-layer experts with minimal CPU work and avoid training a separate predictor; use previous-layer gate inputs to prefetch experts.

Evidence Ref: Section 4.2; '97.15%' claim; Figure 4
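
One possible reading of "experts above the 75th-confidence percentile" is sketched below: convert the predicted gate logits to probabilities, find the 75th percentile across experts, and transfer only experts strictly above it. This interpretation, the nearest-rank percentile rule, and the function names are assumptions for illustration; the paper's exact selection rule may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def experts_above_percentile(logits, percentile=75.0):
    """Expert indices whose predicted probability exceeds the given percentile.

    Uses the nearest-rank percentile of the per-expert probabilities (assumed rule).
    """
    probs = softmax(logits)
    ranked = sorted(probs)
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    threshold = ranked[rank - 1]
    return [i for i, p in enumerate(probs) if p > threshold]


# Toy logits for 8 experts (made-up values).
selected = experts_above_percentile([0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0])
```

A higher percentile transfers fewer experts and saves bandwidth, at the cost of more on-demand loads when the prediction misses.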

Caching shallow layers yields almost complete cache hits.

Numbers: 99.08% expert hit rate when caching layers 0-3

Practical Use: Reserve limited GPU memory to fully cache first few MoE layers to eliminate most I/O stalls during inference.

Evidence Ref: Section 4.4; Figure 7 and experiment caching layers 0-3
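
A shallow-favoring allocation can be approximated with a greedy shallow-first fill, sketched below. The function name, the greedy rule, and the model dimensions are hypothetical; ARC eviction for partially cached layers is not implemented here.

```python
def allocate_cache_slots(num_layers, experts_per_layer, total_slots):
    """Greedy shallow-first allocation: fill layer 0, then layer 1, and so on.

    Returns a list with the number of expert slots given to each layer.
    """
    slots = []
    remaining = total_slots
    for _ in range(num_layers):
        take = min(experts_per_layer, remaining)
        slots.append(take)
        remaining -= take
    return slots


# Hypothetical setup: 24 MoE layers, 60 experts per layer, GPU room for 240 experts.
plan = allocate_cache_slots(24, 60, 240)
```

With this assumed budget, layers 0 to 3 are fully cached and deeper layers get no slots, so they depend entirely on prefetch. A real deployment would likely keep a few evictable slots for deeper layers as well.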

Results

Metric: prefill speed (tokens/s)
Value: up to 804 tokens/s (DeepseekMoE, prompt 512, high-end PC)
Baseline: Load on Demand
Delta: up to 4.5× vs Load on Demand; up to 1.9× vs EAP
Split / Dataset: ChatGPT-prompts / varied prompts
Evidence: Section 5.2.1; Figure 11
Evidence Ref: Figure 11, Section 5.2.1

Metric: decoding speed (tokens/s)
Value: up to 14 tokens/s (Qwen1.5-MoE, high-end PC)
Baseline: Load on Demand
Delta: up to 4.1× vs Load on Demand; up to 2.2× vs EAP
Split / Dataset: decoding workloads (input ≤64, output ≤1024)
Evidence: Section 5.2.2; Figure 12
Evidence Ref: Figure 12, Section 5.2.2

What To Try In 7 Days

Measure cosine similarity of adjacent gate inputs in your MoE to validate cross-layer prefetch potential.

Implement a CPU-side prefetch that clones gate input and predicts next-layer experts to overlap I/O with GPU compute.

Prototype popularity-based prefill ordering and store INT2/INT4 variants in CPU to test hybrid quantized transfers.
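
For the first step in the list above, the similarity check itself is short. The sketch below assumes you have already captured, for one token, the gate input vector of each MoE layer (for example with forward hooks); the capture mechanism and the toy vectors are not part of this sketch's claims.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_layer_similarity(gate_inputs):
    """gate_inputs[i] is the gate input of MoE layer i for one token.

    Returns one similarity per adjacent layer pair (i, i+1).
    """
    return [
        cosine_similarity(gate_inputs[i], gate_inputs[i + 1])
        for i in range(len(gate_inputs) - 1)
    ]


# Toy vectors for three layers (made-up values).
captured = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
sims = adjacent_layer_similarity(captured)
```

Averaging these values over many tokens gives a per-layer-pair profile. Consistently high similarity suggests cross-layer prefetch is worth prototyping on your model; low values in some layers suggest caching those layers instead.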

Optimization Features

Token Efficiency
increases tokens/sec (prefill/decoding) via caching and overlap
reorders computation to maximize I/O-compute overlap

Infra Optimization
offload experts to CPU and transfer to GPU on demand
design works across PCIe 3.0 and older PCIe setups

Model Optimization
store INT4 and INT2 variants of experts for flexible transfers
use HQQ quantization for fast quant/dequant

System Optimization
clone intermediate state to CPU for parallel prediction
offline compute of per-layer timing constraints to bound prefetch
memory-budget aware cache sizing

Inference Optimization
cross-layer expert prefetch using previous-layer gate inputs
popularity-aware ordering of expert transfers in prefill
per-phase strategy: INT4 transfers in decoding, mixed INT2/INT4 in prefill
shallow-favoring cache allocation (prioritize early layers)
ARC eviction to balance recency and frequency
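
The per-phase quantization strategy and popularity-aware ordering listed above can be combined into one transfer plan. The sketch below assumes that "popularity" means the number of tokens routed to an expert, that popular experts keep INT4 while the rest use INT2 during prefill, and that a fixed fraction decides the split. The `int4_fraction` parameter and the function name are hypothetical.

```python
import math


def plan_transfers(token_counts, phase, int4_fraction=0.5):
    """Return [(expert_id, precision)] in transfer order, most popular first.

    token_counts: {expert_id: number of tokens routed to that expert}.
    phase: "prefill" or "decoding".
    int4_fraction: share of experts (by popularity rank) sent as INT4 in prefill
        (assumed knob, not a value from the paper).
    """
    order = sorted(token_counts, key=lambda e: token_counts[e], reverse=True)
    if phase == "decoding":
        # Decoding transfers use INT4 only.
        return [(e, "int4") for e in order]
    num_int4 = math.ceil(int4_fraction * len(order))
    return [(e, "int4" if rank < num_int4 else "int2") for rank, e in enumerate(order)]


# Toy routing counts for one layer during prefill (made-up values).
counts = {0: 5, 1: 40, 2: 12, 3: 1}
prefill_plan = plan_transfers(counts, "prefill")
decode_plan = plan_transfers(counts, "decoding")
```

Keeping the most popular experts at INT4 follows the failure mode noted in this summary: aggressive INT2 on popular experts is where accuracy drops become visible.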

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Relies on high cosine similarity between adjacent gate inputs; may be weaker for other architectures.

Evaluations run on two MoE models and two PC setups; generalization to other models/hardware is untested.

When Not To Use

If your deployment can fit all experts on-device (no offloading needed).

For dense (non-MoE) models where expert routing does not apply.

Failure Modes

Misprefetch leads to on-demand I/O stalls that negate speed gains.

Aggressive INT2 quantization for popular experts causes visible accuracy drops.

Core Entities

Models

Qwen1.5-MoE, DeepseekMoE

Metrics

prefill speed (tokens/s), decoding speed (tokens/s), accuracy, expert hit rate

Datasets

MMLU, GSM8K, HumanEval, ChatGPT-prompts

Benchmarks

MMLU, GSM8K, HumanEval

Context Entities

Models

GPT-4 (example), Mixtral, DeepSpeed-MoE