Make large Mixture-of-Experts models run faster on edge GPUs by prefetching experts using adjacent-layer gate inputs

Overview

Decision Snapshot: Ready For Pilot

Experiments on Qwen1.5-MoE and DeepseekMoE across two hardware setups show consistent speedups and small accuracy loss; results are reproducible given similar models and PCIe characteristics.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zhiyuan Fang, Zicong Hong, Yuegui Huang, Yufeng Lyu, Wuhui Chen, Yue Yu, Fan Yu, Zibin Zheng

Links

Abstract / PDF

Why It Matters For Business

Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers—enabling richer, privacy-friendly on-device LLM features with small accuracy trade-offs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

Fate is an offloading system for Mixture-of-Experts (MoE) language models that predicts which experts will be needed next by using the gate input from the previous layer (cross-layer prefetch). Combined with a shallow-favoring cache and a popularity-aware INT2/INT4 hybrid quantization, Fate raises prefetch accuracy to ~97%, increases expert hit rates to ~99%, and speeds up end-to-end inference on edge-class GPUs (e.g., up to 4.5× prefill and 4.1× decoding vs naive offload) while keeping accuracy loss small.

Problem Statement

MoE models activate only a few experts per token but have a very large total parameter count, so edge GPUs cannot hold all experts. Offloading experts to CPU memory reduces GPU memory needs but causes I/O stalls when experts are loaded on demand. Existing prefetch methods either have low accuracy (activation-path prediction) or require extra training (a small learned predictor). What is needed is a low-cost, high-accuracy combination of prefetching, caching, and quantization that overlaps I/O with computation and accounts for the differences between prefill and decoding.

Main Contribution

Cross-layer expert prefetch: use gate input from previous layer to predict next-layer experts with high accuracy, without extra training.

Shallow-favoring expert cache: allocate cache preferentially to early layers (where predictions are weaker) and use ARC eviction to reach very high hit rates.
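The cross-layer idea above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the gate is assumed to be a single linear layer followed by top-k selection, and the names `topk_experts`, `prefetch_next_layer`, and the synthetic weights and inputs are all hypothetical. The sketch feeds layer 0's gate input into layer 1's gate to guess which experts layer 1 will need, then compares the guess with the experts actually selected once layer 1's real gate input is known.

```python
import numpy as np

def topk_experts(gate_weight, gate_input, k):
    """Return the indices of the k highest-scoring experts for one token."""
    logits = gate_weight @ gate_input            # shape: (num_experts,)
    return set(np.argsort(logits)[-k:].tolist())

def prefetch_next_layer(gate_weights, layer, gate_input, k):
    """Predict the experts of layer+1 by passing this layer's gate input
    through the NEXT layer's gate (no extra training involved)."""
    return topk_experts(gate_weights[layer + 1], gate_input, k)

# Synthetic setup (all sizes and values are assumptions for illustration).
rng = np.random.default_rng(0)
hidden, num_experts, k = 16, 8, 2
gate_weights = [rng.standard_normal((num_experts, hidden)) for _ in range(2)]

x0 = rng.standard_normal(hidden)                 # gate input seen at layer 0
x1 = x0 + 0.05 * rng.standard_normal(hidden)     # layer 1 input, assumed similar

predicted = prefetch_next_layer(gate_weights, 0, x0, k)   # available early
actual = topk_experts(gate_weights[1], x1, k)             # known only later
overlap = len(predicted & actual) / k                     # fraction prefetched correctly
```

How well the prediction holds depends on how similar adjacent gate inputs are in the real model, so the overlap measured on synthetic data says nothing about the accuracy reported in the paper.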

Key Findings

Cross-layer prefetch achieves very high prediction accuracy without retraining.

Numbers: 97.15% prefetch accuracy (by transferring experts above 75th-confidence percentile)

Practical Use: You can predict next-layer experts with minimal CPU work and avoid training a separate predictor; use previous-layer gate inputs to prefetch experts.

Evidence Ref: Section 4.2; '97.15%' claim; Figure 4

Caching shallow layers yields almost complete cache hits.

Numbers: 99.08% expert hit rate when caching layers 0–3

Practical Use: Reserve limited GPU memory to fully cache first few MoE layers to eliminate most I/O stalls during inference.

Evidence Ref: Section 4.4; Figure 7 and experiment caching layers 0-3
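One simple way to read "fully cache the first few MoE layers" is a greedy allocation that fills layers from the shallowest upward until the GPU slot budget is spent. The sketch below shows that reading only; the function name `allocate_cache`, the slot-based budget, and the example numbers are hypothetical, and the paper's actual allocation policy may differ.

```python
def allocate_cache(num_layers, experts_per_layer, budget):
    """Assign cache slots (one slot = one expert) to layers, shallowest first.

    Returns a list where entry i is the number of experts of layer i
    that stay resident on the GPU.
    """
    slots = [0] * num_layers
    remaining = budget
    for layer in range(num_layers):
        take = min(experts_per_layer, remaining)
        slots[layer] = take
        remaining -= take
        if remaining == 0:
            break
    return slots

# Hypothetical model: 6 MoE layers, 8 experts each, room for 20 experts on GPU.
plan = allocate_cache(num_layers=6, experts_per_layer=8, budget=20)
# plan == [8, 8, 4, 0, 0, 0]
```

Deeper layers receive no dedicated slots in this sketch because, per the finding above, they are the ones where prefetching is expected to cover most requests.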

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
prefill speed (tokens/s)	up to 804 tokens/s (DeepseekMoE, prompt 512, high-end PC)	Load on Demand	up to 4.5× vs Load on Demand; up to 1.9× vs EAP	ChatGPT-prompts / varied prompts	Section 5.2.1; Figure 11	Figure 11, Section 5.2.1
decoding speed (tokens/s)	up to 14 tokens/s (Qwen1.5-MoE, high-end PC)	Load on Demand	up to 4.1× vs Load on Demand; up to 2.2× vs EAP	decoding workloads (input ≤64, output ≤1024)	Section 5.2.2; Figure 12	Figure 12, Section 5.2.2

What To Try In 7 Days

Measure cosine similarity of adjacent gate inputs in your MoE to validate cross-layer prefetch potential.

Implement a CPU-side prefetch that clones gate input and predicts next-layer experts to overlap I/O with GPU compute.

Prototype popularity-based prefill ordering and store INT2/INT4 variants in CPU to test hybrid quantized transfers.
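The first item in the list above, measuring cosine similarity between adjacent gate inputs, can be sketched as follows. This assumes you have already captured one token's gate input at each MoE layer (for example with forward hooks) and stacked them into an array of shape `(num_layers, hidden)`; the function name `adjacent_cosine` and the toy data are hypothetical.

```python
import numpy as np

def adjacent_cosine(gate_inputs):
    """Cosine similarity between the gate input of layer i and layer i+1.

    gate_inputs: array of shape (num_layers, hidden).
    Returns an array of length num_layers - 1.
    """
    a, b = gate_inputs[:-1], gate_inputs[1:]
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / norms

# Toy example: layers 0 and 1 are identical, layer 2 is orthogonal to layer 1.
toy = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
sims = adjacent_cosine(toy)   # array([1., 0.])
```

Averaging these values over many tokens and prompts gives a quick signal of whether cross-layer prefetch is likely to work on a given model before any prefetch code is written.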

Optimization Features

Token Efficiency

increases tokens/sec (prefill/decoding) via caching and overlap

reorders computation to maximize I/O-compute overlap

Infra Optimization

offload experts to CPU and transfer to GPU on demand

design works across PCIe 3.0 and older PCIe setups

Model Optimization

store INT4 and INT2 variants of experts for flexible transfers

use HQQ quantization for fast quant/dequant

System Optimization

clone intermediate state to CPU for parallel prediction

offline compute of per-layer timing constraints to bound prefetch

memory-budget aware cache sizing
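The per-layer timing constraint can be pictured as a simple budget: the number of experts that fit through the CPU-GPU link while one layer computes. The sketch below is an illustration of that arithmetic, not the paper's procedure; the function name `max_prefetch_experts`, the integer microsecond and byte units, and every number in the example are assumptions.

```python
def max_prefetch_experts(layer_compute_us, expert_bytes, bandwidth_bytes_per_us):
    """Upper bound on experts transferable while one layer is computing.

    Integer units (microseconds, bytes) keep the floor division exact.
    """
    budget_bytes = layer_compute_us * bandwidth_bytes_per_us
    return budget_bytes // expert_bytes

# Hypothetical profile: three layers with different compute times,
# 20 MB per quantized expert, 16 GB/s link (16,000 bytes per microsecond).
compute_times_us = [4000, 1000, 2500]
bounds = [max_prefetch_experts(t, 20_000_000, 16_000) for t in compute_times_us]
# bounds == [3, 0, 2]
```

A bound of zero for a layer means no transfer can be hidden behind that layer's compute in this toy profile, which is the kind of case where a smaller (lower-bit) expert variant or a cache hit has to carry the load.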

Inference Optimization

cross-layer expert prefetch using previous-layer gate inputs

popularity-aware ordering of expert transfers in prefill

per-phase strategy: INT4 transfers in decoding, mixed INT2/INT4 in prefill

shallow-favoring cache allocation (prioritize early layers)

ARC eviction to balance recency and frequency
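The popularity-aware ordering and per-phase precision choice listed above can be sketched in a few lines. This is a hedged illustration: it assumes "popularity" means the number of prefill tokens routed to an expert, and it introduces a hypothetical knob `int4_fraction` for how many of the most popular experts keep INT4 precision. Neither the function names nor the threshold come from the paper.

```python
import math
from collections import Counter

def plan_prefill_transfers(token_expert_ids, int4_fraction=0.5):
    """Order experts by how many tokens route to them and pick a precision.

    The most popular experts are sent first and in INT4; the rest in INT2.
    """
    counts = Counter(token_expert_ids)
    ordered = [expert for expert, _ in counts.most_common()]
    n_int4 = math.ceil(len(ordered) * int4_fraction)
    return [(expert, "int4" if rank < n_int4 else "int2")
            for rank, expert in enumerate(ordered)]

def plan_decode_transfers(expert_ids):
    """Decoding touches few experts per step, so every transfer uses INT4."""
    return [(expert, "int4") for expert in expert_ids]

# Routing decisions for six prefill tokens (expert ids are made up).
prefill_plan = plan_prefill_transfers([3, 3, 3, 1, 1, 7])
# prefill_plan == [(3, 'int4'), (1, 'int4'), (7, 'int2')]
decode_plan = plan_decode_transfers([5, 2])
# decode_plan == [(5, 'int4'), (2, 'int4')]
```

Keeping popular experts at INT4 reflects the failure mode noted under Risks & Boundaries, where aggressive INT2 quantization of popular experts causes visible accuracy drops.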

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on high cosine similarity between adjacent gate inputs; may be weaker for other architectures.

Evaluations run on two MoE models and two PC setups; generalization to other models/hardware is untested.

When Not To Use

If your deployment can fit all experts on-device (no offloading needed).

For dense (non-MoE) models where expert routing does not apply.

Failure Modes

Misprefetch leads to on-demand I/O stalls that negate speed gains.

Aggressive INT2 quantization for popular experts causes visible accuracy drops.

Core Entities

Models

Qwen1.5-MoE

DeepseekMoE

Metrics

prefill speed (tokens/s)

decoding speed (tokens/s)

Accuracy

expert hit rate

Datasets

MMLU

GSM8K

HumanEval

ChatGPT-prompts

Benchmarks

MMLU

GSM8K

HumanEval

Context Entities

Models

GPT-4 (example)

Mixtral

DeepSpeed-MoE
