Overview
Experiments on Qwen1.5-MoE and DeepseekMoE across two hardware setups show consistent speedups and small accuracy loss; results are reproducible given similar models and PCIe characteristics.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers. This enables richer, privacy-friendly on-device LLM features with small accuracy trade-offs.
Summary TLDR
Fate is an offloading system for Mixture-of-Experts (MoE) language models that predicts which experts will be needed next by using the gate input from the previous layer (cross-layer prefetch). Combined with a shallow-favoring cache and a popularity-aware INT2/INT4 hybrid quantization, Fate raises prefetch accuracy to ~97%, increases expert hit rates to ~99%, and speeds up end-to-end inference on edge-class GPUs (e.g., up to 4.5× prefill and 4.1× decoding vs naive offload) while keeping accuracy loss small.
Problem Statement
MoE models activate only a few experts per token but have very large total parameter counts, so edge GPUs cannot hold all experts. Offloading experts to CPU memory reduces GPU memory needs but causes I/O stalls when experts are loaded on demand. Existing prefetch methods either have low accuracy (activation-path) or require extra training (small predictor). What is needed is a low-cost, high-accuracy prefetch, caching, and quantization strategy that overlaps I/O with computation and respects the differences between prefill and decoding.
Main Contribution
Cross-layer expert prefetch: use gate input from previous layer to predict next-layer experts with high accuracy, without extra training.
Shallow-favoring expert cache: allocate cache preferentially to early layers (where predictions are weaker) and use ARC eviction to reach very high hit rates.
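The cross-layer prefetch idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each MoE layer has a linear router whose weight matrix is available as a NumPy array, and the names `predict_next_layer_experts` and `prefetch_accuracy` are hypothetical. The previous layer's gate input is fed to the next layer's router to guess which experts to load early.

```python
import numpy as np

def predict_next_layer_experts(gate_input_prev, next_gate_weight, top_k):
    """Guess the experts layer i+1 will route to, using layer i's gate input.

    gate_input_prev:  (tokens, hidden) gate input captured at layer i.
    next_gate_weight: (num_experts, hidden) router weight of layer i+1
                      (assumed to be a plain linear router).
    """
    # Router logits of the next layer, evaluated on the previous gate input.
    logits = gate_input_prev @ next_gate_weight.T  # (tokens, num_experts)
    # Indices of the top_k largest logits per token (unordered within the top_k).
    top = np.argpartition(-logits, top_k - 1, axis=-1)[:, :top_k]
    # Union over tokens: the set of experts worth prefetching.
    return sorted(set(top.ravel().tolist()))

def prefetch_accuracy(predicted, actual):
    """Fraction of the experts actually used that were prefetched."""
    actual = set(actual)
    return len(actual & set(predicted)) / len(actual)
```

Comparing the predicted set against the experts the next layer really selects gives a per-layer prefetch accuracy that can be checked against the ~97% figure reported in the summary.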
Key Findings
Cross-layer prefetch achieves very high prediction accuracy without retraining.
Caching shallow layers yields almost complete cache hits.
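One simple way to express a shallow-favoring budget is a geometric decay over layer depth. The sketch below is an assumption made for illustration: the summary does not state the actual allocation rule, the `decay` parameter and the function name `allocate_cache_slots` are hypothetical, and the ARC eviction policy mentioned above is not implemented here.

```python
def allocate_cache_slots(num_layers, total_slots, decay=0.8):
    """Split a fixed budget of expert cache slots, favoring shallow layers.

    Layer i receives a share proportional to decay**i (hypothetical rule).
    """
    weights = [decay ** i for i in range(num_layers)]
    scale = total_slots / sum(weights)
    # Round each layer's share down to a whole number of slots.
    slots = [int(w * scale) for w in weights]
    # Hand out the slots lost to rounding, shallowest layer first.
    leftover = total_slots - sum(slots)
    for i in range(leftover):
        slots[i % num_layers] += 1
    return slots
```

With `decay` below 1, early layers always receive at least as many slots as deeper ones, which matches the intent of caching where cross-layer predictions are weakest.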
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| prefill speed (tokens/s) | up to 804 tokens/s (DeepseekMoE, prompt 512, high-end PC) | Load on Demand | up to 4.5× vs Load on Demand; up to 1.9× vs EAP | ChatGPT-prompts / varied prompts | Section 5.2.1; Figure 11 | Figure 11, Section 5.2.1 |
| decoding speed (tokens/s) | up to 14 tokens/s (Qwen1.5-MoE, high-end PC) | Load on Demand | up to 4.1× vs Load on Demand; up to 2.2× vs EAP | decoding workloads (input ≤64, output ≤1024) | Section 5.2.2; Figure 12 | Figure 12, Section 5.2.2 |
What To Try In 7 Days
Measure cosine similarity of adjacent gate inputs in your MoE to validate cross-layer prefetch potential.
Implement a CPU-side prefetch that clones gate input and predicts next-layer experts to overlap I/O with GPU compute.
Prototype popularity-based prefill ordering and store INT2/INT4 variants of each expert in CPU memory to test hybrid quantized transfers.
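The first step in the list above, measuring cosine similarity between adjacent gate inputs, can be sketched as below. It assumes the gate inputs have already been captured (for example with forward hooks) as one `(tokens, hidden)` NumPy array per MoE layer; the function name `adjacent_gate_cosine` is hypothetical.

```python
import numpy as np

def adjacent_gate_cosine(gate_inputs):
    """Mean cosine similarity between gate inputs of consecutive layers.

    gate_inputs: list of arrays, one per MoE layer, each (tokens, hidden).
    Returns one value per adjacent layer pair.
    """
    sims = []
    for prev, nxt in zip(gate_inputs[:-1], gate_inputs[1:]):
        dot = np.sum(prev * nxt, axis=-1)
        norm = np.linalg.norm(prev, axis=-1) * np.linalg.norm(nxt, axis=-1)
        # Guard against division by zero for all-zero rows.
        sims.append(float(np.mean(dot / np.maximum(norm, 1e-12))))
    return sims
```

Values close to 1 for most layer pairs suggest that cross-layer prefetch is likely to transfer to the model; noticeably lower values point to the limitation noted under Risks & Boundaries.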
Risks & Boundaries
Limitations
Relies on high cosine similarity between adjacent gate inputs; may be weaker for other architectures.
Evaluations run on two MoE models and two PC setups; generalization to other models/hardware is untested.
When Not To Use
If your deployment can fit all experts on-device (no offloading needed).
For dense (non-MoE) models where expert routing does not apply.
Failure Modes
Mispredicted prefetches fall back to on-demand loading, and the resulting I/O stalls can negate the speed gains.
Aggressive INT2 quantization for popular experts causes visible accuracy drops.
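A guard against the INT2 failure mode can be sketched as a popularity ranking that keeps the most frequently activated experts at INT4. This is an illustration under stated assumptions: the summary does not give the actual selection rule, and `choose_precisions` and `int4_fraction` are hypothetical names.

```python
import math

def choose_precisions(activation_counts, int4_fraction=0.5):
    """Map each expert id to "int4" or "int2" by activation popularity.

    activation_counts: dict of expert id -> number of activations observed.
    int4_fraction:     share of experts (most popular first) kept at INT4.
    """
    # Most frequently activated experts come first.
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    # Round up so that a nonzero fraction always protects at least one expert.
    num_int4 = math.ceil(len(ranked) * int4_fraction)
    return {
        expert: ("int4" if rank < num_int4 else "int2")
        for rank, expert in enumerate(ranked)
    }
```

Raising `int4_fraction` trades transfer volume for accuracy, so it is a natural knob to sweep when measuring the accuracy drop on a validation set.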

