Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Fate cuts MoE inference latency on edge-class GPUs by combining cross-layer prefetch, targeted caching, and hybrid INT2/INT4 transfers. This enables richer, privacy-friendly on-device LLM features with small accuracy trade-offs.
Summary TLDR
Fate is an offloading system for Mixture-of-Experts (MoE) language models. It predicts which experts the next layer will need from the gate input of the previous layer (cross-layer prefetch). Combined with a shallow-favoring cache and a popularity-aware INT2/INT4 hybrid quantization scheme, Fate raises prefetch accuracy to ~97%, increases expert hit rates to ~99%, and speeds up end-to-end inference on edge-class GPUs (e.g., up to 4.5× prefill and 4.1× decoding vs naive offload) while keeping accuracy loss small.
Problem Statement
MoE models activate only a few experts per token but have a very large total parameter count, so edge GPUs cannot hold all experts. Offloading experts to CPU memory reduces GPU memory needs but causes I/O stalls when experts are loaded on demand. Existing prefetch methods either have low accuracy (activation-path approaches) or require extra training (small-predictor approaches). What is needed is a low-cost, high-accuracy strategy for prefetch, caching, and quantization that overlaps I/O with computation and respects the differences between prefill and decoding.
Main Contribution
Cross-layer expert prefetch: use gate input from previous layer to predict next-layer experts with high accuracy, without extra training.
Shallow-favoring expert cache: allocate cache preferentially to early layers (where predictions are weaker) and use ARC (Adaptive Replacement Cache) eviction to reach very high hit rates.
Popularity-aware hybrid quantization: store INT4/INT2 variants and transfer INT2 for low-popularity experts during prefill to reduce I/O while bounding accuracy loss.
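The cross-layer prefetch idea can be illustrated with a minimal sketch: the gate input seen at one layer is passed through the next layer's router weights, and the top-k scoring experts are prefetched. The function names, the toy weights, and the use of plain dot-product logits are assumptions made for illustration; they are not taken from the Fate implementation.

```python
def top_k_experts(gate_input, gate_weights, k):
    """Return the indices of the k experts with the highest router logits.

    gate_input   : list of floats (hidden vector fed to a router)
    gate_weights : one weight row per expert (hypothetical router layout)
    """
    logits = [sum(w * x for w, x in zip(row, gate_input)) for row in gate_weights]
    ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
    return ranked[:k]


def prefetch_accuracy(predicted, actual):
    """Fraction of the experts actually routed to that had been prefetched."""
    actual_set = set(actual)
    if not actual_set:
        return 1.0
    return len(set(predicted) & actual_set) / len(actual_set)


# Toy router for layer i+1 with 4 experts and a hidden size of 3 (made-up values).
next_layer_router = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]

# Prediction uses the gate input observed at layer i ...
predicted = top_k_experts([1.0, 0.0, 0.5], next_layer_router, k=2)
# ... while the real routing uses the (similar) gate input at layer i+1.
actual = top_k_experts([0.9, 0.1, 0.6], next_layer_router, k=2)
accuracy = prefetch_accuracy(predicted, actual)
```

Because the two gate inputs are close, the predicted and actual expert sets coincide in this toy case. In a real system the prediction would run while layer i is still computing, so the transfer overlaps with GPU work.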
Key Findings
Cross-layer prefetch achieves very high prediction accuracy without retraining.
Caching shallow layers yields almost complete cache hits.
End-to-end speedups are large versus naive offloading and prior prefetching.
Accuracy impact is small on well-provisioned hardware.
Prefill and decoding require different optimizations.
Results
prefill speed (tokens/s)
decoding speed (tokens/s)
Accuracy
expert cache hit rate
Who Should Care
What To Try In 7 Days
Measure cosine similarity of adjacent gate inputs in your MoE to validate cross-layer prefetch potential.
Implement a CPU-side prefetch that clones gate input and predicts next-layer experts to overlap I/O with GPU compute.
Prototype popularity-based prefill ordering and store INT2/INT4 variants in CPU memory to test hybrid quantized transfers.
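For the first step above, a minimal sketch of the similarity measurement is shown here. It assumes you have already captured the gate input vector at every MoE layer for a single token (for example with forward hooks); the capture step and the helper names are hypothetical and not part of Fate.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_layer_similarity(gate_inputs):
    """gate_inputs[i] is the gate input at layer i for one token.

    Returns one similarity value per pair of adjacent layers.
    """
    return [
        cosine_similarity(gate_inputs[i], gate_inputs[i + 1])
        for i in range(len(gate_inputs) - 1)
    ]


# Made-up gate inputs for three layers of one token.
captured = [
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.6],
    [0.8, 0.2, 0.6],
]
similarities = adjacent_layer_similarity(captured)
```

Averaging these values over many tokens, layer by layer, gives a quick indication of whether the previous layer's gate input is a usable proxy for the next layer's routing in your model.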
Optimization Features
Token Efficiency
- increases tokens/sec (prefill/decoding) via caching and overlap
- reorders computation to maximize I/O-compute overlap
Infra Optimization
- offload experts to CPU and transfer to GPU on demand
- design works across PCIe 3.0 and older PCIe setups
Model Optimization
- store INT4 and INT2 variants of experts for flexible transfers
- use HQQ quantization for fast quant/dequant
System Optimization
- clone intermediate state to CPU for parallel prediction
- offline compute of per-layer timing constraints to bound prefetch
- memory-budget aware cache sizing
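Memory-budget aware, shallow-favoring cache sizing could be sketched as follows. The summary does not state Fate's actual allocation rule, so the geometric `decay` weighting and the greedy slot-by-slot assignment below are assumptions chosen only to show the idea of giving early layers more cache slots within a fixed budget.

```python
def shallow_favoring_slots(total_slots, num_layers, experts_per_layer, decay=0.8):
    """Split a budget of expert cache slots across layers, favoring shallow layers.

    Layer l gets weight decay**l (an assumed weighting, not Fate's rule). Slots are
    handed out one at a time to the layer with the highest weight per slot already
    requested, and no layer receives more slots than it has experts.
    """
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    weights = [decay ** layer for layer in range(num_layers)]
    slots = [0] * num_layers
    # Never allocate more slots than there are experts in total.
    budget = min(total_slots, num_layers * experts_per_layer)
    for _ in range(budget):
        open_layers = [l for l in range(num_layers) if slots[l] < experts_per_layer]
        best = max(open_layers, key=lambda l: weights[l] / (slots[l] + 1))
        slots[best] += 1
    return slots


# Example: 10 cache slots over 4 layers with 8 experts each.
allocation = shallow_favoring_slots(10, 4, 8, decay=0.5)
```

The returned list never increases with depth, so layer 0 always holds at least as many cached experts as any deeper layer. Converting a byte budget into `total_slots` depends on the expert size at the cached precision.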
Inference Optimization
- cross-layer expert prefetch using previous-layer gate inputs
- popularity-aware ordering of expert transfers in prefill
- per-phase strategy: INT4 transfers in decoding, mixed INT2/INT4 in prefill
- shallow-favoring cache allocation (prioritize early layers)
- ARC eviction to balance recency and frequency
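The per-phase precision choice and popularity-aware ordering listed above can be sketched as a small planning function. The cutoff value, the "most popular first" ordering, and all names here are assumptions for illustration; the summary only states that low-popularity experts are transferred as INT2 during prefill and that decoding uses INT4.

```python
from collections import Counter


def plan_transfers(routed_experts, phase, int2_share_cutoff=0.2):
    """Decide transfer order and precision for the experts of one layer.

    routed_experts    : one expert id per routed token
    phase             : "prefill" or "decoding"
    int2_share_cutoff : hypothetical threshold; experts receiving a smaller
                        share of tokens than this are sent as INT2 in prefill
    """
    if phase not in ("prefill", "decoding"):
        raise ValueError("phase must be 'prefill' or 'decoding'")
    counts = Counter(routed_experts)
    total = sum(counts.values())
    plan = []
    # Assumed ordering: most popular experts first, ties broken by expert id.
    for expert, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        low_popularity = n / total < int2_share_cutoff
        precision = "int2" if phase == "prefill" and low_popularity else "int4"
        plan.append((expert, precision))
    return plan


# Ten prompt tokens routed to three experts (made-up routing).
routing = [3, 3, 3, 3, 3, 1, 1, 1, 1, 7]
prefill_plan = plan_transfers(routing, "prefill")
decoding_plan = plan_transfers(routing, "decoding")
```

In this toy routing, expert 7 serves a single token, so it is the only one sent as INT2 during prefill; in decoding every expert stays at INT4. The cutoff is the knob that trades I/O volume against accuracy and would need tuning per model.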
Reproducibility
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Relies on high cosine similarity between adjacent gate inputs; may be weaker for other architectures.
- Evaluations run on two MoE models and two PC setups; generalization to other models/hardware is untested.
- INT2 transfers can cause modest accuracy drops on low-bandwidth or low-memory machines.
- System requires storing multiple quantized copies of experts in CPU memory.
When Not To Use
- If your deployment can fit all experts on-device (no offloading needed).
- For dense (non-MoE) models where expert routing does not apply.
- If PCIe or bus bandwidth is extremely poor and overlap gains vanish.
Failure Modes
- Misprefetch leads to on-demand I/O stalls that negate speed gains.
- Aggressive INT2 quantization for popular experts causes visible accuracy drops.
- EAP-style prediction fallback can be slower than naive loading when prediction accuracy is low.
Core Entities
Models
- Qwen1.5-MoE
- DeepSeekMoE
Metrics
- prefill speed (tokens/s)
- decoding speed (tokens/s)
- Accuracy
- expert hit rate
Datasets
- MMLU
- GSM8K
- HumanEval
- ChatGPT-prompts
Benchmarks
- MMLU
- GSM8K
- HumanEval
Context Entities
Models
- GPT-4 (example)
- Mixtral
- DeepSpeed-MoE

