Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.
Summary TLDR
This paper presents HAQ, a practical pipeline that (1) combines adaptive activation smoothing with Hessian-informed weight quantization to jointly quantize weights and activations to 8 bits, and (2) pairs that compression with CPU–GPU expert offloading, predictor-driven routing, an LRU GPU expert cache, and a two-stage expert placement policy. On the OPT series and Mixtral-8×7B (Wikitext2/C4), HAQ reaches near-FP16 perplexity while reducing GPU memory by ~60% and improving expert hit-rate stability versus baselines.
Problem Statement
MoE models are hard to run on edge hardware because (1) activation outliers break low-bit quantization accuracy, and (2) dynamic expert selection creates memory and transfer bottlenecks between CPU and GPU, causing high and variable latency.
Main Contribution
Hessian-Aware Quantization (HAQ): an adaptive activation smoothing plus Hessian-based weight quantizer for joint W8A8 quantization.
Precision-heterogeneous deployment: store/compute experts across CPU and GPU with dequantization-on-load to reduce runtime overhead.
CPU–GPU collaborative runtime: a lightweight latency predictor, LRU GPU expert cache, and a two-stage (path coverage + per-layer supplementation) expert placement strategy that balances hit rate and stability.
Key Findings
HAQ matches full-precision perplexity closely on Mixtral-8×7B
GPU memory usage falls substantially after quantization and heterogeneous placement
Two-stage expert placement (Scheme 3) keeps hit-rate high while stabilizing per-layer balance
Results
Perplexity (Mixtral-8×7B, Wikitext2)
Perplexity (Mixtral-8×7B, C4)
GPU memory usage
Expert hit-rate and stability (128 experts)
Inference latency std dev
Who Should Care
What To Try In 7 Days
Run HAQ-style W8A8 quantization on a small MoE model using LLMC to measure PPL vs FP16 on your data.
Profile expert activation paths on representative inputs and compute per-expert frequency histograms.
Implement a simple CPU vs GPU cost predictor (compare transfer time vs CPU compute) and test CPU-first default for single-token decoding.
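The cost predictor in the last step can be sketched as below. This is a minimal illustration, not the paper's predictor: `HardwareProfile`, `expert_flops`, and `choose_device` are hypothetical names, the throughput numbers are placeholders to be measured on your own machine, and the three-matrix gated FFN is an assumption about the expert layout. A pure FLOP model also ignores memory-bandwidth limits, so treat the output as a first estimate.

```python
from dataclasses import dataclass


@dataclass
class HardwareProfile:
    """Hypothetical hardware numbers; measure these on the target machine."""
    pcie_bytes_per_s: float  # effective host-to-device bandwidth
    cpu_flops_per_s: float   # sustained CPU throughput on expert GEMMs
    gpu_flops_per_s: float   # sustained GPU throughput on expert GEMMs


def expert_flops(num_tokens, hidden, ffn, num_mats=3):
    """Approximate FLOPs for one expert FFN (2 * tokens * hidden * ffn per matrix)."""
    return 2.0 * num_tokens * hidden * ffn * num_mats


def choose_device(num_tokens, expert_bytes, hidden, ffn, hw, on_gpu=False):
    """Return 'gpu' or 'cpu', whichever is estimated to finish the expert call sooner."""
    flops = expert_flops(num_tokens, hidden, ffn)
    gpu_time = flops / hw.gpu_flops_per_s
    if not on_gpu:
        # Cache miss: the expert weights must cross PCIe before the GPU can run.
        gpu_time += expert_bytes / hw.pcie_bytes_per_s
    cpu_time = flops / hw.cpu_flops_per_s
    return "gpu" if gpu_time <= cpu_time else "cpu"


# Illustrative numbers only: an INT8 expert with hidden=4096, ffn=14336.
hw = HardwareProfile(pcie_bytes_per_s=16e9, cpu_flops_per_s=1e12, gpu_flops_per_s=50e12)
expert_bytes = 3 * 4096 * 14336
decode_choice = choose_device(1, expert_bytes, 4096, 14336, hw)
prefill_choice = choose_device(512, expert_bytes, 4096, 14336, hw)
cached_choice = choose_device(1, expert_bytes, 4096, 14336, hw, on_gpu=True)
```

With these placeholder numbers, a single-token decode step for an uncached expert is routed to the CPU because the transfer dominates, while a 512-token prefill batch amortizes the transfer and goes to the GPU.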
Optimization Features
Token Efficiency
- prefill vs decoding-aware offloading decisions (batch-size aware)
Infra Optimization
- reduce VRAM usage ~60% to fit more experts on consumer GPUs
Model Optimization
- Hessian-aware weight quantization (GPTQ-like row-wise quantization with error compensation)
- adaptive activation smoothing (data-driven grid search)
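One way to picture the data-driven grid search for activation smoothing is a SmoothQuant-style migration strength `alpha` chosen per layer on calibration data. The sketch below assumes that formulation, which may differ from the paper's; `quantize_int8`, `smooth`, and `search_alpha` are hypothetical helpers, and the quantizer is a simple symmetric per-tensor one.

```python
import numpy as np


def quantize_int8(t):
    """Symmetric per-tensor INT8 fake-quantization (quantize, then dequantize)."""
    scale = np.abs(t).max() / 127.0
    if scale == 0.0:
        return t.copy()
    return np.clip(np.round(t / scale), -127, 127) * scale


def smooth(x, w, alpha):
    """Shift activation range into the weights with per-input-channel factors s."""
    act_max = np.maximum(np.abs(x).max(axis=0), 1e-8)  # shape (c_in,)
    w_max = np.maximum(np.abs(w).max(axis=1), 1e-8)    # shape (c_in,)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    # (x / s) @ (w * s) equals x @ w before quantization.
    return x / s, w * s[:, None]


def search_alpha(x, w, grid=np.linspace(0.0, 1.0, 11)):
    """Pick the alpha with the lowest W8A8 output error on calibration data."""
    ref = x @ w
    best_alpha, best_err = None, np.inf
    for alpha in grid:
        xs, ws = smooth(x, w, alpha)
        err = float(np.mean((quantize_int8(xs) @ quantize_int8(ws) - ref) ** 2))
        if err < best_err:
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err


# Synthetic calibration batch with one outlier activation channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
x[:, 3] *= 50.0
w = rng.normal(size=(16, 8))
best_alpha, best_err = search_alpha(x, w)
```

Because the search scores each candidate on the layer's actual output error, the chosen `alpha` adapts to how severe the activation outliers are in the calibration set.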
System Optimization
- predictor-based dynamic offloading
- two-stage expert placement (path coverage + per-layer supplements)
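The two-stage placement idea can be sketched as follows. This is one possible reading of "path coverage + per-layer supplements", not the paper's algorithm: `two_stage_placement`, `stage1_budget`, and the top-1 path representation (one expert id per layer) are assumptions made for illustration.

```python
from collections import Counter


def two_stage_placement(paths, num_layers, budget, stage1_budget):
    """Choose which experts to keep on the GPU.

    paths: profiled activation paths, each a tuple with one expert id per layer.
    budget: total number of experts that fit on the GPU.
    stage1_budget: share of the budget spent on covering whole paths.
    """
    placed = [set() for _ in range(num_layers)]
    used = 0

    # Stage 1: pin the experts of the most frequent complete paths.
    for path, _count in Counter(paths).most_common():
        new = [(layer, e) for layer, e in enumerate(path) if e not in placed[layer]]
        if used + len(new) > stage1_budget:
            continue
        for layer, e in new:
            placed[layer].add(e)
        used += len(new)

    # Stage 2: top up the least-covered layer with its most frequent missing expert.
    freq = [Counter(p[layer] for p in paths) for layer in range(num_layers)]
    while used < budget:
        open_layers = [l for l in range(num_layers)
                       if any(e not in placed[l] for e in freq[l])]
        if not open_layers:
            break
        layer = min(open_layers, key=lambda l: len(placed[l]))
        expert = next(e for e, _ in freq[layer].most_common() if e not in placed[layer])
        placed[layer].add(expert)
        used += 1
    return placed


# Toy profile: two layers, four distinct paths with different frequencies.
paths = [(0, 1)] * 5 + [(2, 1)] * 3 + [(0, 3)] * 2 + [(4, 5)]
placement = two_stage_placement(paths, num_layers=2, budget=5, stage1_budget=3)
```

Stage 1 favors hit rate on common token routes, while stage 2 spends the remaining slots on whichever layer has the fewest resident experts, which is one way to keep per-layer coverage balanced.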
Inference Optimization
- INT8 low-precision GEMM on GPU
- precision-heterogeneous storage (INT8 on CPU, INT8 on GPU, dequantize-on-load)
- LRU GPU expert cache
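An LRU expert cache of the kind listed above can be sketched with the standard library. `ExpertLRUCache` and `load_fn` are hypothetical names; in a real runtime `load_fn` would copy (and possibly dequantize) an expert's weights to the GPU, whereas here it is a stand-in that returns a string.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn          # called on a miss to fetch the expert
        self._slots = OrderedDict()     # key -> expert, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._slots:
            self._slots.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._slots[key]
        self.misses += 1
        if len(self._slots) >= self.capacity:
            self._slots.popitem(last=False)  # drop the least recently used expert
        expert = self.load_fn(key)
        self._slots[key] = expert
        return expert

    def __contains__(self, key):
        return key in self._slots

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Keys are (layer, expert_id) pairs; the loader is a placeholder.
cache = ExpertLRUCache(capacity=2, load_fn=lambda key: f"weights{key}")
for key in [(0, 1), (0, 2), (0, 1), (0, 3), (0, 2)]:
    cache.get(key)
```

Tracking `hit_rate` alongside the cache makes it easy to spot thrashing when the live workload drifts away from the offline activation statistics.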
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments focus on OPT series and Mixtral-8×7B and two text datasets; results may vary on other models or modalities.
- Runtime gains depend on CPU compute and PCIe bandwidth; slow CPUs or different interconnects can invert the predictor decision.
- The paper provides no public code or end-to-end benchmarks, which limits immediate reproducibility.
When Not To Use
- If the whole model already fits in GPU VRAM and no CPU assistance is needed, the CPU–GPU scheme adds complexity with little benefit.
- When strict worst-case single-token latency is required and unpredictable cache misses are unacceptable.
- For non-MoE dense models where expert routing logic does not apply.
Failure Modes
- Activation distributions with extreme, dataset-specific outliers could still hurt joint quantization if calibration data is unrepresentative.
- Predictor misestimation can choose CPU computation when transfer-to-GPU would be better (or vice versa), causing latency spikes.
- Cache thrashing: when the real workload differs from the offline activation statistics, frequent transfers degrade performance.
Core Entities
Models
- OPT-6.7B
- OPT-13B
- OPT-30B
- Mixtral-8×7B
Metrics
- Perplexity (PPL)
- GPU memory usage
- Expert hit rate
- Inference latency (mean and std dev)
Datasets
- Wikitext2
- C4
Context Entities
Models
- LLaMA (referenced)
- other MoE references (EdgeMoE, Fiddler)
Metrics
- Perplexity
Datasets
- Wikitext2
- C4

