Overview
The method combines established Hessian-based PTQ ideas with adaptive activation smoothing and a practical CPU–GPU runtime. Experiments on multiple models and datasets support the claims, but code is not provided, and the results depend on hardware specifics, which limits portability.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 5/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.
Summary TLDR
This paper presents HAQ, a practical pipeline that (1) combines adaptive activation smoothing with Hessian-informed weight quantization to jointly quantize weights and activations to 8 bits, and (2) pairs that compression with CPU–GPU expert offloading, predictor-driven routing, an LRU cache of GPU-resident experts, and a two-stage expert placement policy. On the OPT series and Mixtral-8×7B (Wikitext2 and C4), HAQ reaches near-FP16 perplexity while reducing GPU memory by ~60% and improving expert hit-rate stability relative to baselines.
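The LRU GPU expert cache mentioned above can be pictured with a minimal sketch. The class name `ExpertLRUCache`, its methods, and the placeholder "weights" string are hypothetical and stand in for a real host-to-device copy; the paper's actual cache and placement policy are not reproduced here.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Tracks which expert IDs are resident on the GPU (hypothetical helper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._resident = OrderedDict()  # expert_id -> placeholder for weights
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Return True on a hit; on a miss, evict the LRU entry and load."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # drop least recently used
        # Assumption: a real system would copy expert weights to the GPU here.
        self._resident[expert_id] = f"weights-of-expert-{expert_id}"
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def resident_ids(self):
        """Expert IDs ordered from least to most recently used."""
        return list(self._resident)


cache = ExpertLRUCache(capacity=2)
for expert_id in [0, 1, 0, 2, 1]:
    cache.access(expert_id)
print(cache.resident_ids(), cache.hit_rate())
```

Replaying a recorded routing trace through a structure like this is one way to estimate the hit rate for a given GPU expert budget before committing to a placement policy.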
Problem Statement
MoE models are hard to run on edge hardware because (1) activation outliers break low-bit quantization accuracy, and (2) dynamic expert selection creates memory and transfer bottlenecks between CPU and GPU, causing high and variable latency.
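To make the first point concrete, the sketch below shows how a single outlier degrades symmetric per-tensor int8 quantization. The data is synthetic and the helper names are illustrative; it does not reproduce the paper's quantizer.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization; returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def mean_abs_error(values):
    """Mean absolute round-trip error after quantize then dequantize."""
    codes, scale = quantize_int8(values)
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)


# Synthetic activations: 64 values in [-1, 1), then the same values plus one outlier.
typical = [((i * 37) % 64 - 32) / 32.0 for i in range(64)]
with_outlier = typical + [60.0]

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(f"no outlier:   {err_typical:.5f}")
print(f"with outlier: {err_outlier:.5f}")
```

The outlier stretches the quantization scale, so the small values that make up most of the tensor collapse onto only a few integer codes and their round-trip error grows sharply.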
Main Contribution
Hessian-Aware Quantization (HAQ): an adaptive activation smoothing plus Hessian-based weight quantizer for joint W8A8 quantization.
Precision-heterogeneous deployment: store/compute experts across CPU and GPU with dequantization-on-load to reduce runtime overhead.
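The activation-smoothing idea can be sketched as follows. This is a SmoothQuant-style per-channel rescaling with a fixed `alpha`, used here as an assumption: the summary does not give the paper's adaptive rule, and the Hessian-based weight quantizer is not shown. All function names are illustrative.

```python
def channel_scales(X, W, alpha=0.5):
    """Per-input-channel scales s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        if act_max == 0 or w_max == 0:
            scales.append(1.0)  # leave degenerate channels untouched
        else:
            scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def smooth(X, W, scales):
    """Divide activation columns and multiply weight rows by the same scales."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


def matmul(A, B):
    """Plain-Python matrix product, enough for this toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy data: input channel 1 carries the activation outliers.
X = [[0.5, 40.0], [-0.3, -35.0]]   # shape (tokens, in_features)
W = [[0.2, -0.1], [0.05, 0.04]]    # shape (in_features, out_features)

scales = channel_scales(X, W)
X_s, W_s = smooth(X, W, scales)

print("max |X| before:", max(abs(x) for row in X for x in row))
print("max |X| after: ", max(abs(x) for row in X_s for x in row))
```

Because each activation column is divided by the same factor that multiplies the matching weight row, the product `X_s @ W_s` equals `X @ W` up to floating-point error, while the activation range that the 8-bit quantizer has to cover shrinks.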
Key Findings
HAQ stays within about 0.03 perplexity of FP16 on Mixtral-8×7B (Wikitext2 and C4)
GPU memory usage falls by roughly 60% after quantization and heterogeneous placement
Results
| Metric | HAQ | FP16 (baseline) | Delta (HAQ - FP16) | Dataset | Evidence |
|---|---|---|---|---|---|
| Perplexity, Mixtral-8×7B | 3.864 | 3.840 | +0.024 | Wikitext2 | Table II |
| Perplexity, Mixtral-8×7B | 7.427 | 7.401 | +0.026 | C4 | Table II |
What To Try In 7 Days
Run HAQ-style W8A8 quantization on a small MoE model using LLMC to measure PPL vs FP16 on your data.
Profile expert activation paths on representative inputs and compute per-expert frequency histograms.
Implement a simple CPU-vs-GPU cost predictor (compare the expert transfer time plus GPU compute against CPU compute) and test a CPU-first default for single-token decoding.
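The profiling and predictor steps above can be prototyped in a few lines. Everything here is a sketch under stated assumptions: the function names are hypothetical, the routing trace is made up, and the byte counts, bandwidth, and timings are placeholder numbers to be replaced with your own measurements.

```python
from collections import Counter


def expert_histogram(routing_trace):
    """Count how often each expert ID appears in a trace of per-token top-k picks."""
    counts = Counter()
    for chosen in routing_trace:
        counts.update(chosen)
    return counts


def choose_device(expert_bytes, pcie_bytes_per_s, gpu_compute_s, cpu_compute_s, on_gpu):
    """Return 'gpu' or 'cpu' by comparing estimated latencies in seconds.

    Ties go to the CPU, which gives a CPU-first default.
    """
    transfer_s = 0.0 if on_gpu else expert_bytes / pcie_bytes_per_s
    gpu_total_s = transfer_s + gpu_compute_s
    return "gpu" if gpu_total_s < cpu_compute_s else "cpu"


# Made-up trace: each tuple holds the experts selected for one token (top-2).
trace = [(0, 3), (0, 5), (3, 0), (1, 0)]
hist = expert_histogram(trace)
print(hist.most_common(2))

# Placeholder numbers: 200 MB expert, 16 GB/s link, 0.5 ms GPU vs 6 ms CPU compute.
print(choose_device(200e6, 16e9, 0.0005, 0.006, on_gpu=False))
print(choose_device(200e6, 16e9, 0.0005, 0.006, on_gpu=True))
```

With these placeholder values the transfer alone takes 12.5 ms, so an expert that is not already on the GPU is cheaper to run on the CPU, while a GPU-resident expert stays on the GPU.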
Risks & Boundaries
Limitations
Experiments focus on OPT series and Mixtral-8×7B and two text datasets; results may vary on other models or modalities.
Runtime gains depend on CPU compute and PCIe bandwidth; slow CPUs or different interconnects can invert the predictor decision.
When Not To Use
If your GPU has enough VRAM to hold the full model without CPU assistance, the CPU–GPU scheme adds complexity with little benefit.
When strict worst-case single-token latency is required and unpredictable cache misses are unacceptable.
Failure Modes
Activation distributions with extreme, dataset-specific outliers could still hurt joint quantization if calibration data is unrepresentative.
Predictor misestimation can choose CPU computation when transfer-to-GPU would be better (or vice versa), causing latency spikes.

