Joint Hessian-aware 8-bit quantization plus CPU–GPU expert scheduling for MoE edge deployment

August 10, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method combines established Hessian-based post-training quantization (PTQ) ideas with adaptive activation smoothing and a practical CPU–GPU runtime. Experiments on multiple models and datasets support the claims, but code is not provided and hardware specifics affect portability.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang

Links

Abstract / PDF

Why It Matters For Business

HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.

Summary TLDR

This paper presents HAQ, a practical pipeline that (1) combines adaptive activation smoothing with Hessian-informed weight quantization to jointly quantize weights and activations to 8 bits, and (2) pairs that compression with CPU–GPU expert offloading, predictor-driven routing, an LRU GPU expert cache, and a two-stage expert placement policy. On the OPT series and Mixtral-8×7B (Wikitext2/C4), HAQ reaches near-FP16 perplexity while reducing GPU memory by ~60% and improving expert hit-rate stability versus baselines.

Problem Statement

MoE models are hard to run on edge hardware because (1) activation outliers break low-bit quantization accuracy, and (2) dynamic expert selection creates memory and transfer bottlenecks between CPU and GPU, causing high and variable latency.
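The first bottleneck can be illustrated with a minimal sketch. It assumes a plain symmetric per-tensor INT8 quantizer and synthetic Gaussian activations. The value 200.0 is a hypothetical outlier, not a number from the paper.

```python
import random

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantize-dequantize round trip."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def mean_abs_error(values):
    """Mean absolute error introduced by the round trip above."""
    dequantized = quantize_int8(values)
    return sum(abs(a - b) for a, b in zip(values, dequantized)) / len(values)

rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# One hypothetical outlier stretches the quantization scale for every value.
with_outlier = activations + [200.0]

err_clean = mean_abs_error(activations)
err_outlier = mean_abs_error(with_outlier)
print(f"clean: {err_clean:.4f}  with outlier: {err_outlier:.4f}")
```

A single large value sets the scale for the whole tensor, so most ordinary activations collapse onto a few integer levels. Activation smoothing is meant to reduce this effect.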

Main Contribution

Hessian-Aware Quantization (HAQ): adaptive activation smoothing plus a Hessian-based weight quantizer for joint W8A8 (8-bit weight, 8-bit activation) quantization.

Precision-heterogeneous deployment: store/compute experts across CPU and GPU with dequantization-on-load to reduce runtime overhead.
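The adaptive smoothing idea can be sketched as a grid search over a smoothing exponent. This is a minimal illustration that assumes a SmoothQuant-style per-channel scale, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), and a mean-squared output error objective. The paper's exact objective and search grid are not given in this summary, and the helper names (`smoothed_error`, `search_alpha`) are hypothetical.

```python
import random

def quantize(matrix):
    """Symmetric per-tensor INT8 quantize-dequantize of a list-of-rows matrix."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [[round(v / scale) * scale for v in row] for row in matrix]

def matmul(a, b):
    """Plain matrix product for small list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def smoothed_error(x, w, alpha, eps=1e-8):
    """Mean squared output error of a W8A8 matmul after smoothing with alpha."""
    d = len(w)  # number of input channels; w[j] holds the weights of channel j
    act_max = [max(max(abs(row[j]) for row in x), eps) for j in range(d)]
    wt_max = [max(max(abs(v) for v in w[j]), eps) for j in range(d)]
    # Per-channel scale that moves activation range into the weights.
    s = [act_max[j] ** alpha / wt_max[j] ** (1.0 - alpha) for j in range(d)]
    x_s = [[row[j] / s[j] for j in range(d)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(d)]
    ref = matmul(x, w)  # full-precision reference output
    out = matmul(quantize(x_s), quantize(w_s))
    diffs = [(r - o) ** 2 for rr, oo in zip(ref, out) for r, o in zip(rr, oo)]
    return sum(diffs) / len(diffs)

def search_alpha(x, w, grid):
    """Pick the exponent in the grid with the lowest calibration error."""
    return min(grid, key=lambda alpha: smoothed_error(x, w, alpha))

rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
for row in x:
    row[0] *= 50.0  # hypothetical outlier channel in the calibration data
w = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(8)]
grid = [i / 10 for i in range(11)]
best_alpha = search_alpha(x, w, grid)
print("best alpha:", best_alpha)
```

Because the scale divides the activations and multiplies the weights, the full-precision product is unchanged. Only the quantization error depends on alpha, which is why a data-driven search over it is possible.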

Key Findings

HAQ matches full-precision perplexity closely on Mixtral-8×7B

Numbers: Wikitext2: FP16 3.840 vs HAQ 3.864; C4: FP16 7.401 vs HAQ 7.427

Practical Use: You can run W8A8 inference with almost no PPL loss on evaluated models/datasets, enabling 8-bit deployments without major accuracy trade-offs.

Evidence Ref: Table II; Abstract

GPU memory usage falls substantially after quantization and heterogeneous placement

Numbers: Paper reports ~60% GPU memory reduction

Practical Use: Expect to fit many more experts on a consumer GPU; plan deployment around reduced VRAM footprint rather than full-precision sizes.

Evidence Ref: Abstract; Conclusion

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (Mixtral-8×7B, Wikitext2) | HAQ 3.864; FP16 3.840 | FP16 | +0.024 | Wikitext2 | Table II | Table II
Perplexity (Mixtral-8×7B, C4) | HAQ 7.427; FP16 7.401 | FP16 | +0.026 | C4 | Table II | Table II
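The reported deltas can be put in relative terms with a short calculation. The perplexity values are the ones listed in the results. The `ppl_delta` helper is only an illustrative name.

```python
# Perplexity values as reported for Mixtral-8×7B.
results = {
    "Wikitext2": {"fp16": 3.840, "haq": 3.864},
    "C4": {"fp16": 7.401, "haq": 7.427},
}

def ppl_delta(fp16, haq):
    """Return the absolute and relative (percent) perplexity increase."""
    absolute = haq - fp16
    relative_pct = 100.0 * absolute / fp16
    return absolute, relative_pct

for name, row in results.items():
    absolute, relative_pct = ppl_delta(row["fp16"], row["haq"])
    print(f"{name}: +{absolute:.3f} PPL ({relative_pct:.2f}% relative)")
```

Both increases come out well under 1% relative to FP16 (roughly 0.6% on Wikitext2 and 0.35% on C4).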

What To Try In 7 Days

Run HAQ-style W8A8 quantization on a small MoE model using LLMC to measure PPL vs FP16 on your data.

Profile expert activation paths on representative inputs and compute per-expert frequency histograms.

Implement a simple CPU vs GPU cost predictor (compare transfer time vs CPU compute) and test CPU-first default for single-token decoding.
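The profiling and predictor steps can be prototyped in a few lines. The sketch below is a starting point under stated assumptions, not the paper's predictor: the trace format, the function names, and every timing and bandwidth number are hypothetical placeholders to be replaced with measurements from your own hardware.

```python
from collections import Counter

def expert_histogram(routing_trace):
    """Count activations per (layer, expert).

    routing_trace: iterable of (layer_index, [expert ids chosen for one token]).
    """
    counts = Counter()
    for layer, experts in routing_trace:
        for expert in experts:
            counts[(layer, expert)] += 1
    return counts

def prefer_cpu(expert_bytes, num_tokens, pcie_bytes_per_s,
               cpu_s_per_token, gpu_s_per_token):
    """Return True when running the expert on the CPU is estimated to be
    no slower than copying it to the GPU and running it there."""
    cpu_time = num_tokens * cpu_s_per_token
    gpu_time = expert_bytes / pcie_bytes_per_s + num_tokens * gpu_s_per_token
    return cpu_time <= gpu_time

# Hypothetical trace: two tokens routed through layer 0, one through layer 1.
trace = [(0, [1, 3]), (0, [1, 5]), (1, [2, 3])]
hist = expert_histogram(trace)
print(hist.most_common(1))

# Hypothetical hardware numbers: 176 MB expert, 16 GB/s link.
params = dict(expert_bytes=176e6, pcie_bytes_per_s=16e9,
              cpu_s_per_token=2e-3, gpu_s_per_token=1e-4)
decode_on_cpu = prefer_cpu(num_tokens=1, **params)    # single-token decoding
prefill_on_cpu = prefer_cpu(num_tokens=64, **params)  # larger prefill batch
print(decode_on_cpu, prefill_on_cpu)
```

With these placeholder numbers the transfer cost dominates for a single token, so the CPU is chosen, while a 64-token batch amortizes the transfer and the GPU is chosen. That matches the CPU-first default suggested for single-token decoding.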

Optimization Features

Token Efficiency
prefill vs decoding-aware offloading decisions (batch-size aware)
Infra Optimization
reduce VRAM usage ~60% to fit more experts on consumer GPUs
Model Optimization
Hessian-aware weight quantization (GPTQ-like rows with compensation); adaptive activation smoothing (data-driven grid search)
System Optimization
predictor-based dynamic offloading; two-stage expert placement (path coverage + per-layer supplements)
Inference Optimization
INT8 low-precision GEMM on GPU; precision-heterogeneous storage (INT8 on CPU, INT8 on GPU, dequantize-on-load); LRU GPU expert cache
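
The LRU GPU expert cache listed above can be sketched with the standard library. This is a generic least-recently-used cache, not the paper's implementation. `ExpertCache` and `load_fn` are hypothetical names, and `load_fn` stands in for whatever copies (and, in a dequantize-on-load setup, dequantizes) an expert onto the GPU.

```python
from collections import OrderedDict

class ExpertCache:
    """Fixed-capacity LRU cache keyed by expert id."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self._load_fn = load_fn      # hypothetical loader: expert id -> weights
        self._slots = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._slots:
            self._slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._slots[expert_id]
        self.misses += 1
        weights = self._load_fn(expert_id)
        self._slots[expert_id] = weights
        if len(self._slots) > self.capacity:
            self._slots.popitem(last=False)     # evict least recently used
        return weights

    def resident(self):
        """Expert ids currently cached, least recently used first."""
        return list(self._slots)

cache = ExpertCache(capacity=2, load_fn=lambda expert_id: f"weights[{expert_id}]")
for expert_id in ["a", "b", "a", "c", "b"]:
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.resident())
```

Tracking `hits` and `misses` gives the expert hit rate, which is one of the metrics the paper reports.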

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments focus on OPT series and Mixtral-8×7B and two text datasets; results may vary on other models or modalities.

Runtime gains depend on CPU compute and PCIe bandwidth; slow CPUs or different interconnects can invert the predictor decision.

When Not To Use

When the model already fits in GPU VRAM without CPU assistance; the CPU–GPU scheme then adds complexity with little benefit.

When strict worst-case single-token latency is required and unpredictable cache misses are unacceptable.

Failure Modes

Activation distributions with extreme, dataset-specific outliers could still hurt joint quantization if calibration data is unrepresentative.

Predictor misestimation can choose CPU computation when transfer-to-GPU would be better (or vice versa), causing latency spikes.

Core Entities

Models

OPT-6.7B, OPT-13B, OPT-30B, Mixtral-8×7B

Metrics

Perplexity (PPL), GPU memory usage, Expert hit rate, Inference latency (mean and std dev)

Datasets

Wikitext2, C4

Context Entities

Models

LLaMA (referenced), other MoE references (EdgeMoE, Fiddler)

Metrics

Perplexity

Datasets

Wikitext2, C4