Joint Hessian-aware 8-bit quantization plus CPU–GPU expert scheduling for MoE edge deployment

August 10, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang

Why It Matters For Business

HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.

Summary TLDR

This paper presents HAQ: a practical pipeline that (1) adapts activation smoothing and uses Hessian-informed weight quantization to jointly quantize weights and activations to 8 bits, and (2) pairs that compression with CPU–GPU expert offloading, predictor-driven routing, an LRU GPU expert cache, and a two-stage expert placement policy. On the OPT series and Mixtral-8×7B (Wikitext2/C4), HAQ reaches near-FP16 perplexity while reducing GPU memory by ~60% and improving expert hit-rate stability versus baselines.

Problem Statement

MoE models are hard to run on edge hardware because (1) activation outliers break low-bit quantization accuracy, and (2) dynamic expert selection creates memory and transfer bottlenecks between CPU and GPU, causing high and variable latency.

Main Contribution

Hessian-Aware Quantization (HAQ): an adaptive activation smoothing plus Hessian-based weight quantizer for joint W8A8 quantization.
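To make the activation-smoothing half of this concrete, here is a minimal sketch of a data-driven grid search over a smoothing exponent. It assumes the familiar SmoothQuant-style per-channel rescaling (divide activation channel j by s_j, multiply weight row j by s_j) and a plain squared-error objective; the paper's exact scaling rule and search objective may differ, and the Hessian-based weight quantizer is not shown. All function names are illustrative.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x d) and b (d x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fake_quant(matrix, bits=8):
    """Symmetric per-tensor quantize-then-dequantize (simulated INT8)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for row in matrix for v in row) / qmax) or 1.0
    return [[max(-qmax, min(qmax, round(v / scale))) * scale for v in row]
            for row in matrix]


def smooth(x, w, alpha):
    """Assumed SmoothQuant-style rescaling: x[:, j] / s_j and w[j, :] * s_j."""
    d = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(d)]
    w_max = [max(abs(v) for v in w[j]) for j in range(d)]
    s = [max(act_max[j], 1e-8) ** alpha / max(w_max[j], 1e-8) ** (1 - alpha)
         for j in range(d)]
    xs = [[row[j] / s[j] for j in range(d)] for row in x]
    ws = [[v * s[j] for v in w[j]] for j in range(d)]
    return xs, ws


def w8a8_error(x, w, alpha):
    """Squared output error of simulated W8A8 after smoothing with alpha."""
    ref = matmul(x, w)
    xs, ws = smooth(x, w, alpha)
    out = matmul(fake_quant(xs), fake_quant(ws))
    return sum((a - b) ** 2 for ra, rb in zip(ref, out) for a, b in zip(ra, rb))


def search_alpha(x, w, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search: pick the alpha with the lowest calibration error."""
    return min(grid, key=lambda alpha: w8a8_error(x, w, alpha))
```

Here `x` would be a small batch of calibration activations and `w` the matching weight matrix. The rescaling leaves the full-precision product unchanged, so the search only trades quantization difficulty between activations and weights.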

Precision-heterogeneous deployment: store/compute experts across CPU and GPU with dequantization-on-load to reduce runtime overhead.

CPU–GPU collaborative runtime: a lightweight latency predictor, LRU GPU expert cache, and a two-stage (path coverage + per-layer supplementation) expert placement strategy that balances hit rate and stability.
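The LRU GPU expert cache mentioned above can be sketched in a few lines. This is an illustrative version, not the paper's implementation: the class name, the `(layer, expert)` key, and the `load_fn` callback (standing in for a CPU-to-GPU weight transfer) are all assumptions.

```python
from collections import OrderedDict


class LRUExpertCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._slots = OrderedDict()  # (layer, expert) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, key, load_fn):
        """Return cached weights, or load them via load_fn(key) on a miss."""
        if key in self._slots:
            self._slots.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._slots[key]
        self.misses += 1
        weights = load_fn(key)  # hypothetical host-to-device transfer
        self._slots[key] = weights
        if len(self._slots) > self.capacity:
            self._slots.popitem(last=False)  # drop the least recently used entry
        return weights

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` per layer rather than globally would give the per-layer balance figures that the placement strategy is meant to stabilize.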

Key Findings

HAQ matches full-precision perplexity closely on Mixtral-8×7B

Numbers: Wikitext2: FP16 3.840 vs HAQ 3.864; C4: FP16 7.401 vs HAQ 7.427
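As a quick worked check on how close these figures are, the relative perplexity gap can be computed directly from the reported numbers. The helper name is illustrative.

```python
def relative_gap(quantized_ppl, fp16_ppl):
    """Relative perplexity increase of the quantized model over FP16."""
    return (quantized_ppl - fp16_ppl) / fp16_ppl


# (HAQ, FP16) perplexities as reported for Mixtral-8×7B
results = {"Wikitext2": (3.864, 3.840), "C4": (7.427, 7.401)}
gaps = {name: relative_gap(q, f) for name, (q, f) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap:.2%}")
```

Both gaps come out well under 1% (about 0.6% on Wikitext2 and about 0.35% on C4).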

GPU memory usage falls substantially after quantization and heterogeneous placement

Numbers: Paper reports ~60% GPU memory reduction

Two-stage expert placement (Scheme 3) keeps hit-rate high while stabilizing per-layer balance

Numbers: with 128 experts, Scheme 3 mean hit-rate 56.6% vs 57.9% for Scheme 2; std dev 3.7% vs 11.9%

Results

Perplexity (Mixtral-8×7B, Wikitext2)

Value: HAQ 3.864; FP16 3.840

Baseline: FP16

Perplexity (Mixtral-8×7B, C4)

Value: HAQ 7.427; FP16 7.401

Baseline: FP16

GPU memory usage

Value: Reduced by ~60%

Baseline: full-precision deployment

Expert hit-rate and stability (128 experts)

Value: Scheme 3 mean hit-rate 56.6%, std dev 3.7%

Baseline: Scheme 2 mean 57.9%, std dev 11.9%

Inference latency std dev

Value: Scheme 3 std dev 52% lower than Scheme 2 (p<0.01)

Baseline: Scheme 2

Who Should Care

What To Try In 7 Days

Run HAQ-style W8A8 quantization on a small MoE model using LLMC to measure PPL vs FP16 on your data.

Profile expert activation paths on representative inputs and compute per-expert frequency histograms.

Implement a simple CPU vs GPU cost predictor (compare transfer time vs CPU compute) and test CPU-first default for single-token decoding.
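A first version of the CPU-versus-GPU cost predictor from the last step could look like the sketch below. It is a simple analytic model, not the paper's predictor: the `HardwareProfile` fields, the function name, and every number in the usage example are hypothetical and should be replaced with values measured on your own hardware.

```python
from dataclasses import dataclass


@dataclass
class HardwareProfile:
    pcie_bytes_per_s: float  # measured host-to-device bandwidth
    cpu_flops: float         # sustained CPU throughput for expert GEMMs
    gpu_flops: float         # sustained GPU throughput for expert GEMMs


def choose_device(expert_bytes, flops_per_token, n_tokens, hw, on_gpu=False):
    """Pick where to run one expert for a batch of n_tokens tokens."""
    if on_gpu:
        return "gpu"  # cache hit: no transfer cost to weigh
    cpu_time = flops_per_token * n_tokens / hw.cpu_flops
    transfer_time = expert_bytes / hw.pcie_bytes_per_s
    gpu_time = flops_per_token * n_tokens / hw.gpu_flops
    # Run on CPU when that beats paying the transfer plus GPU compute.
    return "cpu" if cpu_time <= transfer_time + gpu_time else "gpu"


# Hypothetical numbers for illustration only.
hw = HardwareProfile(pcie_bytes_per_s=16e9, cpu_flops=1e12, gpu_flops=100e12)
decode_choice = choose_device(176e6, 352e6, n_tokens=1, hw=hw)
prefill_choice = choose_device(176e6, 352e6, n_tokens=512, hw=hw)
```

With these illustrative numbers, single-token decoding stays on the CPU because the transfer dominates, while a 512-token prefill amortizes the transfer and moves to the GPU. That is the batch-size-aware behavior the CPU-first default is meant to capture.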

Optimization Features

Token Efficiency

  • prefill vs decoding-aware offloading decisions (batch-size aware)

Infra Optimization

  • reduce VRAM usage ~60% to fit more experts on consumer GPUs

Model Optimization

  • Hessian-aware weight quantization (GPTQ-like row-wise quantization with error compensation)
  • adaptive activation smoothing (data-driven grid search)

System Optimization

  • predictor-based dynamic offloading
  • two-stage expert placement (path coverage + per-layer supplements)
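One way to read the two-stage placement is sketched below: first cover the most frequent full activation paths, then top up whichever layer has the fewest resident experts. This is an interpretation under stated assumptions, not the paper's algorithm: it assumes one expert per layer per sample (top-1 routing), a fixed budget split between stages, and frequency-based tie-breaking. The function and parameter names are hypothetical.

```python
from collections import Counter


def two_stage_placement(paths, budget, stage1_budget):
    """paths[i][l] is the expert id used at layer l for profiled sample i.

    Returns a set of (layer, expert) pairs to keep on the GPU.
    """
    n_layers = len(paths[0])
    placed = set()

    # Stage 1 (path coverage): admit whole paths, most frequent first.
    for path, _count in Counter(paths).most_common():
        new = set(enumerate(path)) - placed
        if len(placed) + len(new) > stage1_budget:
            break
        placed |= new

    # Stage 2 (per-layer supplementation): give the next slot to the layer
    # with the fewest resident experts, choosing its most frequent missing one.
    freq = [Counter(p[layer] for p in paths) for layer in range(n_layers)]
    while len(placed) < budget:
        per_layer = Counter(layer for layer, _ in placed)
        added = False
        for layer in sorted(range(n_layers), key=lambda l: per_layer[l]):
            for expert, _count in freq[layer].most_common():
                if (layer, expert) not in placed:
                    placed.add((layer, expert))
                    added = True
                    break
            if added:
                break
        if not added:
            break  # every observed expert is already placed
    return placed
```

The per-layer `Counter` objects in stage 2 double as the per-expert frequency histograms gathered during profiling, so the same offline statistics feed both stages.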

Inference Optimization

  • INT8 low-precision GEMM on GPU
  • precision-heterogeneous storage (INT8 on CPU, INT8 on GPU, dequantize-on-load)
  • LRU GPU expert cache
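The storage side of this list can be illustrated with a small round trip: quantize an expert's weights to one byte each, then dequantize when the expert is loaded. This sketch assumes symmetric per-row INT8 with one floating-point scale per row, which is a common choice but not necessarily the layout the paper uses; the function names are illustrative.

```python
from array import array


def quantize_expert(weight):
    """Store each row as (scale, signed-byte array): one byte per weight."""
    rows = []
    for row in weight:
        scale = (max(abs(v) for v in row) / 127) or 1.0
        rows.append((scale, array("b", (round(v / scale) for v in row))))
    return rows


def dequantize_on_load(rows):
    """Rebuild floating-point weights when the expert is brought in."""
    return [[q * scale for q in qrow] for scale, qrow in rows]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.1]]
packed = quantize_expert(weight)
restored = dequantize_on_load(packed)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), and the signed-byte arrays use one byte per weight regardless of the original precision.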

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus on OPT series and Mixtral-8×7B and two text datasets; results may vary on other models or modalities.
  • Runtime gains depend on CPU compute and PCIe bandwidth; slow CPUs or different interconnects can invert the predictor decision.
  • The paper provides no public code or end-to-end benchmarks, which limits immediate reproducibility.

When Not To Use

  • If you have only a GPU with ample VRAM and no CPU assistance, the CPU–GPU scheme adds complexity with little benefit.
  • When strict worst-case single-token latency is required and unpredictable cache misses are unacceptable.
  • For non-MoE dense models where expert routing logic does not apply.

Failure Modes

  • Activation distributions with extreme, dataset-specific outliers could still hurt joint quantization if calibration data is unrepresentative.
  • Predictor misestimation can choose CPU computation when transfer-to-GPU would be better (or vice versa), causing latency spikes.
  • Cache thrashing when the real workload differs from offline activation statistics, leading to frequent transfers and degraded performance.

Core Entities

Models

  • OPT-6.7B
  • OPT-13B
  • OPT-30B
  • Mixtral-8×7B

Metrics

  • Perplexity (PPL)
  • GPU memory usage
  • Expert hit rate
  • Inference latency (mean and std dev)

Datasets

  • Wikitext2
  • C4

Context Entities

Models

  • LLaMA (referenced)
  • other MoE references (EdgeMoE, Fiddler)

Metrics

  • Perplexity

Datasets

  • Wikitext2
  • C4