Joint Hessian-aware 8-bit quantization plus CPU–GPU expert scheduling for MoE edge deployment

Overview

Decision Snapshot: Ready For Pilot

The method combines known Hessian-based post-training quantization (PTQ) ideas with adaptive activation smoothing and a practical CPU–GPU runtime. Experiments on multiple models and datasets support the claims, but code is not provided and hardware specifics affect portability.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang

Why It Matters For Business

HAQ plus CPU–GPU scheduling lets teams run large MoE models on consumer GPUs with near-full-precision accuracy, lowering GPU costs and enabling predictable, lower-latency edge services.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager

Summary TLDR

This paper presents HAQ, a practical pipeline that (1) adapts activation smoothing and uses Hessian-informed weight quantization to jointly quantize weights and activations to 8 bits, and (2) pairs that compression with CPU–GPU expert offloading, predictor-driven routing, an LRU GPU expert cache, and a two-stage expert placement policy. On the OPT series and Mixtral-8×7B (Wikitext2/C4), HAQ reaches near-FP16 perplexity while reducing GPU memory by ~60% and improving expert hit-rate stability versus baselines.

Problem Statement

MoE models are hard to run on edge hardware because (1) activation outliers break low-bit quantization accuracy, and (2) dynamic expert selection creates memory and transfer bottlenecks between CPU and GPU, causing high and variable latency.

Main Contribution

Hessian-Aware Quantization (HAQ): an adaptive activation smoothing plus Hessian-based weight quantizer for joint W8A8 quantization.

Precision-heterogeneous deployment: store/compute experts across CPU and GPU with dequantization-on-load to reduce runtime overhead.
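
The adaptive activation smoothing idea (listed under Model Optimization as a data-driven grid search) can be illustrated with a minimal sketch. It assumes a SmoothQuant-style per-channel rescaling and a simple search over a migration strength `alpha`; the function names, the grid, and the error metric are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def quantize_int8(t, axis=None):
    """Symmetric fake-quantization to 8 bits (quantize, then dequantize)."""
    max_abs = np.max(np.abs(t), axis=axis, keepdims=axis is not None)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def smooth_and_quantize(x, w, alpha):
    """x: (tokens, in_features), w: (in_features, out_features).

    Divides each activation channel by s and multiplies the matching weight
    row by s, so x @ w is unchanged before quantization.
    """
    act_max = np.maximum(np.max(np.abs(x), axis=0), 1e-8)
    w_max = np.maximum(np.max(np.abs(w), axis=1), 1e-8)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    x_s = x / s
    w_s = w * s[:, None]
    # Per-tensor activations, per-output-channel weights (an assumption).
    return quantize_int8(x_s) @ quantize_int8(w_s, axis=0)

def search_alpha(x, w, grid=None):
    """Pick the alpha on the grid that minimizes output MSE on calibration data."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    ref = x @ w
    errors = [float(np.mean((smooth_and_quantize(x, w, a) - ref) ** 2)) for a in grid]
    best = int(np.argmin(errors))
    return float(grid[best]), errors[best]
```

On calibration activations with one or two outlier channels, the selected `alpha` typically gives a much lower output error than quantizing `x` and `w` directly, which is the motivation for smoothing before joint W8A8 quantization.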

Key Findings

HAQ matches full-precision perplexity closely on Mixtral-8×7B

Numbers: Wikitext2: FP16 3.840 vs HAQ 3.864; C4: FP16 7.401 vs HAQ 7.427

Practical Use: You can run W8A8 inference with almost no PPL loss on the evaluated models and datasets, enabling 8-bit deployments without major accuracy trade-offs.

Evidence Ref: Table II; Abstract

GPU memory usage falls substantially after quantization and heterogeneous placement

Numbers: The paper reports a ~60% reduction in GPU memory

Practical Use: Expect to fit many more experts on a consumer GPU; plan deployment around the reduced VRAM footprint rather than full-precision sizes.

Evidence Ref: Abstract; Conclusion

Results

Metric	HAQ	FP16 (baseline)	Delta	Dataset	Evidence Ref
Perplexity (Mixtral-8×7B)	3.864	3.840	+0.024	Wikitext2	Table II
Perplexity (Mixtral-8×7B)	7.427	7.401	+0.026	C4	Table II
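
To put the absolute deltas in context, the relative perplexity increase over FP16 can be computed from the reported numbers. This is simple arithmetic on the table values; the helper name is illustrative. The increases work out to roughly 0.6% on Wikitext2 and roughly 0.35% on C4.

```python
# Reported perplexities for Mixtral-8x7B (Table II of the paper).
results = {
    "Wikitext2": {"fp16": 3.840, "haq": 3.864},
    "C4": {"fp16": 7.401, "haq": 7.427},
}

def relative_increase(fp16, haq):
    """Perplexity increase of HAQ over the FP16 baseline, in percent."""
    return (haq - fp16) / fp16 * 100.0

for name, r in results.items():
    delta = r["haq"] - r["fp16"]
    print(f"{name}: +{delta:.3f} PPL, {relative_increase(**r):.2f}% over FP16")
```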

What To Try In 7 Days

Run HAQ-style W8A8 quantization on a small MoE model with LLMC and compare PPL against FP16 on your own data.

Profile expert activation paths on representative inputs and compute per-expert frequency histograms.

Implement a simple CPU vs GPU cost predictor (compare transfer time vs CPU compute) and test CPU-first default for single-token decoding.
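
The last two steps above can be prototyped in a few lines. The sketch below assumes a routing trace of `(layer, expert_id)` pairs and a deliberately simple cost model (transfer time plus per-token compute time); all function names, parameters, and the numbers in the usage note are hypothetical and would need to be measured on the target hardware.

```python
from collections import Counter

def expert_histogram(routing_trace):
    """Per-layer activation frequency for each (layer, expert_id) pair.

    routing_trace: iterable of (layer, expert_id) tuples, one per routed token.
    """
    counts = Counter(routing_trace)
    total_per_layer = Counter()
    for (layer, _), n in counts.items():
        total_per_layer[layer] += n
    return {key: n / total_per_layer[key[0]] for key, n in counts.items()}

def choose_device(expert_bytes, num_tokens, pcie_bytes_per_s,
                  cpu_s_per_token, gpu_s_per_token, on_gpu=False):
    """Return "cpu" or "gpu" by comparing estimated costs in seconds.

    If the expert is not already cached on the GPU, the GPU path pays the
    host-to-device transfer time on top of its compute time.
    """
    gpu_cost = num_tokens * gpu_s_per_token
    if not on_gpu:
        gpu_cost += expert_bytes / pcie_bytes_per_s
    cpu_cost = num_tokens * cpu_s_per_token
    return "cpu" if cpu_cost <= gpu_cost else "gpu"
```

With illustrative values such as a 100 MB expert, 10 GB/s of transfer bandwidth, 2 ms per token on CPU and 0.1 ms per token on GPU, a single decoding token favors the CPU while a 64-token prefill batch favors transferring to the GPU, which is the batch-size sensitivity the CPU-first default is meant to capture.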

Optimization Features

Token Efficiency

prefill vs decoding-aware offloading decisions (batch-size aware)

Infra Optimization

reduce VRAM usage by ~60% to fit more experts on consumer GPUs

Model Optimization

Hessian-aware weight quantization (GPTQ-like rows with compensation)

adaptive activation smoothing (data-driven grid search)
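
For readers unfamiliar with GPTQ-like quantization with compensation, the following is a minimal NumPy sketch of the general technique: columns are quantized one at a time and the rounding error is pushed onto the not-yet-quantized columns using the inverse of a layer-wise Hessian proxy. It is a simplified illustration under stated assumptions (fixed per-row scales, no blocking, a hypothetical `damp` parameter), not HAQ's actual quantizer.

```python
import numpy as np

def rtn(w, scale):
    """Round-to-nearest symmetric INT8 fake-quantization."""
    return np.clip(np.round(w / scale), -127, 127) * scale

def gptq_like_quantize(w, x, damp=0.01):
    """w: (out_features, in_features); x: (samples, in_features) calibration inputs."""
    w = w.astype(np.float64).copy()
    # Per-row scales are fixed up front from the original weights (an assumption).
    scale = np.maximum(np.max(np.abs(w), axis=1, keepdims=True), 1e-8) / 127.0
    # Hessian proxy of the layer-wise squared error, with damping for stability.
    h = 2.0 * x.T @ x
    h += damp * np.mean(np.diag(h)) * np.eye(h.shape[0])
    # Upper Cholesky factor u of the inverse Hessian: inv(h) = u.T @ u.
    u = np.linalg.cholesky(np.linalg.inv(h)).T
    q = np.zeros_like(w)
    for j in range(w.shape[1]):
        q[:, j] = rtn(w[:, j], scale[:, 0])
        # Scaled rounding error of column j for every output row.
        err = (w[:, j] - q[:, j]) / u[j, j]
        # Compensate: adjust the remaining columns to absorb the error.
        w[:, j + 1:] -= np.outer(err, u[j, j + 1:])
    return q
```

On correlated calibration inputs, the compensated result usually has a lower layer-output error than plain `rtn` applied to the whole matrix, at the cost of one matrix inversion per layer.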

System Optimization

predictor-based dynamic offloading

two-stage expert placement (path coverage + per-layer supplements)

Inference Optimization

INT8 low-precision GEMM on GPU

precision-heterogeneous storage (INT8 on CPU, INT8 on GPU, dequantize-on-load)

LRU GPU expert cache
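
An LRU GPU expert cache can be sketched with the standard library alone. The class below is a hypothetical illustration: `load_fn` stands in for whatever copies (and, if needed, dequantizes) an expert's weights onto the GPU, and the hit-rate counter mirrors the expert hit-rate metric the paper reports. The class and method names are assumptions.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical loader: key -> expert weights
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self.load_fn(key)           # e.g. transfer from CPU memory
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keys would typically be `(layer, expert_id)` pairs. Replaying a recorded routing trace through `get` gives a quick estimate of the hit rate for a given cache capacity before any GPU code is written.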

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Experiments focus on OPT series and Mixtral-8×7B and two text datasets; results may vary on other models or modalities.

Runtime gains depend on CPU compute and PCIe bandwidth; slow CPUs or different interconnects can invert the predictor decision.

When Not To Use

If your GPU has ample VRAM and needs no CPU assistance, the CPU–GPU scheme adds complexity with little benefit.

If you need strict worst-case single-token latency, unpredictable expert-cache misses may be unacceptable.

Failure Modes

Activation distributions with extreme, dataset-specific outliers could still hurt joint quantization if calibration data is unrepresentative.

Predictor misestimation can choose CPU computation when transfer-to-GPU would be better (or vice versa), causing latency spikes.

Core Entities

Models

OPT-6.7B, OPT-13B, OPT-30B, Mixtral-8×7B

Metrics

Perplexity (PPL), GPU memory usage, Expert hit rate, Inference latency (mean and std dev)

Datasets

Wikitext2, C4

Context Entities

Models

LLaMA (referenced), other MoE references (EdgeMoE, Fiddler)

Metrics

Perplexity

Datasets

Wikitext2, C4
