LLMEasyQuant: modular, hardware-aware quantization runtime for multi‑GPU and distributed LLM serving

June 28, 2024 · 7 min read

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Dong Liu, Yanxuan Yu

Links

Abstract / PDF

Why It Matters For Business

LLMEasyQuant lowers calibration and deployment overhead, lets you fit larger models on the same GPUs, and delivers small, steady throughput gains, which helps when you must amortize expensive GPU fleets.

Summary TLDR

LLMEasyQuant is a modular quantization toolkit and runtime that bundles multiple post-training quantizers (Symmetric, ZeroQuant, SmoothQuant, SimQuant, AWQ, GPTQ) with fused CUDA kernels, NCCL synchronization, and online scaling. It targets single-node multi‑GPU, multi‑node, and edge deployments. On evaluated models (GPT-2, LLaMA, Mistral, Qwen), it reports near-linear multi-GPU scaling, small perplexity loss for 8-bit quantization, optional mixed-precision model size reductions up to 3.2×, and modest throughput gains (~1–1.5%) over competing toolkits on measured benchmarks.

Problem Statement

Existing quantization toolkits are often opaque, hardware-tied, and hard to customize. That makes it slow to experiment and hard to scale quantized LLM inference across multi‑GPU or distributed setups.

Main Contribution

A modular quantization library that unifies Symmetric, ZeroQuant, SmoothQuant, SimQuant, AWQ, and GPTQ under a single API.

A system-aware runtime: fused CUDA kernels, NCCL-based synchronization, asynchronous per-shard quantization, and ONNX export for edge runtimes.

A broad empirical study on GPT-2, LLaMA-7B/13B, Mistral-7B, Qwen3-14B showing tradeoffs in perplexity, latency, memory, and calibration size.
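
To make the simplest of the listed quantizers concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a textbook formulation, not LLMEasyQuant's own code; the function names `quantize_symmetric` and `dequantize` are hypothetical and do not come from the library.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric per-tensor quantization: one shared scale, zero maps to zero."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer and clamp to the signed range [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [qi * scale for qi in q]


weights = [0.02, -1.27, 0.635, 0.0, 1.0]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
# Rounding error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The other methods in the list (SmoothQuant, AWQ, GPTQ and so on) differ mainly in how they choose scales or adjust weights before this rounding step.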

Key Findings

LLMEasyQuant achieves 2,156 tokens/s on LLaMA-7B with INT8 quantization.

Numbers: Throughput 2,156 tok/s (LLaMA-7B, 8K context)

Throughput gains over other quantizers are small but consistent.

Numbers: 1.0–1.5% throughput improvement vs GPTQ/AWQ/TensorRT on evaluated models

Calibration data and setup time requirements are much lower in reported runs.

Numbers: Calibration data reduced to 16 samples (vs 128) and setup time reduced by ~33% for GPT-2 117M

Mixed-precision bitwidth search can shrink model storage up to 3.2× with acceptable accuracy loss.

Numbers: Up to 3.2× model size reduction reported (mixed-precision search)

SmoothQuant and SimQuant reduce perplexity vs baseline 8-bit quantization in experiments.

Numbers: Up to 20% relative perplexity reduction vs baseline 8-bit in experiments

Results

Throughput (LLaMA-7B)

Value: 2,156 tok/s (LLMEasyQuant INT8)

Baseline: GPTQ 1,987 tok/s

Perplexity (GPT-2 117M)

Value: 6.31 (LLMEasyQuant with SmoothQuant)

Baseline: GPTQ 7.23

Calibration data required (GPT-2 117M)

Value: 16 samples (LLMEasyQuant default)

Baseline: TensorRT-LLM 128 samples

Setup time (GPT-2 117M)

Value: 2 min (LLMEasyQuant)

Baseline: TensorRT-LLM 3 min

Model storage reduction (mixed-precision)

Value: Up to 3.2× smaller model

Baseline: FP16 storage
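
The relative differences implied by the value/baseline pairs above can be worked out directly. The short script below uses only the figures listed in this section. The last line is an inference rather than a reported number: a 3.2× reduction from FP16 corresponds to roughly 5 bits per weight on average.

```python
# Figures copied from the Results entries above.
throughput_gain = 2156 / 1987 - 1        # LLMEasyQuant INT8 vs GPTQ, LLaMA-7B
perplexity_drop = (7.23 - 6.31) / 7.23   # SmoothQuant vs GPTQ, GPT-2 117M
calibration_ratio = 128 / 16             # TensorRT-LLM samples / LLMEasyQuant samples
setup_time_saving = (3 - 2) / 3          # minutes, GPT-2 117M
avg_bits_at_3_2x = 16 / 3.2              # average bits per weight at 3.2x vs FP16

print(f"throughput gain vs GPTQ:   {throughput_gain:.1%}")    # 8.5%
print(f"perplexity drop vs GPTQ:   {perplexity_drop:.1%}")    # 12.7%
print(f"calibration data ratio:    {calibration_ratio:.0f}x")  # 8x
print(f"setup time saving:         {setup_time_saving:.1%}")  # 33.3%
print(f"avg bits/weight at 3.2x:   {avg_bits_at_3_2x:.1f}")   # 5.0
```

The throughput ratio from these two rows comes out near 8.5%, which is not the same quantity as the 1.0–1.5% range quoted under Key Findings. The source does not say which configurations each figure covers.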

Who Should Care

What To Try In 7 Days

Install LLMEasyQuant and quantize a small model (GPT-2) to reproduce throughput/perplexity numbers.

Run per-layer mixed-precision search to trade storage for accuracy on a target model.

Enable fused kernels on your GPU cluster and measure end-to-end latency under your workloads.
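
For the mixed-precision step, the source does not describe the search algorithm, so the following is only one plausible shape for it: a greedy pass that drops the least sensitive layers from 8 to 4 bits while an error budget holds. The layer names, parameter counts, error estimates and the function names `greedy_bitwidth_search` and `compression_vs_fp16` are all hypothetical.

```python
def greedy_bitwidth_search(layers, error_budget, high=8, low=4):
    """layers maps name -> (num_params, {bits: estimated_error}).

    Start every layer at `high` bits, then lower layers to `low` bits in order
    of increasing added error per parameter, as long as the budget is respected.
    """
    assignment = {name: high for name in layers}
    total_error = sum(errs[high] for _, errs in layers.values())

    def added_error_per_param(name):
        params, errs = layers[name]
        return (errs[low] - errs[high]) / params

    for name in sorted(layers, key=added_error_per_param):
        _, errs = layers[name]
        added = errs[low] - errs[high]
        if total_error + added <= error_budget:
            assignment[name] = low
            total_error += added
    return assignment, total_error


def compression_vs_fp16(layers, assignment):
    """Ratio of FP16 storage to storage under the chosen bitwidths."""
    fp16_bits = sum(params for params, _ in layers.values()) * 16
    quant_bits = sum(layers[name][0] * bits for name, bits in assignment.items())
    return fp16_bits / quant_bits


# Made-up sensitivity estimates, for illustration only.
layers = {
    "attn.qkv": (3_000_000, {8: 0.01, 4: 0.20}),
    "attn.out": (1_000_000, {8: 0.01, 4: 0.05}),
    "mlp.up":   (4_000_000, {8: 0.01, 4: 0.04}),
    "mlp.down": (4_000_000, {8: 0.01, 4: 0.30}),
}
assignment, total_error = greedy_bitwidth_search(layers, error_budget=0.15)
ratio = compression_vs_fp16(layers, assignment)
print(assignment, round(total_error, 3), round(ratio, 2))
```

In practice the error estimates would come from measuring perplexity (or a proxy) on calibration data with one layer quantized at a time.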

Optimization Features

Infra Optimization

  • multi-node RDMA/InfiniBand support
  • PyTorch DDP and TCP fallback
  • Tensor Core INT8 utilization

Model Optimization

  • post-training weight quantization
  • activation-aware calibration
  • mixed-precision per-layer bitwidth search
  • per-channel scaling (SmoothQuant)
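
The per-channel scaling item can be illustrated with the published SmoothQuant rule, s_j = max|X_j|^α / max|W_j|^(1−α), which moves activation outliers into the weights without changing the layer output. The sketch below is a pure-Python illustration with toy matrices; the helper names are hypothetical and this is not LLMEasyQuant's code.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel smoothing factors s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, weight_absmax)]


def matmul(x, w):
    """Plain nested-list matrix product: x is (rows, k), w is (k, cols)."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


x = [[0.5, 40.0], [-1.0, -60.0]]   # activations; input channel 1 has outliers
w = [[0.8, -0.4], [0.02, 0.01]]    # weights; rows index input channels

channels = range(len(w))
act_absmax = [max(abs(row[k]) for row in x) for k in channels]
weight_absmax = [max(abs(v) for v in w[k]) for k in channels]
s = smooth_scales(act_absmax, weight_absmax)

# Divide activations by s and multiply the matching weight rows by s.
x_smooth = [[row[k] / s[k] for k in channels] for row in x]
w_smooth = [[w[k][j] * s[k] for j in range(len(w[0]))] for k in channels]

y = matmul(x, w)
y_smooth = matmul(x_smooth, w_smooth)
smoothed_act_absmax = [max(abs(row[k]) for row in x_smooth) for k in channels]
print(act_absmax, smoothed_act_absmax)
```

After smoothing, the per-channel activation ranges are much closer to each other, so a single INT8 scale wastes fewer quantization levels on the outlier channel.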

System Optimization

  • NCCL-based global synchronization
  • HBM↔SRAM tiling and fused I/O
  • ONNX-compatible quantized exports
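
To show why a global synchronization step matters, the sketch below simulates sharded quantization in a single process: each shard computes a local absolute maximum, the values are gathered, and every shard derives the same scale. The `all_gather` function here is a local stand-in, not a real collective, and the assumption that LLMEasyQuant synchronizes scales exactly this way is mine. In a PyTorch deployment the equivalent call would likely be `torch.distributed.all_gather` over the NCCL backend.

```python
def all_gather(local_values):
    """Stand-in for a collective: every rank receives every rank's value."""
    return [list(local_values) for _ in local_values]


QMAX = 127  # symmetric INT8 range

# One list of weights per (simulated) GPU shard.
shards = [[0.1, -0.9, 0.3], [2.4, -0.2], [-1.1, 0.7, 0.05]]

# Step 1: each rank computes a statistic over its own shard only.
local_absmax = [max(abs(v) for v in shard) for shard in shards]

# Step 2: gather, then reduce locally so every rank agrees on one scale.
gathered = all_gather(local_absmax)
global_scales = [max(view) / QMAX for view in gathered]

# Step 3: each rank quantizes its shard with the shared scale.
quantized = [[round(v / scale) for v in shard]
             for shard, scale in zip(shards, global_scales)]
print(global_scales[0], quantized)
```

If the gather were skipped or interrupted, each shard would use its own scale and the integer codes would no longer be comparable across devices.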

Inference Optimization

  • fused quantize+GEMM CUDA kernels
  • online adaptive scaling for activations
  • KV-cache quantization (SimQuant)
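
The source names online adaptive scaling for activations without giving the update rule. One common way to do it, shown here purely as an assumption, is to track an exponential moving average of the activation absolute maximum and derive the INT8 scale from it. The class name `EmaActivationScale` and the momentum value are hypothetical.

```python
class EmaActivationScale:
    """Tracks a running abs-max of activations and quantizes with it."""

    def __init__(self, momentum=0.9, qmax=127):
        self.momentum = momentum
        self.qmax = qmax
        self.running_absmax = None

    def update(self, activations):
        """Fold one batch of activations into the running statistic."""
        batch_absmax = max(abs(a) for a in activations)
        if self.running_absmax is None:
            self.running_absmax = batch_absmax
        else:
            self.running_absmax = (self.momentum * self.running_absmax
                                   + (1 - self.momentum) * batch_absmax)
        return self.scale

    @property
    def scale(self):
        # Guard against an all-zero history to avoid dividing by zero later.
        return max(self.running_absmax or 0.0, 1e-8) / self.qmax

    def quantize(self, activations):
        """Round to INT8 codes, clamping values beyond the tracked range."""
        s = self.scale
        return [max(-self.qmax, min(self.qmax, round(a / s)))
                for a in activations]


tracker = EmaActivationScale()
for batch in ([1.0, -0.5], [1.2, 0.3], [4.0, -0.1]):
    tracker.update(batch)
codes = tracker.quantize([4.0, -0.1, 0.0])
print(tracker.running_absmax, codes)
```

A higher momentum makes the scale more stable but slower to follow shifts in the activation distribution; outliers beyond the tracked range are clamped.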

Reproducibility

Data Urls

  • WikiText-2 (validation) as used in experiments

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on CUDA/NCCL and GPU intrinsics; portability to non-CUDA hardware is limited.
  • In some of the reported tables, memory use is higher rather than lower; fused kernels and dequantization can increase peak memory in certain layouts.
  • Results use standard LM perplexity and throughput; other downstream tasks (e.g., instruction-following) are not measured here.

When Not To Use

  • In non-CUDA or non-NCCL environments, where fused kernels are unavailable.
  • If you need exact FP16/FP32 fidelity for sensitive tasks.
  • For tiny models where quantization gains are negligible.

Failure Modes

  • Poor calibration samples can cause large perplexity degradation.
  • Non-NCCL networking or interrupted AllGather can break distributed consistency.
  • Fused kernel register/shared-memory pressure may reduce occupancy on some GPUs.

Core Entities

Models

  • GPT-2 (117M, 345M)
  • LLaMA-7B
  • LLaMA-13B
  • Mistral-7B
  • Qwen3-14B

Metrics

  • Perplexity
  • Throughput (tok/s)
  • Memory (GB)
  • Setup Time (min)
  • Calibration Data (samples)
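
For reference, the two headline metrics have standard definitions: perplexity is the exponential of the mean negative log-likelihood per token, and throughput is generated tokens divided by wall-clock time. The sketch below uses made-up inputs; the function names are illustrative and not taken from the toolkit.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def tokens_per_second(num_tokens, elapsed_seconds):
    """Throughput as generated tokens divided by wall-clock seconds."""
    return num_tokens / elapsed_seconds


# A model that assigns probability 1/4 to every correct token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
throughput = tokens_per_second(4096, 2.0)
print(uniform_ppl, throughput)
```

Lower perplexity is better, so a quantized model is usually judged by how little its perplexity rises over the full-precision baseline on the same evaluation set.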

Datasets

  • WikiText-2 (validation)

Benchmarks

  • perplexity
  • throughput (tokens/s)
  • memory (GB)
  • setup time (min)
  • calibration data (samples)

Context Entities

Models

  • AWQ
  • GPTQ
  • TensorRT-LLM
  • ZeroQuant
  • SmoothQuant
  • SimQuant