Combine per-group 4-bit quantization with GPU-friendly group sparsity to speed LLM decoding with small accuracy loss.

December 23, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Lean Fu, Xing Mei

Links

Abstract / PDF

Why It Matters For Business

GQSA delivers roughly 3.7–4× decoding speedups over FP16 (measured on LLaMA-7B) and substantial memory savings for 7B–30B class LLMs, while matching or improving zero-shot accuracy relative to W2 and 2:4 baselines in the reported tests. That enables cheaper, faster serving on GPUs and makes edge deployment more viable.

Summary TLDR

GQSA is a compression + engine co-design that pairs per-group weight quantization (INT4) with structured group sparsity and a custom GPU kernel. It uses two optimization stages (block-wise BQPO then global E2E-OQP tuning of quant params) and a task-centric parallel engine. On LLaMA / Qwen / OPT family tests, GQSA (W4 + 40–50% sparsity) often matches or beats W2 and 2:4 baselines in accuracy while cutting latency and increasing throughput (examples: ~3.7× tokens/sec vs FP16 and up to ~4× latency reduction vs FP16 on LLaMA-7B in some settings).

Problem Statement

Decoding in LLMs (many GEMV operations) is the inference bottleneck. Existing semi-structured sparsity (2:4) is tied to NVIDIA TensorCore shapes and incompatible with weight-only quantization, limiting speedups and memory gains. There is a need for a sparsity+quantization scheme that: accelerates GEMV, works with weight-only per-group quantization, and balances accuracy vs compression.

Main Contribution

Introduce GQSA: group (row-wise) sparsity compatible with weight-only per-group quantization and stored in BSR block format.

Two-stage optimization: BQPO (block-wise weight recovery) and E2E-OQP (freeze weights, fine-tune quantization params) to restore accuracy under heavy compression.

Custom GPU inference engine with task-centric parallelism and GQSKernel to balance load and accelerate GEMV for decoding.
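To make the first contribution concrete, here is a minimal pure-Python sketch of row-wise group pruning followed by per-group 4-bit quantization. It is an illustration under assumptions, not the paper's implementation: the function names, the toy group size of 4, the asymmetric min/max quantizer, and the sum-of-squares ranking (a simple stand-in for the Hessian-based criterion listed under Model Optimization) are all hypothetical choices.

```python
def quantize_group(values, bits=4):
    """Asymmetric uniform quantization of one weight group (assumed scheme)."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo  # lo acts as the zero offset


def dequantize_group(codes, scale, zero):
    return [c * scale + zero for c in codes]


def compress_row(row, group_size=4, sparsity=0.5, bits=4):
    """Drop the lowest-energy groups of a row, then quantize the survivors."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    n_keep = max(1, round(len(groups) * (1.0 - sparsity)))
    # Rank groups by sum of squares (magnitude proxy, not the paper's criterion).
    ranked = sorted(range(len(groups)),
                    key=lambda g: sum(v * v for v in groups[g]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    # Each entry: (group index, integer codes, scale, zero offset).
    return [(g, *quantize_group(groups[g], bits)) for g in kept]


def decompress_row(entries, n_cols, group_size=4):
    """Rebuild a dense row; pruned groups stay at zero."""
    row = [0.0] * n_cols
    for g, codes, scale, zero in entries:
        row[g * group_size:(g + 1) * group_size] = dequantize_group(codes, scale, zero)
    return row


if __name__ == "__main__":
    weights = [0.9, -0.8, 0.7, 0.6, 0.01, 0.02, -0.01, 0.0,
               0.5, -0.4, 0.3, 0.2, 0.0, 0.01, 0.0, -0.02]
    packed = compress_row(weights)
    print([entry[0] for entry in packed])          # indices of kept groups
    print(decompress_row(packed, len(weights)))    # lossy reconstruction
```

Storing only the kept groups together with their column-group index is what makes a block-sparse layout such as BSR a natural fit: every stored block carries its own scale and offset.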

Key Findings

GQSA preserves accuracy better than heavy quantization or 2:4 pruning on evaluated models.

Numbers: avg +5.4% acc vs OmniQuant W2 on LLaMA-2-7B (W4+S50%)

GQSA meaningfully reduces end-to-end latency and raises throughput on GPUs used in tests.

Numbers: LLaMA-7B W4S50%: 343.43 tps vs FP16 92.69 tps (≈3.7×); latency at sequence length 1024 reduced ~4× vs FP16

Combining sparsity and quantization gives better trade-offs than using either alone under extreme compression.

Numbers: LLaMA-2-7B WikiText2 PPLs: W2 36.43, S60% 25.76, W4S50% 10.64

GQSA training overhead is modest for mid-size models.

Numbers: BQPO + E2E-OQP on LLaMA-2-7B: memory 9.3→7.6 GB, time ≈5.1 h and ≈4.2 h respectively on A100-40GB
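The finding that W4 plus 50% sparsity beats W2 is easier to read once both are put on a common storage budget. The sketch below estimates effective bits per original weight. Every parameter value here is an assumption for illustration (a group size of 128, 32 bits of scale/offset metadata per group, 16 bits of index per kept group); none of these come from the summary above, and row-pointer overhead is ignored.

```python
def effective_bits_per_weight(bits, sparsity, group_size,
                              meta_bits_per_group=32, index_bits_per_group=0):
    """Average stored bits per original weight.

    Only kept groups are stored; each kept group pays for its codes,
    its quantization metadata, and (if sparse) a column-group index.
    """
    kept_fraction = 1.0 - sparsity
    overhead_per_weight = (meta_bits_per_group + index_bits_per_group) / group_size
    return kept_fraction * (bits + overhead_per_weight)


# Assumed configurations, chosen only to compare like with like.
w2_dense = effective_bits_per_weight(bits=2, sparsity=0.0, group_size=128)
w4_s50 = effective_bits_per_weight(bits=4, sparsity=0.5, group_size=128,
                                   index_bits_per_group=16)

if __name__ == "__main__":
    print(f"W2 dense : {w2_dense:.4f} bits/weight")
    print(f"W4 + S50%: {w4_s50:.4f} bits/weight")
```

Under these assumed overheads the two configurations land at a similar footprint, which is why the perplexity gap reported above is a meaningful comparison rather than a comparison across very different model sizes.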

Results

Perplexity (WikiText2)

Value: W4S50% PPL 10.64 (LLaMA-2-7B)

Baseline: W2 PPL 36.43; S50% PPL 14.56

Latency (ms) - LLaMA-7B, seqlen 1024

Value: 3110.54 ms (W4+S50%)

Baseline: FP16 12561.82 ms; 2:4 pruning 4118.36 ms

Throughput (tokens/sec)

Value: 343.43 tps (LLaMA-7B W4S50%)

Baseline: FP16 92.69 tps

Accuracy

Value: avg +5.4% (LLaMA-2-7B) and +5.7% (LLaMA-2-13B)

Baseline: OmniQuant W2 per-group quantization
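The headline ratios quoted in this summary follow directly from the latency and throughput figures above. The short script below only redoes that arithmetic; the variable names are ours.

```python
# Reported LLaMA-7B figures from the Results section.
fp16_tps, gqsa_tps = 92.69, 343.43                           # tokens per second
fp16_ms, prune24_ms, gqsa_ms = 12561.82, 4118.36, 3110.54    # latency at seqlen 1024

throughput_gain = gqsa_tps / fp16_tps        # higher is better
latency_gain_vs_fp16 = fp16_ms / gqsa_ms     # how many times faster than FP16
latency_gain_vs_24 = prune24_ms / gqsa_ms    # how many times faster than 2:4 pruning

if __name__ == "__main__":
    print(f"throughput vs FP16: {throughput_gain:.2f}x")
    print(f"latency vs FP16   : {latency_gain_vs_fp16:.2f}x")
    print(f"latency vs 2:4    : {latency_gain_vs_24:.2f}x")
```

Note that the gain over 2:4 pruning is much smaller than the gain over FP16, so the choice of baseline matters when estimating savings for an existing deployment.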

Who Should Care

What To Try In 7 Days

Benchmark your 7B/13B model: measure FP16 latency and memory on your target GPU.

Run GQSA W4+S30–50% on a single model copy with the provided BQPO+E2E-OQP recipe and 4k calibration samples.

Integrate GQSKernel into a FastTransformer path and compare tokens/sec and end-to-end latency.
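For the benchmarking steps above, a small harness keeps the FP16 and compressed measurements comparable. This is a minimal sketch: `benchmark` and `fake_generate` are hypothetical names, and the `generate` callable is assumed to take a prompt and a token budget and return the number of tokens it produced. Swap in your own engine call.

```python
import statistics
import time


def benchmark(generate, prompt, max_new_tokens, warmup=1, runs=3):
    """Time repeated calls to `generate` and report latency and throughput."""
    for _ in range(warmup):
        generate(prompt, max_new_tokens)  # warm caches before timing

    latencies, tokens = [], []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_new_tokens)
        # On a GPU, synchronize the device inside `generate` before returning,
        # otherwise the clock stops before the kernels have finished.
        latencies.append(time.perf_counter() - start)
        tokens.append(produced)

    return {
        "latency_ms": statistics.median(latencies) * 1000.0,
        "tokens_per_second": sum(tokens) / sum(latencies),
    }


def fake_generate(prompt, max_new_tokens):
    """Placeholder workload so the harness runs without a model."""
    sum(i * i for i in range(20000))
    return max_new_tokens


if __name__ == "__main__":
    print(benchmark(fake_generate, "hello", max_new_tokens=64))
```

Run the same harness with identical prompts and output lengths for both the FP16 baseline and the compressed model, and record peak memory separately with your GPU tooling.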

Optimization Features

Token Efficiency

  • no changes to decoding token usage

Infra Optimization

  • compatible with A100/A800/RTX-class GPUs and FastTransformer integration

Model Optimization

  • per-group uniform INT4 quantization
  • group/BSR structured sparsity (row-wise groups)
  • Hessian-based salient-group selection
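The summary does not spell out the exact Hessian-based criterion, so the sketch below uses a common second-order proxy as an assumption: for a linear layer with squared-error loss, the diagonal of the Hessian with respect to one weight row is proportional to the mean squared input activation per column, and a group's saliency is the sum of w² · H_jj over its columns. All function names are hypothetical.

```python
def hessian_diag(calib_inputs):
    """Diagonal of 2 * E[x x^T] estimated from calibration inputs."""
    n_cols = len(calib_inputs[0])
    n = len(calib_inputs)
    return [2.0 * sum(x[j] ** 2 for x in calib_inputs) / n for j in range(n_cols)]


def group_saliency(row, h_diag, group_size):
    """Assumed score: sum of w_j^2 * H_jj over each group of columns."""
    return [
        sum(row[j] ** 2 * h_diag[j] for j in range(s, min(s + group_size, len(row))))
        for s in range(0, len(row), group_size)
    ]


def select_groups(row, h_diag, group_size, sparsity):
    """Return sorted indices of the groups to keep for this row."""
    scores = group_saliency(row, h_diag, group_size)
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    row = [1.0, 1.0, 0.1, 0.1]
    calib = [[0.01, 0.01, 100.0, 100.0]]   # columns 2-3 see large activations
    print(select_groups(row, hessian_diag(calib), group_size=2, sparsity=0.5))
```

In this toy case the second group is kept even though its weights are ten times smaller, because its inputs are far larger. That is the practical difference between a Hessian-aware score and plain magnitude pruning.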

System Optimization

  • task-centric thread-block mapping for GEMV
  • dequantize-in-register then use TensorCores/FMA for compute
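As a reference for what the kernel computes, here is a plain-Python GEMV over a BSR-like layout in which each stored block is one quantized group. It mirrors the dequantize-then-multiply-accumulate order described above but makes no attempt to model registers, TensorCores, or thread blocks. The array names (`row_ptr`, `group_idx`, and so on) are assumed for illustration.

```python
def gemv_group_sparse(row_ptr, group_idx, codes, scales, zeros, x, group_size):
    """Compute y = W x where W stores only kept, quantized groups per row.

    row_ptr[r]..row_ptr[r+1] indexes the kept groups of row r;
    group_idx[k] is the column-group position of stored group k.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col0 = group_idx[k] * group_size
            for j, code in enumerate(codes[k]):
                weight = code * scales[k] + zeros[k]   # dequantize on the fly
                acc += weight * x[col0 + j]            # multiply-accumulate
        y[r] = acc
    return y


if __name__ == "__main__":
    # 2 x 4 matrix, group_size 2. Row 0 keeps group 1; row 1 keeps both groups.
    row_ptr = [0, 1, 3]
    group_idx = [1, 0, 1]
    codes = [[1, 3], [0, 2], [4, 4]]
    scales = [0.5, 1.0, 0.25]
    zeros = [-1.0, 0.0, 0.0]
    print(gemv_group_sparse(row_ptr, group_idx, codes, scales, zeros,
                            x=[1.0, 2.0, 3.0, 4.0], group_size=2))
```

Pruned groups are never read, and kept groups are read as 4-bit codes, which is where the memory-traffic savings for decoding come from.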

Training Optimization

  • BQPO: block-wise weight optimization (5 epochs typical)
  • E2E-OQP: freeze weights and fine-tune quantization params (2 epochs typical)
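The idea behind the second stage, keeping the integer weights fixed and adjusting only the quantization parameters, can be shown with a much simpler stand-in. The sketch below refits one group's scale and offset by closed-form least squares against the original weights. This is not E2E-OQP itself, which per the summary tunes the parameters end to end over epochs; the function names and the per-group least-squares objective are our assumptions.

```python
def refit_quant_params(weights, codes, scale, zero):
    """Least-squares fit of weights ~ scale * code + zero with codes frozen."""
    n = len(codes)
    mean_q = sum(codes) / n
    mean_w = sum(weights) / n
    var_q = sum((q - mean_q) ** 2 for q in codes)
    if var_q > 0:
        cov = sum((q - mean_q) * (w - mean_w) for q, w in zip(codes, weights))
        scale = cov / var_q
    # If all codes are equal, the scale is unidentifiable; keep the old one.
    zero = mean_w - scale * mean_q
    return scale, zero


def reconstruction_error(weights, codes, scale, zero):
    return sum((w - (scale * q + zero)) ** 2 for w, q in zip(weights, codes))


if __name__ == "__main__":
    w = [0.10, 0.35, 0.60, 0.85]
    q = [0, 1, 2, 3]                 # frozen integer codes
    before = reconstruction_error(w, q, scale=0.3, zero=0.0)
    s, z = refit_quant_params(w, q, scale=0.3, zero=0.0)
    after = reconstruction_error(w, q, s, z)
    print(before, after, s, z)
```

Because only two numbers per group change, the tunable parameter count is a small fraction of the weight count, which is consistent with the modest training overhead reported in Key Findings.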

Inference Optimization

  • GQSKernel with task-centric parallelism
  • Stream-K style fine-grained CTA decomposition to avoid stragglers
  • grouped quantized storage (gguf) to reduce memory access
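To illustrate why a Stream-K style decomposition avoids stragglers, the sketch below flattens all kept groups into one task list and hands each worker an equal contiguous share, even when that share crosses row boundaries. It models only the partitioning; the worker here stands in for a CTA, the function name is hypothetical, and the reduction of partial sums for rows split across workers is left out.

```python
def stream_k_partition(groups_per_row, n_workers):
    """Split sum(groups_per_row) unit tasks evenly across workers.

    Returns, per worker, a list of (row, first_group, end_group) segments,
    where end_group is exclusive and indices are local to the row.
    """
    total = sum(groups_per_row)
    base, extra = divmod(total, n_workers)
    assignments = []
    begin = 0            # global index of this worker's first task
    row, row_begin = 0, 0  # current row and the global index where it starts
    for w in range(n_workers):
        end = begin + base + (1 if w < extra else 0)
        segments = []
        pos = begin
        while pos < end:
            # Advance to the row that contains global task index `pos`.
            while pos >= row_begin + groups_per_row[row]:
                row_begin += groups_per_row[row]
                row += 1
            row_end = row_begin + groups_per_row[row]
            stop = min(end, row_end)
            segments.append((row, pos - row_begin, stop - row_begin))
            pos = stop
        assignments.append(segments)
        begin = end
    return assignments


if __name__ == "__main__":
    # One heavy row and two light rows, two workers.
    for worker, segs in enumerate(stream_k_partition([6, 1, 1], n_workers=2)):
        load = sum(e - s for _, s, e in segs)
        print(worker, segs, "load =", load)
```

With one row per worker, the heavy row would take six units while the others take one each. With the flattened split, both workers process four units, at the cost of a small fix-up step to combine partial results for the shared row.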

Reproducibility

Data Urls

  • WikiText2
  • C4

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Does not address activation quantization as primary focus.
  • Not yet evaluated on models above ~30B; paper notes no results >100B.
  • Requires custom kernel and GPU-friendly block format; gains depend on engine support.
  • Performance degrades past ~60% sparsity in experiments.

When Not To Use

  • You need exact FP16/FP32 parity for critical tasks.
  • You cannot modify inference engine or add custom kernels on target hardware.
  • Your model is >100B parameters and untested with GQSA.

Failure Modes

  • Accuracy collapse when structured sparsity exceeds ~60–80% in some models.
  • No engine support: sparse-block format may not accelerate on all GPUs.
  • Combining with very low-bit activation quantization may still harm accuracy.

Core Entities

Models

  • LLaMA
  • LLaMA-2
  • LLaMA-3
  • LLaMA-3.1
  • Qwen2.5
  • OPT

Metrics

  • perplexity
  • accuracy
  • latency_ms
  • tokens_per_second
  • memory_GB

Datasets

  • WikiText2
  • C4

Benchmarks

  • PIQA
  • ARC
  • HellaSwag
  • Winogrande
  • lm-eval