Combine per-group 4-bit quantization with GPU-friendly group sparsity to speed LLM decoding with small accuracy loss.

December 23, 2024 | 7 min

Overview

Decision Snapshot: Ready For Pilot

GQSA pairs known ideas (group quantization + structured sparsity) with a new two-stage optimizer and a custom kernel; experiments across LLaMA/Qwen/OPT show consistent speed and memory benefits, but GPU kernel integration and >100B scaling are untested.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Lean Fu, Xing Mei

Links

Abstract / PDF / Data

Why It Matters For Business

GQSA delivers multi× inference speedups and big memory savings for 7B–30B class LLMs while often preserving or improving zero-shot accuracy, enabling cheaper, faster serving on GPUs and more viable edge deployment.

Who Should Care

Summary TLDR

GQSA is a compression and inference-engine co-design that pairs per-group INT4 weight quantization with structured group sparsity and a custom GPU kernel. It optimizes in two stages (block-wise BQPO, then global E2E-OQP tuning of the quantization parameters) and runs on a task-centric parallel engine. In tests on the LLaMA, Qwen, and OPT families, GQSA (W4 + 40–50% sparsity) often matches or beats W2 and 2:4 baselines in accuracy while cutting latency and raising throughput (examples: ~3.7× tokens/sec vs FP16 and up to ~4× latency reduction vs FP16 on LLaMA-7B in some settings).
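As a rough illustration of what "W4 + 40–50% sparsity" could mean for storage, the sketch below estimates average stored bits per original weight. Only the 4-bit width and the sparsity levels come from the summary; the group size, the FP16 scale, the 4-bit zero point, and the 16-bit block index per surviving group are assumptions chosen for illustration, not values reported by the paper.

```python
def bits_per_weight(weight_bits=4, sparsity=0.5, group_size=16,
                    scale_bits=16, zero_bits=4, index_bits=16):
    """Average stored bits per original weight under group sparsity + group quantization.

    Every default except weight_bits and sparsity is an illustrative assumption.
    """
    kept = 1.0 - sparsity  # fraction of groups that survive pruning
    # Each surviving group stores its codes plus per-group metadata.
    per_group = group_size * weight_bits + scale_bits + zero_bits + index_bits
    return kept * per_group / group_size


def compression_ratio(**kwargs):
    """Ratio of FP16 storage (16 bits per weight) to the compressed storage."""
    return 16.0 / bits_per_weight(**kwargs)


if __name__ == "__main__":
    for s in (0.4, 0.5):
        print(f"W4 + S{round(s * 100)}%: "
              f"{bits_per_weight(sparsity=s):.2f} bits/weight, "
              f"{compression_ratio(sparsity=s):.2f}x smaller than FP16")
```

Per-group metadata is amortized over the group, so larger groups lower the overhead but give the quantizer and the pruning step coarser control.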

Problem Statement

Decoding in LLMs (many GEMV operations) is the inference bottleneck. Existing semi-structured sparsity (2:4) is tied to NVIDIA TensorCore shapes and incompatible with weight-only quantization, limiting speedups and memory gains. There is a need for a sparsity+quantization scheme that: accelerates GEMV, works with weight-only per-group quantization, and balances accuracy vs compression.
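A short worked calculation shows why GEMV-heavy decoding responds to weight compression: each weight is read once and used for a single multiply-add, so weight traffic dominates. The helper below is a sketch that counts only weight bytes (activation and output traffic are ignored as a simplifying assumption).

```python
def gemv_arithmetic_intensity(rows, cols, bits_per_weight=16.0):
    """FLOPs per byte of weight traffic for y = W @ x."""
    flops = 2 * rows * cols                          # one multiply and one add per weight
    weight_bytes = rows * cols * bits_per_weight / 8
    return flops / weight_bytes


if __name__ == "__main__":
    for name, bits in (("FP16", 16.0), ("INT4", 4.0)):
        print(name, gemv_arithmetic_intensity(4096, 4096, bits), "FLOP/byte")
```

At FP16 the ratio is 1 FLOP per byte regardless of matrix size, which is far below what modern GPUs can sustain per byte of memory bandwidth; fewer stored bits per weight raise the ratio proportionally.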

Main Contribution

Introduce GQSA: group (row-wise) sparsity that is compatible with weight-only per-group quantization and stored in BSR (block sparse row) format.

Two-stage optimization: BQPO (block-wise weight recovery) and E2E-OQP (freeze weights, fine-tune quantization params) to restore accuracy under heavy compression.
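The storage idea in the first contribution can be sketched in plain Python: prune whole groups within each row, quantize the surviving groups to 4-bit codes, and keep them in a BSR-style layout (row pointers, group column indices, per-group blocks). This is a toy model under stated assumptions, not the paper's implementation: the group size of 4, the squared-L2-norm saliency (standing in for the Hessian-based criterion), and the scale-plus-minimum quantization layout are all illustrative choices, and BQPO/E2E-OQP recovery is not modeled.

```python
import random

GROUP = 4    # group size along each row (assumed for illustration)
QMAX = 15    # largest unsigned 4-bit code


def quantize_group(vals):
    """Uniform 4-bit quantization of one group -> (codes, scale, minimum)."""
    lo, hi = min(vals), max(vals)
    scale = (hi - lo) / QMAX or 1.0                 # constant groups fall back to scale 1
    codes = [min(QMAX, round((v - lo) / scale)) for v in vals]
    return codes, scale, lo


def compress(weights, sparsity):
    """Prune low-saliency groups per row, quantize the rest, store BSR-style."""
    row_ptr, col_idx, blocks = [0], [], []
    for row in weights:
        groups = [row[i:i + GROUP] for i in range(0, len(row), GROUP)]
        keep = max(1, round(len(groups) * (1.0 - sparsity)))
        # Stand-in saliency: squared L2 norm of the group (an assumption).
        ranked = sorted(range(len(groups)),
                        key=lambda g: -sum(v * v for v in groups[g]))
        for g in sorted(ranked[:keep]):             # keep column order within the row
            col_idx.append(g)
            blocks.append(quantize_group(groups[g]))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def gemv(row_ptr, col_idx, blocks, x):
    """y = W_compressed @ x, dequantizing each surviving group on the fly."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            codes, scale, lo = blocks[k]
            base = col_idx[k] * GROUP               # first input index covered by this group
            acc += sum((c * scale + lo) * x[base + j] for j, c in enumerate(codes))
        y.append(acc)
    return y


if __name__ == "__main__":
    random.seed(0)
    W = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]
    x = [random.gauss(0, 1) for _ in range(16)]
    row_ptr, col_idx, blocks = compress(W, sparsity=0.5)
    print("kept groups per row:", [row_ptr[i + 1] - row_ptr[i] for i in range(4)])
    print("y =", [round(v, 3) for v in gemv(row_ptr, col_idx, blocks, x)])
```

Because pruned groups are never stored, the GEMV loop skips them entirely, which is where both the memory saving and the reduced work come from in this toy model.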

Key Findings

GQSA preserves accuracy better than heavy quantization or 2:4 pruning on evaluated models.

Numbers: avg +5.4% accuracy vs OmniQuant W2 on LLaMA-2-7B (W4+S50%)

Practical Use: You can reach higher compression (W4 + 40–50% sparsity) while matching or improving zero-shot accuracy relative to some W2 baselines on LLaMA-family models.

Evidence Ref: Section 4.3 / Table 3

GQSA meaningfully reduces end-to-end latency and raises throughput on GPUs used in tests.

Numbers: LLaMA-7B W4S50: 343.43 tokens/sec vs FP16 92.69 tokens/sec (≈3.7×); latency for 1024 output tokens reduced ~4× vs FP16

Practical Use: Expect multi× inference speedups and lower memory use when serving mid-size LLMs with the GQSKernel and FastTransformer integration.

Evidence Ref: Section 4.4 / Table 13 and Table 16

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Perplexity (WikiText2) | W4S50% PPL 10.64 (LLaMA-2-7B) | W2 PPL 36.43; S50% PPL 14.56 | W4S50 much lower (better) than W2 and better than S50% | WikiText2 / LLaMA-2-7B | Table 10 | Table 10
Latency (ms) - LLaMA-7B, seqlen 1024 | 3110.54 ms (W4+S50%) | FP16 12561.82 ms; 2:4 pruning 4118.36 ms | ≈4.0× faster than FP16; ≈1.32× faster than 2:4 in this setting | FastTransformer / A800-40GB | Table 16 and Table 4 | Table 16, Table 4
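The reported deltas can be re-derived from the raw figures quoted in this summary. The snippet below is a small consistency check; the `speedup` helper is a name introduced here for illustration.

```python
def speedup(baseline, candidate, higher_is_better=False):
    """Return how many times better `candidate` is than `baseline`."""
    return candidate / baseline if higher_is_better else baseline / candidate


# Figures copied from this summary (LLaMA-7B, W4+S50%).
throughput_gain = speedup(92.69, 343.43, higher_is_better=True)  # tokens/sec vs FP16
latency_gain_fp16 = speedup(12561.82, 3110.54)                    # ms vs FP16
latency_gain_2_4 = speedup(4118.36, 3110.54)                      # ms vs 2:4 pruning

print(f"throughput vs FP16: {throughput_gain:.2f}x")
print(f"latency vs FP16:    {latency_gain_fp16:.2f}x")
print(f"latency vs 2:4:     {latency_gain_2_4:.2f}x")
```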

What To Try In 7 Days

Benchmark your 7B/13B model: measure FP16 latency and memory on your target GPU.

Run GQSA W4+S30–50% on a single model copy with the provided BQPO+E2E-OQP recipe and 4k calibration samples.

Integrate GQSKernel into a FastTransformer path and compare tokens/sec and end-to-end latency.
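For the benchmarking steps above, a minimal timing harness could look like the following. `generate(prompt, new_tokens)` is a hypothetical stand-in for whatever decode entry point your engine exposes, and `fake_generate` exists only so the sketch runs without a GPU.

```python
import time
from statistics import median


def benchmark(generate, prompt, new_tokens, warmup=2, runs=5):
    """Time a blocking decode callable; report median latency and tokens/sec."""
    for _ in range(warmup):                 # discard cold-start runs
        generate(prompt, new_tokens)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, new_tokens)
        latencies.append(time.perf_counter() - start)
    latency_s = median(latencies)
    return {
        "latency_ms": latency_s * 1e3,
        "tokens_per_second": new_tokens / latency_s,
    }


def fake_generate(prompt, new_tokens):
    """Placeholder workload (hypothetical): sleeps in proportion to the token count."""
    time.sleep(0.0005 * new_tokens)


if __name__ == "__main__":
    print(benchmark(fake_generate, "hello", new_tokens=16))
```

Run the same harness against the FP16 baseline and the compressed model with identical prompts and output lengths. If the engine launches GPU work asynchronously, make sure the callable waits for the device to finish (for example with `torch.cuda.synchronize()` in PyTorch) before it returns, otherwise the timer measures only launch overhead.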

Optimization Features

Token Efficiency
no changes to decoding token usage
Infra Optimization
compatible with A100/A800/RTX-class GPUs and FastTransformer integration
Model Optimization
per-group uniform INT4 quantization; group/BSR structured sparsity (row-wise groups); Hessian-based salient-group selection
System Optimization
task-centric thread-block mapping for GEMV; dequantize-in-register then use TensorCores/FMA for compute
Training Optimization
BQPO: block-wise weight optimization (5 epochs typical); E2E-OQP: freeze weights and fine-tune quantization params (2 epochs typical)
Inference Optimization
GQSKernel with task-centric parallelism; Stream-K style fine-grained CTA decomposition to avoid stragglers; grouped quantized storage (gguf) to reduce memory access
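The "Stream-K style" decomposition mentioned above can be illustrated generically: instead of assigning whole output tiles to workers, the flattened iteration space is cut into equal contiguous shares, so no worker is left with a much larger tile than the others. The function below sketches that general idea on the CPU; it is not the GQSKernel's actual scheduling, and the function name and tile sizes are hypothetical.

```python
def stream_k_partition(work_per_tile, num_workers):
    """Split the flattened iteration space evenly across workers.

    Returns one list per worker of (tile, begin, end) segments. A tile that
    straddles a worker boundary appears under both workers, and its partial
    sums would need a final reduction.
    """
    total = sum(work_per_tile)
    # Prefix sums give each tile's start offset in the flat iteration space.
    starts, acc = [], 0
    for w in work_per_tile:
        starts.append(acc)
        acc += w
    plan = []
    for k in range(num_workers):
        lo = total * k // num_workers
        hi = total * (k + 1) // num_workers
        segments = []
        for tile, (s, w) in enumerate(zip(starts, work_per_tile)):
            b, e = max(lo, s), min(hi, s + w)   # overlap of this tile with the worker's share
            if b < e:
                segments.append((tile, b - s, e - s))
        plan.append(segments)
    return plan


if __name__ == "__main__":
    # Three tiles with uneven work (e.g. different numbers of surviving groups).
    for worker, segs in enumerate(stream_k_partition([5, 1, 6], num_workers=3)):
        print(worker, segs)
```

With tile-per-worker scheduling the third worker in this example would do six iterations while the second does one; with the even split each worker does four, at the cost of a reduction for tiles that are shared.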

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

WikiText2, C4

Risks & Boundaries

Limitations

Activation quantization is not a primary focus of the method.

Not yet evaluated on models above ~30B; paper notes no results >100B.

When Not To Use

You need exact FP16/FP32 parity for critical tasks.

You cannot modify inference engine or add custom kernels on target hardware.

Failure Modes

Accuracy collapse when structured sparsity exceeds ~60–80% in some models.

No engine support: without a kernel that understands the sparse-block format, it may not yield speedups on all GPUs.

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, Qwen2.5, OPT

Metrics

perplexity, Accuracy, latency_ms, tokens_per_second, memory_GB

Datasets

WikiText2, C4

Benchmarks

PIQA, ARC, HellaSwag, Winogrande, lm-eval