Combine per-group 4-bit quantization with GPU-friendly group sparsity to speed LLM decoding with small accuracy loss.

Overview

Decision Snapshot: Ready For Pilot

GQSA pairs known ideas (group quantization + structured sparsity) with a new two-stage optimizer and a custom kernel; experiments across LLaMA/Qwen/OPT show consistent speed and memory benefits, but GPU kernel integration and >100B scaling are untested.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Chao Zeng, Songwei Liu, Shu Yang, Fangmin Chen, Lean Fu, Xing Mei

Links

Abstract / PDF / Data

Why It Matters For Business

GQSA delivers multi× inference speedups and big memory savings for 7B–30B class LLMs while often preserving or improving zero-shot accuracy, enabling cheaper, faster serving on GPUs and more viable edge deployment.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Founder

Summary TLDR

GQSA is a compression + engine co-design that pairs per-group weight quantization (INT4) with structured group sparsity and a custom GPU kernel. It uses two optimization stages (block-wise BQPO then global E2E-OQP tuning of quant params) and a task-centric parallel engine. On LLaMA / Qwen / OPT family tests, GQSA (W4 + 40–50% sparsity) often matches or beats W2 and 2:4 baselines in accuracy while cutting latency and increasing throughput (examples: ~3.7× tokens/sec vs FP16 and up to ~4× latency reduction vs FP16 on LLaMA-7B in some settings).
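As a rough back-of-envelope check on the memory side of this claim, the sketch below estimates the average number of stored bits per original weight for W4 plus 50% group sparsity. The group size of 16, the FP16 scale, the 4-bit offset, and the 32-bit index per kept group are all illustrative assumptions, not values taken from the paper.

```python
def effective_bits_per_weight(weight_bits=4, sparsity=0.5, group_size=16,
                              scale_bits=16, offset_bits=4, index_bits=32):
    """Average stored bits per original weight.

    All overhead sizes are assumptions for illustration only.
    """
    kept_fraction = 1.0 - sparsity
    # Each kept group stores its codes plus one scale, one offset, one index.
    bits_per_kept_group = (group_size * weight_bits
                           + scale_bits + offset_bits + index_bits)
    return kept_fraction * bits_per_kept_group / group_size


bits = effective_bits_per_weight()
compression_vs_fp16 = 16 / bits
print(f"{bits:.3f} bits/weight, {compression_vs_fp16:.2f}x smaller than FP16")
```

Under these assumed overheads the estimate comes to 3.625 bits per weight, roughly 4.4× smaller than FP16. Larger groups amortize the per-group metadata further; smaller groups do the opposite.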

Problem Statement

Decoding in LLMs (many GEMV operations) is the inference bottleneck. Existing semi-structured sparsity (2:4) is tied to NVIDIA TensorCore shapes and incompatible with weight-only quantization, limiting speedups and memory gains. There is a need for a sparsity+quantization scheme that: accelerates GEMV, works with weight-only per-group quantization, and balances accuracy vs compression.

Main Contribution

Introduce GQSA: group (row-wise) sparsity compatible with weight-only per-group quantization and stored in BSR block format.

Two-stage optimization: BQPO (block-wise weight recovery) and E2E-OQP (freeze weights, fine-tune quantization params) to restore accuracy under heavy compression.

Key Findings

GQSA preserves accuracy better than heavy quantization or 2:4 pruning on evaluated models.

Numbers: avg +5.4% acc vs OmniQuant W2 on LLaMA-2-7B (W4+S50%)

Practical Use: You can reach higher compression (W4 + 40–50% sparsity) while matching or improving zero-shot accuracy relative to some W2 baselines on LLaMA-family models.

Evidence Ref: Section 4.3 / Table 3

GQSA meaningfully reduces end-to-end latency and raises throughput on GPUs used in tests.

Numbers: LLaMA-7B W4S50: 343.43 tps vs FP16 92.69 tps (≈3.7×); latency for 1024-token output reduced ~4× vs FP16

Practical Use: Expect multi× inference speedups and lower memory use when serving mid-size LLMs with the GQSKernel and FastTransformer integration.

Evidence Ref: Section 4.4 / Table 13 and Table 16

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Perplexity (WikiText2)	W4S50% PPL 10.64 (LLaMA-2-7B)	W2 PPL 36.43; S50% PPL 14.56	W4S50 much lower (better) than W2 and better than S50%	WikiText2 / LLaMA-2-7B	Table 10	Table 10
Latency (ms) - LLaMA-7B, seqlen 1024	3110.54 ms (W4+S50%)	FP16 12561.82 ms; 2:4 pruning 4118.36 ms	≈4.0× faster than FP16; ≈1.32× faster than 2:4 in this setting	FastTransformer / A800-40GB	Table 16 and Table 4	Table 16, Table 4
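The speedup factors quoted in Key Findings and in the table above can be rechecked directly from the reported raw figures. The snippet only redoes the arithmetic; the variable names are illustrative.

```python
# Raw figures as reported for LLaMA-7B (W4 + S50%).
fp16_tps, gqsa_tps = 92.69, 343.43
fp16_ms, sparse_2_4_ms, gqsa_ms = 12561.82, 4118.36, 3110.54

throughput_gain = gqsa_tps / fp16_tps        # tokens/sec vs FP16
latency_gain_fp16 = fp16_ms / gqsa_ms        # latency vs FP16
latency_gain_2_4 = sparse_2_4_ms / gqsa_ms   # latency vs 2:4 pruning

print(f"throughput vs FP16: {throughput_gain:.2f}x")
print(f"latency vs FP16:    {latency_gain_fp16:.2f}x")
print(f"latency vs 2:4:     {latency_gain_2_4:.2f}x")
```

The ratios come out at about 3.71×, 4.04×, and 1.32×, matching the rounded deltas listed.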

What To Try In 7 Days

Benchmark your 7B/13B model: measure FP16 latency and memory on your target GPU.

Run GQSA W4+S30–50% on a single model copy with the provided BQPO+E2E-OQP recipe and 4k calibration samples.

Integrate GQSKernel into a FastTransformer path and compare tokens/sec and end-to-end latency.
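For the baseline and comparison steps above, a small timing harness keeps the FP16 and compressed runs comparable. This is a minimal sketch: `generate` is a hypothetical callable standing in for whatever decoding entry point your engine exposes, and on a GPU it is assumed to block until the output is ready (otherwise synchronize the device before the timer is read).

```python
import statistics
import time


def benchmark(generate, n_tokens, warmup=2, runs=5):
    """Time generate(n_tokens); report median latency and tokens/sec."""
    for _ in range(warmup):            # warm caches before measuring
        generate(n_tokens)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(n_tokens)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    median_ms = statistics.median(latencies_ms)
    return {
        "latency_ms": median_ms,
        "tokens_per_second": n_tokens / (median_ms / 1000.0),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real decoding call.
    fake_generate = lambda n: time.sleep(0.001 * n)
    print(benchmark(fake_generate, n_tokens=16))
```

Run it once against the FP16 model and once against the compressed model with the same `n_tokens`, then compare the two dictionaries.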

Optimization Features

Token Efficiency

no changes to decoding token usage

Infra Optimization

compatible with A100/A800/RTX-class GPUs and FastTransformer integration

Model Optimization

per-group uniform INT4 quantization

group/BSR structured sparsity (row-wise groups)

Hessian-based salient-group selection
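The interaction between group sparsity and per-group quantization can be sketched in a few lines. This is an illustration only, not the paper's implementation: squared-magnitude ranking stands in for the Hessian-based saliency, the group size of 4 is arbitrary, and the tuple layout `(group_index, codes, scale, lo)` is a hypothetical stand-in for a BSR-style store.

```python
def quantize_group(values, bits=4):
    """Uniform asymmetric quantization of one group: codes, scale, offset."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0   # constant group: any scale works
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    return [lo + c * scale for c in codes]


def compress_row(row, group_size=4, sparsity=0.5):
    """Prune the least salient groups of a row, then quantize the survivors.

    Assumes len(row) is a multiple of group_size.
    """
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    n_keep = max(1, round(len(groups) * (1.0 - sparsity)))
    # Saliency stand-in: sum of squared weights (the source uses a Hessian-based score).
    ranked = sorted(range(len(groups)),
                    key=lambda g: -sum(v * v for v in groups[g]))
    kept = sorted(ranked[:n_keep])
    return [(g, *quantize_group(groups[g])) for g in kept]


row = [0.1, -0.2, 0.05, 0.0, 1.0, -1.5, 0.7, 0.3,
       0.01, 0.02, -0.01, 0.0, 2.0, 1.0, -2.0, 0.5]
for g, codes, scale, lo in compress_row(row):
    print(g, codes, [round(v, 3) for v in dequantize_group(codes, scale, lo)])
```

Because whole groups are either kept or dropped, every surviving group still carries its own scale and offset, which is what lets sparsity and weight-only group quantization coexist in one storage format.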

System Optimization

task-centric thread-block mapping for GEMV

dequantize-in-register then use TensorCores/FMA for compute
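The dequantize-then-accumulate pattern can be illustrated on the CPU with a plain loop. This is a sketch of the idea, not GQSKernel: the per-row list of `(group_index, codes, scale, lo)` tuples is a hypothetical storage layout, and the inner loop plays the role that registers and FMA units play on a GPU.

```python
def sparse_quant_gemv(rows, x, group_size):
    """Compute y = W @ x from kept, quantized groups only.

    rows[i] is a list of (group_index, codes, scale, lo) for row i;
    pruned groups are absent and therefore never read.
    """
    y = []
    for kept_groups in rows:
        acc = 0.0
        for group_index, codes, scale, lo in kept_groups:
            start = group_index * group_size
            for j, code in enumerate(codes):
                weight = lo + scale * code      # dequantize right before use
                acc += weight * x[start + j]    # multiply-accumulate
        y.append(acc)
    return y


rows = [
    [(0, [0, 15], 0.1, -1.0)],   # row 0 keeps group 0 only
    [(1, [3, 7], 0.5, 0.0)],     # row 1 keeps group 1 only
]
print(sparse_quant_gemv(rows, x=[1.0, 2.0, 3.0, 4.0], group_size=2))
```

The full-precision weights never exist in memory: only the 4-bit codes and per-group metadata are read, which is where the memory-bandwidth saving for GEMV comes from.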

Training Optimization

BQPO: block-wise weight optimization (5 epochs typical)

E2E-OQP: freeze weights and fine-tune quantization params (2 epochs typical)
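The second stage's core idea, keeping the integer weights frozen and adjusting only the quantization parameters, can be shown on a single group. The sketch below is a deliberately simplified stand-in: it refits one scale in closed form by least squares, whereas the source describes end-to-end fine-tuning over the whole model. The function name and the data are hypothetical.

```python
def refit_scale(codes, lo, inputs, targets):
    """Least-squares scale for frozen integer codes.

    Model: y ~= sum_j x_j * (lo + scale * codes_j). Only `scale` is free.
    """
    numerator = denominator = 0.0
    for x, y in zip(inputs, targets):
        offset_part = lo * sum(x)                             # fixed contribution
        slope = sum(xj * cj for xj, cj in zip(x, codes))      # d(output)/d(scale)
        numerator += (y - offset_part) * slope
        denominator += slope * slope
    return numerator / denominator if denominator else 0.0


# Toy calibration data generated from a known scale.
codes, lo, true_scale = [0, 5, 15, 9], -1.0, 0.13
inputs = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 1.0]]
targets = [sum(xj * (lo + true_scale * c) for xj, c in zip(x, codes))
           for x in inputs]
print(refit_scale(codes, lo, inputs, targets))
```

Since the codes never change, the sparsity pattern and storage layout stay fixed while the output error is reduced, which is the property the frozen-weight stage relies on.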

Inference Optimization

GQSKernel with task-centric parallelism

Stream-K style fine-grained CTA decomposition to avoid stragglers

grouped quantized storage (gguf) to reduce memory access
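The straggler problem arises because rows keep different numbers of groups after pruning, so a one-row-per-worker mapping gives uneven work. A Stream-K style split instead divides the total work units into equal contiguous ranges that may cross row boundaries. The following is a CPU-side sketch of that partitioning idea only, with hypothetical names; it is not the kernel's actual scheduling code.

```python
def stream_k_partition(groups_per_row, num_workers):
    """Assign equal contiguous ranges of kept groups to each worker.

    Returns, per worker, a list of (row, first_group, end_group) segments.
    A row split across workers needs its partial sums reduced afterwards.
    """
    total = sum(groups_per_row)
    assignments = []
    row, row_start = 0, 0                     # row_start: flat index of row's first unit
    for worker in range(num_workers):
        begin = total * worker // num_workers
        end = total * (worker + 1) // num_workers
        segments, pos = [], begin
        while pos < end:
            # Advance to the row that contains flat position `pos`.
            while pos >= row_start + groups_per_row[row]:
                row_start += groups_per_row[row]
                row += 1
            stop = min(end, row_start + groups_per_row[row])
            segments.append((row, pos - row_start, stop - row_start))
            pos = stop
        assignments.append(segments)
    return assignments


# Row 0 kept 5 groups, row 1 kept 1, row 2 kept 2; two workers get 4 units each.
print(stream_k_partition([5, 1, 2], num_workers=2))
```

With a per-row mapping the first worker would process five groups while another processes one; here both workers handle four units, at the cost of one extra reduction for the row that is split.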

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

WikiText2, C4

Risks & Boundaries

Limitations

Activation quantization is not a primary focus of the method.

Not yet evaluated on models above ~30B; paper notes no results >100B.

When Not To Use

You need exact FP16/FP32 parity for critical tasks.

You cannot modify inference engine or add custom kernels on target hardware.

Failure Modes

Accuracy collapse when structured sparsity exceeds ~60–80% in some models.

Missing engine support: without a kernel that understands the sparse-block format, it may not deliver speedups on all GPUs.

Core Entities

Models

LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, Qwen2.5, OPT

Metrics

perplexity, accuracy, latency_ms, tokens_per_second, memory_GB

Datasets

WikiText2, C4

Benchmarks

PIQA, ARC, HellaSwag, Winogrande, lm-eval
