Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
GQSA delivers roughly 3–4× inference speedups over FP16 in the reported LLaMA-7B settings, plus large memory savings for 7B–30B class LLMs, while often matching or beating W2 and 2:4 baselines on zero-shot accuracy. This makes GPU serving cheaper and faster and edge deployment more viable.
Summary TLDR
GQSA co-designs a compression scheme and an inference engine: it pairs per-group weight quantization (INT4) with structured group sparsity and a custom GPU kernel. Optimization runs in two stages, block-wise BQPO followed by global E2E-OQP tuning of the quantization parameters, and inference uses a task-centric parallel engine. In tests on the LLaMA, Qwen, and OPT families, GQSA (W4 + 40–50% sparsity) often matches or beats W2 and 2:4 baselines in accuracy while cutting latency and raising throughput (examples on LLaMA-7B in some settings: ~3.7× tokens/sec and up to ~4× lower latency, both vs FP16).
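To make the memory side of the W4 + sparsity trade-off concrete, here is a back-of-the-envelope sketch. The group size (16), per-group metadata (32 bits for scale and offset), and per-group index cost (16 bits) are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def fp16_gb(n_params: float) -> float:
    """Dense FP16 weight footprint in GB (2 bytes per parameter)."""
    return n_params * 2 / 1e9


def compressed_gb(
    n_params: float,
    bits: int = 4,                   # weight bit-width (W4)
    sparsity: float = 0.5,           # fraction of groups pruned
    group_size: int = 16,            # ASSUMPTION: weights per group
    meta_bits_per_group: int = 32,   # ASSUMPTION: scale + offset storage
    index_bits_per_group: int = 16,  # ASSUMPTION: sparse index cost
) -> float:
    """Rough weight footprint in GB for group-sparse, per-group-quantized weights."""
    kept_groups = n_params * (1.0 - sparsity) / group_size
    bits_per_group = group_size * bits + meta_bits_per_group + index_bits_per_group
    return kept_groups * bits_per_group / 8 / 1e9


if __name__ == "__main__":
    n = 7e9  # a 7B-parameter model
    print(f"FP16: {fp16_gb(n):.2f} GB, compressed: {compressed_gb(n):.2f} GB")
```

The estimate covers weights only. Activations, KV cache, and any layers left uncompressed are outside this calculation.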
Problem Statement
LLM decoding, which is dominated by GEMV operations, is the inference bottleneck. Existing semi-structured sparsity (2:4) is tied to NVIDIA TensorCore shapes and is incompatible with weight-only quantization, which limits both speedups and memory gains. What is needed is a sparsity + quantization scheme that accelerates GEMV, works with weight-only per-group quantization, and balances accuracy against compression.
Main Contribution
Introduce GQSA: group (row-wise) sparsity compatible with weight-only per-group quantization and stored in BSR block format.
Two-stage optimization: BQPO (block-wise weight recovery) and E2E-OQP (freeze weights, fine-tune quantization params) to restore accuracy under heavy compression.
Custom GPU inference engine with task-centric parallelism and GQSKernel to balance load and accelerate GEMV for decoding.
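The storage idea in the first contribution can be sketched in a few lines. This is a minimal pure-Python illustration of a BSR-like layout with 1 × group_size blocks and the matching GEMV, assuming a precomputed keep mask. The helper names are hypothetical and this is not the paper's kernel.

```python
from typing import List, Sequence, Tuple


def to_group_bsr(
    W: Sequence[Sequence[float]],
    group_size: int,
    keep_mask: Sequence[Sequence[bool]],
) -> Tuple[List[int], List[int], List[List[float]]]:
    """Pack a dense matrix into a BSR-like layout with 1 x group_size blocks.

    keep_mask[r][g] says whether group g of row r survives pruning.
    Returns (row_ptr, col_idx, blocks), as in CSR but with blocks as values.
    """
    row_ptr, col_idx, blocks = [0], [], []
    for r, row in enumerate(W):
        for g in range(len(row) // group_size):
            if keep_mask[r][g]:
                col_idx.append(g)  # group (block-column) index
                blocks.append(list(row[g * group_size:(g + 1) * group_size]))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bsr_gemv(row_ptr, col_idx, blocks, x, group_size):
    """y = W_sparse @ x, touching only the kept groups."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            start = col_idx[k] * group_size
            acc += sum(w * x[start + j] for j, w in enumerate(blocks[k]))
        y.append(acc)
    return y


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    mask = [[True, False], [False, True]]
    rp, ci, bl = to_group_bsr(W, 2, mask)
    print(rp, ci, bsr_gemv(rp, ci, bl, [1.0, 1.0, 1.0, 1.0], 2))
```

Because every kept block is a contiguous run of weights, a block's weights could share one scale and offset, which is what lets this kind of sparsity coexist with per-group quantization.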
Key Findings
GQSA preserves accuracy better than heavy quantization or 2:4 pruning on evaluated models.
GQSA meaningfully reduces end-to-end latency and raises throughput on GPUs used in tests.
Combining sparsity and quantization gives better trade-offs than using either alone under extreme compression.
GQSA training overhead is modest for mid-size models.
Results
Perplexity (WikiText2)
Latency (ms) - LLaMA-7B, seqlen 1024
Throughput (tokens/sec)
Accuracy
Who Should Care
What To Try In 7 Days
Benchmark your 7B/13B model: measure FP16 latency and memory on your target GPU.
Run GQSA W4+S30–50% on a single model copy with the provided BQPO+E2E-OQP recipe and 4k calibration samples.
Integrate GQSKernel into a FastTransformer path and compare tokens/sec and end-to-end latency.
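For the baseline and comparison steps above, a small timing harness is enough to get started. This sketch assumes a hypothetical `generate(prompt, new_tokens)` callable that wraps your own model. It reports the median of several runs using the same metric names listed under Core Entities.

```python
import time
from statistics import median
from typing import Callable, Dict


def benchmark(
    generate: Callable[[str, int], object],  # hypothetical wrapper around your model
    prompt: str,
    new_tokens: int,
    warmup: int = 2,
    runs: int = 5,
) -> Dict[str, float]:
    """Measure median end-to-end latency and decode throughput."""
    for _ in range(warmup):  # warm caches and lazy initialization
        generate(prompt, new_tokens)
    latencies = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate(prompt, new_tokens)
        latencies.append(time.perf_counter() - t0)
    lat = median(latencies)
    return {"latency_ms": lat * 1e3, "tokens_per_second": new_tokens / lat}


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Throughput ratio of candidate over baseline (above 1.0 means faster)."""
    return candidate["tokens_per_second"] / baseline["tokens_per_second"]
```

On a GPU, `generate` has to block until the device has finished (for example by synchronizing before returning), otherwise the timer only measures kernel launch time.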
Optimization Features
Token Efficiency
- no changes to decoding token usage
Infra Optimization
- compatible with A100/A800/RTX-class GPUs and FastTransformer integration
Model Optimization
- per-group uniform INT4 quantization
- group/BSR structured sparsity (row-wise groups)
- Hessian-based salient-group selection
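As a reference point for the first item, here is a minimal sketch of per-group uniform INT4 quantization in min-offset form (each group stores integer codes plus a scale and an offset). The exact parameterization used by GQSA is not given in this summary, so treat the form below as an assumption. The Hessian-based group selection is not modeled.

```python
from typing import List, Sequence, Tuple


def quantize_group(weights: Sequence[float], n_bits: int = 4) -> Tuple[List[int], float, float]:
    """Uniform asymmetric quantization of one group to n_bits codes.

    Returns (codes, scale, offset) such that w ~= code * scale + offset.
    """
    qmax = (1 << n_bits) - 1  # 15 for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:  # constant group: any scale works, codes are all 0
        scale = 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: Sequence[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate weights from codes and group parameters."""
    return [q * scale + offset for q in codes]


if __name__ == "__main__":
    group = [-1.0, 0.0, 0.5, 2.0]
    codes, scale, offset = quantize_group(group)
    print(codes, dequantize_group(codes, scale, offset))
```

The reconstruction error per weight is bounded by half the scale, so smaller groups (narrower value ranges) trade extra metadata for lower error.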
System Optimization
- task-centric thread-block mapping for GEMV
- dequantize-in-register then use TensorCores/FMA for compute
Training Optimization
- BQPO: block-wise weight optimization (5 epochs typical)
- E2E-OQP: freeze weights and fine-tune quantization params (2 epochs typical)
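The second stage can be pictured with a toy example: integer codes stay frozen and only the group's scale and offset are adjusted to match full-precision outputs on calibration inputs. This is a single-group, pure-Python sketch under that assumption, with hypothetical function names. The actual E2E-OQP procedure operates end to end on the whole model.

```python
from typing import List, Sequence, Tuple


def mse(codes, scale, offset, inputs, targets) -> float:
    """Mean squared error between dequantized-weight outputs and targets."""
    preds = [sum((q * scale + offset) * x for q, x in zip(codes, xs)) for xs in inputs]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def tune_quant_params(
    codes: Sequence[int],
    scale: float,
    offset: float,
    inputs: Sequence[Sequence[float]],
    targets: Sequence[float],
    steps: int = 200,
) -> Tuple[float, float]:
    """Gradient descent on (scale, offset) only; `codes` are never modified."""
    # Per-sample reductions: prediction = scale * a + offset * c.
    a = [sum(q * x for q, x in zip(codes, xs)) for xs in inputs]
    c = [sum(xs) for xs in inputs]
    n = len(inputs)
    # The Hessian trace bounds its largest eigenvalue, so 1/trace is a safe step.
    trace = 2.0 * (sum(v * v for v in a) + sum(v * v for v in c)) / n
    lr = 1.0 / trace if trace > 0 else 0.0
    for _ in range(steps):
        errs = [scale * ai + offset * ci - t for ai, ci, t in zip(a, c, targets)]
        scale -= lr * 2.0 * sum(e * ai for e, ai in zip(errs, a)) / n
        offset -= lr * 2.0 * sum(e * ci for e, ci in zip(errs, c)) / n
    return scale, offset
```

Only two scalars per group are trained here, which is consistent with the summary's point that this stage is cheap relative to weight optimization.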
Inference Optimization
- GQSKernel with task-centric parallelism
- Stream-K style fine-grained CTA decomposition to avoid stragglers
- grouped quantized storage (gguf) to reduce memory access
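The load-balancing idea behind the Stream-K style decomposition can be shown on the host side. In this sketch the unit of work is one kept block, the total is split evenly across a fixed number of CTAs regardless of row boundaries, and each CTA receives (row, start, stop) segments. The function name and the choice of work unit are illustrative assumptions, and the fix-up that sums partial results for rows shared between CTAs is not shown.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def streamk_partition(row_ptr: Sequence[int], num_ctas: int) -> List[List[Tuple[int, int, int]]]:
    """Split row_ptr[-1] block-units evenly across CTAs, ignoring row boundaries.

    row_ptr is the BSR/CSR row pointer array. Each CTA gets a list of
    (row, start, stop) segments covering a contiguous range of block indices.
    """
    total = row_ptr[-1]
    per_cta = -(-total // num_ctas)  # ceiling division
    plan = []
    for cta in range(num_ctas):
        start = min(cta * per_cta, total)
        end = min(start + per_cta, total)
        segments = []
        pos = start
        while pos < end:
            # Largest row index whose range begins at or before pos (skips empty rows).
            row = bisect_right(row_ptr, pos) - 1
            stop = min(end, row_ptr[row + 1])
            segments.append((row, pos, stop))
            pos = stop
        plan.append(segments)
    return plan


if __name__ == "__main__":
    # Row 0 is heavy (7 blocks), row 2 is empty: a row-per-CTA mapping would straggle.
    for cta, segs in enumerate(streamk_partition([0, 7, 8, 8, 12], 3)):
        print(cta, segs)
```

With a row-per-CTA mapping the heavy row would determine total runtime. Splitting by work units gives every CTA the same number of blocks, at the cost of a reduction step for rows that span two CTAs.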
Reproducibility
Data Urls
- WikiText2
- C4
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Activation quantization is not a primary focus.
- Evaluated only on models up to roughly the 30B class; the paper reports no results above 100B.
- Requires custom kernel and GPU-friendly block format; gains depend on engine support.
- Performance degrades past ~60% sparsity in experiments.
When Not To Use
- You need exact FP16/FP32 parity for critical tasks.
- You cannot modify inference engine or add custom kernels on target hardware.
- Your model is >100B parameters and untested with GQSA.
Failure Modes
- Accuracy collapse when structured sparsity exceeds ~60–80% in some models.
- Without engine support, the sparse-block format may not yield speedups on every GPU.
- Combining with very low-bit activation quantization may still harm accuracy.
Core Entities
Models
- LLaMA
- LLaMA-2
- LLaMA-3
- LLaMA-3.1
- Qwen2.5
- OPT
Metrics
- perplexity
- accuracy
- latency_ms
- tokens_per_second
- memory_GB
Datasets
- WikiText2
- C4
Benchmarks
- PIQA
- ARC
- HellaSwag
- Winogrande
- lm-eval

