Overview
GQSA pairs known ideas (group quantization + structured sparsity) with a new two-stage optimizer and a custom kernel; experiments across LLaMA/Qwen/OPT show consistent speed and memory benefits, but GPU kernel integration and >100B scaling are untested.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
GQSA delivers severalfold inference speedups (about 4× lower latency than FP16 on LLaMA-7B in one reported setting) and large memory savings for 7B–30B-class LLMs, often while preserving or improving zero-shot accuracy. This enables cheaper, faster serving on GPUs and makes edge deployment more viable.
Summary TLDR
GQSA is a compression and inference-engine co-design that pairs per-group weight quantization (INT4) with structured group sparsity and a custom GPU kernel. Optimization runs in two stages: block-wise BQPO, followed by global E2E-OQP, which tunes only the quantization parameters. A task-centric parallel engine executes the compressed model. In tests on the LLaMA, Qwen, and OPT families, GQSA (W4 + 40–50% sparsity) often matches or beats W2 and 2:4 baselines in accuracy while cutting latency and increasing throughput (examples: ~3.7× tokens/sec vs FP16 and up to ~4× lower latency vs FP16 on LLaMA-7B in some settings).
Problem Statement
Decoding in LLMs is dominated by GEMV operations and is the inference bottleneck. Existing semi-structured sparsity (2:4) is tied to NVIDIA Tensor Core shapes and is incompatible with weight-only quantization, which limits both speedups and memory gains. What is needed is a combined sparsity and quantization scheme that accelerates GEMV, works with weight-only per-group quantization, and balances accuracy against compression.
Main Contribution
Introduce GQSA: group (row-wise) sparsity that is compatible with weight-only per-group quantization and stored in BSR (block sparse row) format.
Two-stage optimization: BQPO (block-wise weight recovery) and E2E-OQP (freeze weights, fine-tune quantization params) to restore accuracy under heavy compression.
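The combination of group sparsity and per-group INT4 quantization described above can be illustrated with a toy sketch. This is not the paper's implementation: the group size, the L2-norm pruning criterion, the asymmetric quantizer, and all function names are assumptions chosen for illustration, and the storage layout only imitates the idea of keeping non-zero blocks with their indices (as BSR does). The BQPO and E2E-OQP recovery stages are not modeled.

```python
# Toy sketch: group pruning + per-group INT4 quantization + sparse GEMV.
# All names and the pruning criterion are hypothetical, not taken from GQSA.

GROUP = 4  # hypothetical group size, kept tiny for readability


def split_groups(row, group=GROUP):
    """Split one weight row into contiguous groups of `group` elements."""
    return [row[i:i + group] for i in range(0, len(row), group)]


def prune_groups(row, sparsity, group=GROUP):
    """Keep the highest-L2-norm groups of a row (assumed saliency criterion).

    Returns a list of (group_index, values) for the surviving groups.
    """
    groups = split_groups(row, group)
    n_keep = max(1, round(len(groups) * (1.0 - sparsity)))
    ranked = sorted(range(len(groups)),
                    key=lambda g: sum(v * v for v in groups[g]),
                    reverse=True)
    return [(g, groups[g]) for g in sorted(ranked[:n_keep])]


def quantize_group(values, bits=4):
    """Asymmetric uniform quantization of one group to `bits`-bit codes."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant groups
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def compress_row(row, sparsity=0.5, group=GROUP):
    """Prune a row by groups, then quantize each surviving group."""
    return [(g, *quantize_group(vals))
            for g, vals in prune_groups(row, sparsity, group)]


def gemv(compressed_rows, x, group=GROUP):
    """y = W x, touching only the stored groups and dequantizing on the fly."""
    out = []
    for row in compressed_rows:
        acc = 0.0
        for g, codes, scale, zero in row:
            base = g * group
            for j, q in enumerate(codes):
                acc += (q * scale + zero) * x[base + j]
        out.append(acc)
    return out


if __name__ == "__main__":
    W = [[0.1, -0.2, 0.3, 0.0, 2.0, -1.0, 0.5, 1.5],
         [1.0, 1.0, 1.0, 1.0, 0.0, 0.1, 0.0, -0.1]]
    packed = [compress_row(r, sparsity=0.5) for r in W]
    print(gemv(packed, [1.0] * 8))
```

Because pruned groups are never stored, the GEMV loop skips them entirely, which is the intuition behind pairing group sparsity with a block-sparse layout; a real kernel would additionally pack two 4-bit codes per byte and parallelize across rows.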
Key Findings
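GQSA preserves accuracy better than heavy quantization or pruning alone on the evaluated models: on LLaMA-2-7B, W4S50% reaches a WikiText2 perplexity of 10.64, versus 36.43 for W2 and 14.56 for S50%.
GQSA reduces end-to-end latency and raises throughput on the GPUs used in tests: on LLaMA-7B at sequence length 1024, W4+S50% takes 3110.54 ms, versus 12561.82 ms for FP16 and 4118.36 ms for 2:4 pruning.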
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Perplexity (WikiText2) | 10.64 (W4S50%, LLaMA-2-7B) | W2: 36.43; S50%: 14.56 | 25.79 lower than W2; 3.92 lower than S50% (lower is better) | WikiText2 / LLaMA-2-7B | Table 10 | Table 10 |
| Latency (ms), LLaMA-7B, seqlen 1024 | 3110.54 ms (W4+S50%) | FP16: 12561.82 ms; 2:4 pruning: 4118.36 ms | ≈4.0× faster than FP16; ≈1.32× faster than 2:4 in this setting | FastTransformer / A800-40GB | Table 16, Table 4 | Table 16, Table 4 |
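The speedup figures in the latency row follow directly from the reported millisecond values. A short sketch of the arithmetic, where the helper names are illustrative:

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def latency_reduction_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Percentage of baseline latency removed by the candidate."""
    return 100.0 * (1.0 - candidate_ms / baseline_ms)


# Latencies reported in the table above (LLaMA-7B, sequence length 1024).
GQSA_MS = 3110.54      # W4 + S50%
FP16_MS = 12561.82
SPARSE_24_MS = 4118.36

print(f"vs FP16: {speedup(FP16_MS, GQSA_MS):.2f}x")       # 4.04x
print(f"vs 2:4:  {speedup(SPARSE_24_MS, GQSA_MS):.2f}x")  # 1.32x
print(f"latency cut vs FP16: {latency_reduction_pct(FP16_MS, GQSA_MS):.1f}%")
```

Note that a "4× speedup" and a "75% latency reduction" describe the same measurement, so the two should not be added together when estimating savings.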
What To Try In 7 Days
Benchmark your 7B/13B model: measure FP16 latency and memory on your target GPU.
Run GQSA W4+S30–50% on a single model copy with the provided BQPO+E2E-OQP recipe and 4k calibration samples.
Integrate GQSKernel into a FastTransformer path and compare tokens/sec and end-to-end latency.
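For the benchmarking steps above, a minimal timing harness might look like the following sketch. The `generate` callable, its signature, and `dummy_generate` are hypothetical stand-ins for whatever entry point your inference stack exposes; nothing here comes from the GQSA code. When timing real GPU work, the device must be synchronized before each timer reading, or the numbers will only reflect kernel launch time.

```python
import statistics
import time


def benchmark(generate, prompt, new_tokens, warmup=1, runs=3):
    """Time `generate(prompt, new_tokens)`; report median latency and throughput.

    `generate` is a hypothetical callable supplied by the caller.
    """
    for _ in range(warmup):          # warm-up runs are discarded
        generate(prompt, new_tokens)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, new_tokens)
        latencies.append(time.perf_counter() - start)

    median_s = statistics.median(latencies)
    return {
        "latency_ms": median_s * 1e3,
        "tokens_per_s": new_tokens / median_s,
    }


def dummy_generate(prompt, new_tokens):
    """Stand-in model: sleeps a fixed time per token instead of decoding."""
    time.sleep(0.0005 * new_tokens)
    return prompt + " ..." * new_tokens


if __name__ == "__main__":
    baseline = benchmark(dummy_generate, "hello", new_tokens=16)
    print(f"{baseline['latency_ms']:.1f} ms, "
          f"{baseline['tokens_per_s']:.0f} tokens/s")
```

Running the same harness against the FP16 and compressed models with identical prompts, sequence lengths, and batch sizes keeps the comparison like for like.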
Risks & Boundaries
Limitations
Activation quantization is not a primary focus of the method.
Not evaluated on models above ~30B; the paper reports no results at the >100B scale.
When Not To Use
You need exact FP16/FP32 parity for critical tasks.
You cannot modify the inference engine or add custom kernels on the target hardware.
Failure Modes
Accuracy can collapse when structured sparsity exceeds ~60–80% in some models.
Without engine support, the sparse-block format may not yield speedups on all GPUs.

