Run accuracy-preserving 6:8 sparse LLMs on current GPUs and get ~1.33× faster inference without retraining

Overview

Production Readiness

0.75

Novelty Score

0.55

Cost Impact Score

0.65

Authors

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei

Why It Matters For Business

SlideSparse lets teams deploy accuracy-preserving sparsity patterns (e.g., 6:8) and gain real GPU acceleration on existing hardware, reducing latency and compute cost without retraining.

Summary TLDR

SlideSparse is a systems technique that makes milder structured sparsity patterns (the (2N-2):2N family like 6:8) run on existing NVIDIA Sparse Tensor Cores by decomposing each sparse block into overlapping 2:4 windows. It fuses the required activation rearrangement into per-token quantization, adding near-zero overhead. On real models and GPUs (A100/H100/B200/RTX) SlideSparse preserves accuracy (6:8 ~51.6% vs 54.0% dense on Qwen3) while delivering kernel and end-to-end speedups near the theoretical N/(N-1) bound (6:8 → ~1.33× end-to-end on A100 INT8). Code and conversion tools are provided.
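The N/(N-1) bound follows from simple arithmetic: each block of 2N weights becomes N-1 windows of 4 slots, so the K dimension grows by γ = 4(N-1)/2N = (2N-2)/N, while the 2:4 sparse GEMM is ideally 2× faster than dense. A minimal sketch of that calculation is below; the function names are illustrative, and the 2× figure is the idealized Sparse Tensor Core gain, not a measurement.

```python
from fractions import Fraction


def expansion_factor(n):
    """gamma = (2N-2)/N: N-1 windows of 4 slots replace one block of 2N weights."""
    return Fraction(4 * (n - 1), 2 * n)


def speedup_bound(n, sparse_gain=2):
    """Ideal speedup over dense, assuming a 2:4 GEMM is `sparse_gain` times faster."""
    return sparse_gain / expansion_factor(n)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        pattern = f"{2 * n - 2}:{2 * n}"
        print(pattern, "gamma =", expansion_factor(n),
              "bound =", f"{float(speedup_bound(n)):.2f}x")
```

For 6:8 (N=4) this gives γ = 3/2 and a bound of 4/3 ≈ 1.33×, which is the figure quoted for A100 INT8 end-to-end.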

Problem Statement

Current NVIDIA Sparse Tensor Cores accelerate only the rigid 2:4 pattern (50% sparsity), which often breaks LLM accuracy. Practitioners prefer milder structured sparsity (e.g., 6:8, which prunes 25% of weights) because it keeps accuracy, but it currently gets no hardware acceleration. The gap forces a trade-off: either accept a large accuracy loss for a 2× speedup, or keep accuracy and get no speedup.

Main Contribution

Empirical gap: show 2:4 often collapses LLM reasoning accuracy while 6:8 preserves near-dense accuracy (Qwen3: 2:4 15.3% vs dense 54.0%; 6:8 51.6%).

Sliding Window Decomposition: a provably lossless transform that maps any (2N-2):2N block into N-1 overlapping 2:4 windows with optimal expansion γ=(2N-2)/N.

System design and implementation: offline packer + fused quantization-slide Triton kernel + cuSPARSELt backend integrated into vLLM.

Large-scale validation: kernel and end-to-end benchmarks across five precisions and six GPUs showing speedups near the theoretical N/(N-1) bound and efficiency often ≥100% versus native 2:4.
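The sliding-window decomposition can be illustrated on a single block. The sketch below is an assumption about how such a transform could work, not the paper's packer: windows of 4 slots slide with stride 2 over a block of 2N weights, and a greedy left-to-right pass gives each window at most 2 of the not-yet-placed nonzeros. The helper names `slide_block` and `lift_activations` are hypothetical. Activations are duplicated with the same window layout, so the dot product is unchanged.

```python
import numpy as np


def slide_block(w_block):
    """Map a block of 2N weights (at most 2N-2 nonzeros) to N-1 windows of 4
    slots, each holding at most 2 nonzeros (i.e. each window is 2:4)."""
    w_block = np.asarray(w_block)
    size = w_block.shape[0]
    if size % 2 != 0 or size < 4:
        raise ValueError("block length must be an even number >= 4")
    n_windows = size // 2 - 1
    out = np.zeros((n_windows, 4), dtype=w_block.dtype)
    taken = np.zeros(size, dtype=bool)
    for i in range(n_windows):
        start, used = 2 * i, 0
        # Greedy: place the leftmost unassigned nonzeros first.
        for j in range(start, start + 4):
            if used == 2:
                break
            if w_block[j] != 0 and not taken[j]:
                out[i, j - start] = w_block[j]
                taken[j] = True
                used += 1
    if np.any((w_block != 0) & ~taken):
        raise ValueError("block has more than 2N-2 nonzeros")
    return out.reshape(-1)


def lift_activations(x_block):
    """Repeat activations in the same overlapping-window layout as the weights."""
    x_block = np.asarray(x_block)
    n_windows = x_block.shape[0] // 2 - 1
    return np.concatenate([x_block[2 * i: 2 * i + 4] for i in range(n_windows)])


if __name__ == "__main__":
    w = np.array([1, 2, 3, 0, 0, 4, 5, 6])      # one 6:8 block
    x = np.array([7, 1, 2, 9, 3, 5, 4, 8])
    w_slid, x_lift = slide_block(w), lift_activations(x)
    assert w_slid @ x_lift == w @ x              # lossless
    print(w_slid.reshape(-1, 4))                 # three 2:4 windows
```

The greedy pass is enough here because the two outermost element pairs are each covered by only one window, and any contiguous run of pairs is covered by windows with sufficient total capacity whenever the block has at least two zeros.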

Key Findings

Milder structured sparsity (6:8) preserves reasoning accuracy while 2:4 destroys it.

Numbers: Qwen3 reasoning: dense 54.0% → 6:8 51.6% vs 2:4 15.3%

SlideSparse attains theoretical speedup limits for (2N-2):2N on Sparse Tensor Cores.

Numbers: Qwen2.5-7B 6:8 end-to-end on A100 (INT8) = 1.33× (theoretical N/(N-1))

Kernel-level gains carry into real serving with small extra cost.

Numbers: Kernel 6:8 on A100 INT8: 1.41×; E2E prefill 6:8 on A100 INT8: 1.29–1.34×

The activation rearrangement overhead is small when fused with quantization.

Numbers: Fused quant+slide overhead ∆ = 7–32 µs (29–53% longer than quant-only), but GEMM dominates, so overhead is <3% of total in 1

SlideSparse often exceeds expected performance vs native 2:4 because of baseline inefficiencies.

Numbers: Efficiency >100% reported (e.g., A100 INT8 6:8: 115%; H100 and B200 also >100%)

Results

Accuracy

Value: dense 54.0%, 6:8 51.6%, 2:4 15.3%

Baseline: dense

End-to-end speedup (Qwen2.5-7B, A100, INT8, prefill)

Value: 1.33×

Baseline: dense cuBLASLt

Kernel speedup (6:8, A100 INT8, M=16384)

Value: 1.41–1.42×

Baseline: dense cuBLASLt

Fused quant+slide overhead (per-row)

Value: absolute ∆ = 7–32 µs; relative overhead 29–53% vs quant-only

Baseline: quant-only kernel

Algorithmic efficiency vs native 2:4

Value: often ≥100% (e.g., A100 INT8 6:8 = 115%; B200 higher due to baseline)

Baseline: cuSPARSELt 2:4

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

What To Try In 7 Days

Convert one large model checkpoint to (2N-2):2N masks (e.g., 6:8) using the offline packer and test accuracy vs dense.

Enable SlideSparse backend in vLLM and run end-to-end prefill on a representative workload (M≥4096) to measure throughput change.

Profile quant+slide overhead vs GEMM on your target GPU to confirm break-even M and expected speedup.
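For the profiling step, a simple cost model can turn measured timings into a break-even M. The sketch below assumes a linear layer's time is the quantization time plus the GEMM time, and that SlideSparse adds a fused-kernel overhead ∆ on top of quant-only; the function names and the timings in the demo are placeholders, not measurements from the paper.

```python
def net_speedup(t_dense_us, t_sparse_us, t_quant_us, delta_us):
    """Speedup of (quant + slide + sparse GEMM) over (quant + dense GEMM)."""
    dense_total = t_quant_us + t_dense_us
    slide_total = t_quant_us + delta_us + t_sparse_us
    return dense_total / slide_total


def first_beneficial_m(measurements):
    """Return the smallest M whose net speedup exceeds 1.0, or None.

    `measurements` maps M -> (t_dense_us, t_sparse_us, t_quant_us, delta_us),
    all taken on the target GPU for the same layer shape.
    """
    for m in sorted(measurements):
        if net_speedup(*measurements[m]) > 1.0:
            return m
    return None


if __name__ == "__main__":
    # Placeholder timings in microseconds; replace with your own profiles.
    timings = {
        64: (10.0, 12.0, 5.0, 7.0),
        256: (40.0, 38.0, 6.0, 8.0),
        4096: (600.0, 450.0, 10.0, 20.0),
    }
    print("break-even M:", first_beneficial_m(timings))
```

Feeding in per-M timings from your own profiler shows directly where the fused overhead stops outweighing the sparse GEMM gain.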

Optimization Features

Infra Optimization

works across A100/H100/B200/RTX GPUs and multiple precisions (INT8/FP8/BF16/FP16/FP4)

Model Optimization

structured (2N-2):2N pruning (e.g., 6:8, 4:6)
lossless sliding-window decomposition for weight layout

System Optimization

offline weight packer (PyTorch/CUDA) to prepare model at load time
Triton fused kernels to reduce memory traffic
minimal vLLM integration via quantization interface

Training Optimization

paper uses post-hoc magnitude pruning; notes sparse-aware training could help
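Post-hoc magnitude pruning to a 6:8 pattern can be sketched in a few lines of NumPy: in every group of 8 consecutive weights, zero the 2 with the smallest magnitude. This is a generic illustration, not the paper's conversion tool; the function name `prune_groups` and the assumption that groups run along the input (K) dimension of an (out_features, in_features) matrix are mine.

```python
import numpy as np


def prune_groups(w, keep=6, group=8):
    """Keep the `keep` largest-magnitude weights in each group of `group`
    consecutive entries along the last axis; zero the rest."""
    w = np.asarray(w)
    rows, k = w.shape
    if k % group != 0:
        raise ValueError("K must be a multiple of the group size")
    g = w.reshape(rows, k // group, group)
    # Indices of the (group - keep) smallest magnitudes in every group.
    drop = np.argsort(np.abs(g), axis=-1)[..., : group - keep]
    mask = np.ones(g.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    pruned = np.where(mask, g, 0).reshape(rows, k)
    return pruned, mask.reshape(rows, k)


if __name__ == "__main__":
    w = np.array([[1.0, -8.0, 3.0, 0.5, -6.0, 7.0, 2.0, -4.0]])
    pruned, mask = prune_groups(w)
    print(pruned)  # the two smallest magnitudes (0.5 and 1.0) are zeroed
```

Passing `keep=4, group=6` gives the 4:6 pattern from the same family.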

Inference Optimization

convert (2N-2):2N blocks into overlapping 2:4 windows
fused quantization + activation lifting kernel to avoid extra memory passes
use cuSPARSELt for 2:4 sparse GEMM
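The fused quantization + lifting step can be mimicked on the CPU to see what the kernel has to produce. The sketch below is a NumPy stand-in, not the Triton kernel: it applies symmetric per-token (per-row) INT8 quantization and writes the result directly in the overlapping-window layout via a precomputed gather index. The names `window_index` and `quant_slide`, and the absmax/127 scaling rule, are assumptions for illustration.

```python
import numpy as np


def window_index(k, block=8):
    """Column indices that lay out K activations as overlapping 4-wide windows
    (stride 2) inside each block of `block` elements."""
    if k % block != 0:
        raise ValueError("K must be a multiple of the block size")
    n_windows = block // 2 - 1
    offsets = np.concatenate(
        [np.arange(2 * i, 2 * i + 4) for i in range(n_windows)]
    )
    starts = np.arange(0, k, block)
    return (starts[:, None] + offsets[None, :]).reshape(-1)


def quant_slide(x, block=8):
    """Per-row symmetric INT8 quantization, emitted in the lifted layout."""
    x = np.asarray(x, dtype=np.float32)
    absmax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)  # avoid divide-by-zero
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q[:, window_index(x.shape[1], block)], scale


if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal((3, 16))
    q, scale = quant_slide(x)
    print(q.shape, scale.shape)  # K grows from 16 to 24 for 6:8
```

Doing the gather in the same pass as quantization is what keeps the extra cost to additional writes (γ times the quantized width) rather than a second read of the activations.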

Reproducibility

Code Urls

https://github.com/bcacdwk/vllmbench

Code Available

Open Source Status

partial

Risks & Boundaries

Limitations

Evaluated on post-hoc magnitude-pruned models; sparse-aware training might be needed for higher sparsity.
Small M (batch×seq) workloads often do not benefit because sparse kernel overhead dominates (M < ~256).
Some GPUs/drivers show precision or API gaps (FP16/FP4 errors, RTX irregularities, B200 INT8 baseline anomalies).
Transformation expands K-dimension by γ=(2N-2)/N, increasing temporary memory writes during quantization.

When Not To Use

Very low M workloads (small batches or single-token short contexts) where GEMM is not compute-bound.
On GPU platforms where cuSPARSELt or the drivers lack stable sparse-kernel support for your target precision.
When you rely on a pruning recipe that must remain strictly 2:4 (and you accept its accuracy loss).

Failure Modes

Driver or library bugs (cuSPARSELt/cuBLASLt/Triton) can produce large performance variance or errors.
If quantization/packing fails, the fused kernel can raise illegal-address or index-overflow errors on extreme shapes.
If the baseline dense kernels are suboptimal or change with driver updates, apparent speedups may shrink.

Core Entities

Models

Qwen3
Qwen2.5-7B
Qwen2.5-14B
Llama-3.2-1B
Llama-3.2-3B
BitNet-1.58-2B

Metrics

speedup ratio vs dense (cuBLASLt)
efficiency vs 2:4 (cuSPARSELt)
Accuracy

Datasets

reasoning benchmarks (aggregate used in Qwen3 eval)

Benchmarks

kernel GEMM speedup (various M)
end-to-end prefill throughput
end-to-end decode throughput