Run accuracy-preserving 6:8 sparse LLMs on current GPUs and get a ~1.33× inference speedup without retraining

March 5, 2026 · 8 min

Overview

Production Readiness

0.75

Novelty Score

0.55

Cost Impact Score

0.65

Citation Count

0

Authors

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei

Links

Abstract / PDF

Why It Matters For Business

SlideSparse lets teams deploy accuracy-preserving sparsity patterns (e.g., 6:8) and gain real GPU acceleration on existing hardware, reducing latency and compute cost without retraining.

Summary TLDR

SlideSparse is a systems technique that makes milder structured sparsity patterns (the (2N-2):2N family like 6:8) run on existing NVIDIA Sparse Tensor Cores by decomposing each sparse block into overlapping 2:4 windows. It fuses the required activation rearrangement into per-token quantization, adding near-zero overhead. On real models and GPUs (A100/H100/B200/RTX) SlideSparse preserves accuracy (6:8 ~51.6% vs 54.0% dense on Qwen3) while delivering kernel and end-to-end speedups near the theoretical N/(N-1) bound (6:8 → ~1.33× end-to-end on A100 INT8). Code and conversion tools are provided.

Problem Statement

NVIDIA Sparse Tensor Cores accelerate only the rigid 2:4 pattern (50% sparsity), which often breaks LLM accuracy. Practitioners prefer milder structured sparsity (e.g., 6:8, which prunes 25% of weights) that keeps accuracy but currently gets no hardware acceleration. The gap forces a trade-off: either accept a large accuracy loss for up to 2× speedup, or keep accuracy and get no speedup.

Main Contribution

Empirical gap: the paper shows that 2:4 often collapses LLM reasoning accuracy while 6:8 stays close to dense accuracy (Qwen3: dense 54.0%, 6:8 51.6%, 2:4 15.3%).

Sliding Window Decomposition: a provably lossless transform that maps any (2N-2):2N block into N-1 overlapping 2:4 windows with optimal expansion γ=(2N-2)/N.
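To make the decomposition concrete, here is a minimal pure-Python sketch for a single block. It assumes the N-1 windows are 4 wide and start every 2 positions (so they overlap by 2), and it uses a simple left-to-right greedy assignment; both choices are illustrative assumptions, and the function names `slide_decompose` and `lift_activations` are hypothetical, not taken from the paper's packer.

```python
def slide_decompose(block):
    """Split one (2N-2):2N weight block into N-1 overlapping windows of width 4,
    each holding at most 2 nonzeros (i.e., each window is 2:4-compliant)."""
    two_n = len(block)
    assert two_n % 2 == 0 and two_n >= 4
    num_windows = two_n // 2 - 1
    assigned = [False] * two_n
    windows = []
    for j in range(num_windows):
        start = 2 * j                      # assumed stride of 2 between windows
        window = [0.0] * 4
        taken = 0
        for offset in range(4):
            pos = start + offset
            if block[pos] != 0 and not assigned[pos] and taken < 2:
                window[offset] = block[pos]  # each weight lands in exactly one window
                assigned[pos] = True
                taken += 1
        windows.append(window)
    leftover = [p for p in range(two_n) if block[p] != 0 and not assigned[p]]
    if leftover:
        raise ValueError("block has more than 2N-2 nonzeros; cannot decompose")
    return windows


def lift_activations(x):
    """Replicate activations so that window j sees x[2j : 2j+4]."""
    return [x[2 * j: 2 * j + 4] for j in range(len(x) // 2 - 1)]


def windowed_dot(windows, lifted):
    """Dot product computed over the expanded (windowed) layout."""
    return sum(w * a for win, act in zip(windows, lifted) for w, a in zip(win, act))


# A 6:8 block (2 zeros out of 8) and a matching activation slice.
block = [0.5, 0.0, -1.0, 2.0, 0.0, 3.0, -0.25, 1.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
windows = slide_decompose(block)
dense = sum(w * a for w, a in zip(block, x))
print(windows, dense, windowed_dot(windows, lift_activations(x)))
```

The windowed dot product equals the dense one because every nonzero weight is placed in exactly one window at the offset that lines up with its own activation. The expanded layout has 3 × 4 = 12 entries for 8 original positions, which matches γ = (2N-2)/N = 1.5 for N = 4.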

System design and implementation: offline packer + fused quantization-slide Triton kernel + cuSPARSELt backend integrated into vLLM.

Large-scale validation: kernel and end-to-end benchmarks across five precisions and six GPUs showing speedups near the theoretical N/(N-1) bound and efficiency often ≥100% versus native 2:4.

Key Findings

Milder structured sparsity (6:8) preserves reasoning accuracy while 2:4 destroys it.

Numbers: Qwen3 reasoning: dense 54.0% → 6:8 51.6% vs 2:4 15.3%

SlideSparse attains theoretical speedup limits for (2N-2):2N on Sparse Tensor Cores.

Numbers: Qwen2.5-7B 6:8 end-to-end on A100 (INT8) = 1.33× (theoretical N/(N-1))
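The N/(N-1) bound follows from the expansion factor under one assumption: that a 2:4 sparse GEMM runs at an ideal 2× over a dense GEMM of the same K dimension. A short derivation under that assumption:

```latex
% N-1 windows of width 4 replace a block of 2N positions:
\gamma = \frac{4\,(N-1)}{2N} = \frac{2N-2}{N}

% Ideal 2x sparse throughput applied to a K dimension that is \gamma times longer:
S_{\max} = \frac{2}{\gamma} = \frac{2N}{2N-2} = \frac{N}{N-1}

% Examples:
% 6:8 \;(N=4):\quad \gamma = 1.5,\quad S_{\max} = 4/3 \approx 1.33
% 4:6 \;(N=3):\quad \gamma = 4/3,\quad S_{\max} = 3/2 = 1.5
```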

Kernel-level gains carry into real serving with small extra cost.

Numbers: Kernel 6:8 on A100 INT8: 1.41×; E2E prefill 6:8 on A100 INT8: 1.29–1.34×

The activation rearrangement overhead is small when fused with quantization.

Numbers: Fused quant+slide overhead ∆ = 7–32 µs (29–53% longer than quant-only), but GEMM dominates so overhead <3% of total in 1
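The idea of fusing the rearrangement into quantization can be illustrated with a scalar sketch: once the per-token scale is known, each quantized value is written straight into its lifted position(s), so no separate rearrangement pass is needed. The real system does this in a Triton kernel; the pure-Python function below is only a reference model, and its name `quantize_and_lift`, the symmetric INT8 scheme with scale max|x|/127, and the stride-2 window layout are all assumptions made for illustration.

```python
def quantize_and_lift(row, block=8):
    """Per-token symmetric INT8 quantization that writes directly into the
    lifted (window-expanded) layout. Reference model only, not a real kernel."""
    assert len(row) % block == 0
    amax = max(abs(v) for v in row)              # per-token reduction
    scale = amax / 127.0 if amax > 0 else 1.0
    windows_per_block = block // 2 - 1           # N-1 windows per 2N block
    lifted = []
    for base in range(0, len(row), block):
        for j in range(windows_per_block):
            for offset in range(4):              # window j covers [2j, 2j+4)
                v = row[base + 2 * j + offset]
                q = max(-127, min(127, round(v / scale)))
                lifted.append(q)                 # quantize and place in one step
    return lifted, scale


row = [1.0, -2.0, 2.0, 4.0, 0.0, 3.0, -1.0, 127.0]
lifted, scale = quantize_and_lift(row)
print(len(row), len(lifted), scale)
```

For a block of 8 the output has 12 entries, so the extra cost is the additional writes for the duplicated activations, which is consistent with the overhead being small relative to the GEMM.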

SlideSparse often exceeds expected performance vs native 2:4 because of baseline inefficiencies.

Numbers: Efficiency >100% reported (e.g., A100 INT8 6:8: 115%; H100 and B200 also >100%)

Results

Accuracy

Value: dense 54.0%, 6:8 51.6%, 2:4 15.3%

Baseline: dense

End-to-end speedup (Qwen2.5-7B, A100, INT8, prefill)

Value: 1.33×

Baseline: dense cuBLASLt

Kernel speedup (6:8, A100 INT8, M=16384)

Value: 1.41–1.42×

Baseline: dense cuBLASLt

Fused quant+slide overhead (per-row)

Value: absolute ∆ = 7–32 µs; relative overhead 29–53% vs quant-only

Baseline: quant-only kernel

Algorithmic efficiency vs native 2:4

Value: often ≥100% (e.g., A100 INT8 6:8 = 115%; B200 higher due to baseline)

Baseline: cuSPARSELt 2:4

Who Should Care

What To Try In 7 Days

Convert one large model checkpoint to (2N-2):2N masks (e.g., 6:8) using the offline packer and test accuracy vs dense.

Enable SlideSparse backend in vLLM and run end-to-end prefill on a representative workload (M≥4096) to measure throughput change.

Profile quant+slide overhead vs GEMM on your target GPU to confirm break-even M and expected speedup.
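Once you have profiled numbers, a back-of-the-envelope model helps interpret them. The sketch below assumes a deliberately simple cost model: sparse time = dense GEMM time / kernel speedup + a fixed extra overhead. The function names are hypothetical, and the model ignores that the kernel speedup itself varies with M, so treat it as a sanity check rather than a predictor.

```python
def net_speedup(dense_gemm_us, kernel_speedup, overhead_us):
    """Speedup of one layer after paying the extra quant+slide overhead."""
    sparse_total_us = dense_gemm_us / kernel_speedup + overhead_us
    return dense_gemm_us / sparse_total_us


def break_even_dense_gemm_us(kernel_speedup, overhead_us):
    """Dense GEMM time above which the sparse path wins.
    From T / s + o < T  it follows that  T > o * s / (s - 1)."""
    if kernel_speedup <= 1.0:
        raise ValueError("kernel_speedup must exceed 1 for a break-even point to exist")
    return overhead_us * kernel_speedup / (kernel_speedup - 1.0)


# Plug in your own measurements; these inputs reuse figures quoted above
# (1.41x kernel speedup, worst-case 32 us overhead) purely as placeholders.
print(break_even_dense_gemm_us(1.41, 32.0))
print(net_speedup(1000.0, 1.41, 32.0))
```

If your dense GEMM time per layer sits close to the break-even value, the workload is likely in the small-M regime where the sparse path does not pay off.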

Optimization Features

Infra Optimization

  • works across A100/H100/B200/RTX GPUs and multiple precisions (INT8/FP8/BF16/FP16/FP4)

Model Optimization

  • structured (2N-2):2N pruning (e.g., 6:8, 4:6)
  • lossless sliding-window decomposition for weight layout

System Optimization

  • offline weight packer (PyTorch/CUDA) to prepare model at load time
  • Triton fused kernels to reduce memory traffic
  • minimal vLLM integration via quantization interface

Training Optimization

  • paper uses post-hoc magnitude pruning; notes sparse-aware training could help
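As a reference for what post-hoc magnitude pruning to a (2N-2):2N pattern means, here is a minimal sketch: within each group of 2N consecutive weights, the 2 smallest by absolute value are set to zero. The function name `magnitude_prune_row` and the tie-breaking rule (lower index pruned first) are illustrative assumptions, not the paper's recipe.

```python
def magnitude_prune_row(row, keep=6, group=8):
    """Zero the (group - keep) smallest-magnitude weights in each group."""
    assert len(row) % group == 0 and 0 < keep <= group
    pruned = list(row)
    for base in range(0, len(row), group):
        # Indices of this group ordered by ascending magnitude (stable sort).
        order = sorted(range(base, base + group), key=lambda i: abs(row[i]))
        for i in order[: group - keep]:
            pruned[i] = 0.0
    return pruned


row = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.6, 0.8]
print(magnitude_prune_row(row))            # 6:8 pattern
print(magnitude_prune_row(row[:6], 4, 6))  # 4:6 pattern
```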

Inference Optimization

  • convert (2N-2):2N blocks into overlapping 2:4 windows
  • fused quantization + activation lifting kernel to avoid extra memory passes
  • use cuSPARSELt for 2:4 sparse GEMM

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluated on post-hoc magnitude-pruned models; sparse-aware training might be needed for higher sparsity.
  • Small M (batch×seq) workloads often do not benefit because sparse kernel overhead dominates (M < ~256).
  • Some GPUs/drivers show precision or API gaps (FP16/FP4 errors, RTX irregularities, B200 INT8 baseline anomalies).
  • Transformation expands K-dimension by γ=(2N-2)/N, increasing temporary memory writes during quantization.

When Not To Use

  • Very low M workloads (small batches or single-token short contexts) where GEMM is not compute-bound.
  • On GPU platforms where cuSPARSELt or drivers lack stable sparse precision support for your precision.
  • When you rely on a pruning recipe that must remain strictly 2:4 (and you accept its accuracy loss).

Failure Modes

  • Driver or library bugs (cuSPARSELt/cuBLASLt/Triton) can produce large performance variance or errors.
  • If quantization or packing fails, the fused kernel can raise illegal-address or index-overflow errors on extreme shapes.
  • If the baseline dense kernels are suboptimal or change with driver updates, apparent speedups may shrink.

Core Entities

Models

  • Qwen3
  • Qwen2.5-7B
  • Qwen2.5-14B
  • Llama-3.2-1B
  • Llama-3.2-3B
  • BitNet-1.58-2B

Metrics

  • speedup ratio vs dense (cuBLASLt)
  • efficiency vs 2:4 (cuSPARSELt)
  • Accuracy

Datasets

  • reasoning benchmarks (aggregate used in Qwen3 eval)

Benchmarks

  • kernel GEMM speedup (various M)
  • end-to-end prefill throughput
  • end-to-end decode throughput