Flash-LLM: run sparsified LLMs on tensor cores with up to ~3–3.8× real inference speedups and lower GPU cost

September 19, 2023 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The work is implemented, integrated into FasterTransformer, and evaluated on real OPT models and A100 hardware, making it practically ready for teams that can prune models and run on NVIDIA tensor-core GPUs.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 65%

Authors

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

Links

Abstract / PDF / Code

Why It Matters For Business

Flash-LLM cuts inference GPU cost and increases throughput for production LLM serving by enabling practical unstructured sparsity on tensor cores.

Who Should Care

Summary TLDR

Flash-LLM is a GPU software library that makes unstructured-sparse LLM inference faster and cheaper by loading pruned weights from GPU global memory in a compact sparse format, reconstructing dense tiles on-chip, and computing on tensor cores. For common LLM MatMuls (skinny matrices), this reduces global memory bandwidth pressure and trades some extra on-chip work for much higher end-to-end throughput. Kernel-level tests show 2–3.6× speedups over state-of-the-art sparse libraries and roughly 1.4–2.1× over dense cuBLAS at 70–90% sparsity. On full OPT models, Flash-LLM raises tokens per GPU-second by up to 3.8× vs DeepSpeed and 3.6× vs FasterTransformer while using fewer GPUs. Code is available.

Problem Statement

Large generative models consume too much GPU memory and bandwidth for efficient single-GPU inference. Unstructured pruning preserves accuracy better than structured pruning, but existing sparse kernels do not exploit tensor cores, so they often run slower than dense libraries unless sparsity is extreme (>90%). This makes cost-effective, low-latency inference of large LLMs hard in practice.

Main Contribution

Load-as-Sparse / Compute-as-Dense method: read weights in sparse form, expand to dense on-chip, compute on tensor cores to avoid global memory bottleneck.

Tiled-CSL sparse format plus ahead-of-time reordering to reduce shared-memory bank conflicts and enable tile-by-tile sparse-to-dense transforms.
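The Load-as-Sparse / Compute-as-Dense idea can be illustrated with a small CPU model. This is only a sketch under stated assumptions: the function names, the tile size, and the (offset, value) pair layout are hypothetical and are not the Tiled-CSL format or the Flash-LLM kernel API. It shows nonzeros stored per tile, scattered into a zeroed dense tile buffer (a stand-in for the on-chip step), and then multiplied as a full dense tile (a stand-in for the tensor-core step).

```python
# Illustrative CPU model only; names and the (offset, value) layout are assumptions,
# not the Tiled-CSL format or the Flash-LLM kernel API.

TILE = 2  # tile edge; tiny so the example is easy to trace by hand


def encode_tiles(dense, tile=TILE):
    """Keep only the nonzeros of each tile as (offset_in_tile, value) pairs."""
    rows, cols = len(dense), len(dense[0])
    tiles = {}
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            nonzeros = []
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    if dense[r][c] != 0:
                        # Row-major position of the element inside its tile.
                        nonzeros.append(((r - r0) * tile + (c - c0), dense[r][c]))
            tiles[(r0, c0)] = nonzeros
    return tiles


def spmm_tiled(tiles, rows, cols, b, tile=TILE):
    """Compute A @ B, where A is given as sparse tiles and B is a dense list of rows."""
    n = len(b[0])
    out = [[0.0] * n for _ in range(rows)]
    for (r0, c0), nonzeros in tiles.items():
        # Stand-in for the on-chip step: scatter nonzeros into a zeroed dense tile.
        buf = [0.0] * (tile * tile)
        for offset, value in nonzeros:
            buf[offset] = value
        # Stand-in for the tensor-core step: multiply the full tile, zeros included.
        for i in range(min(tile, rows - r0)):
            for k in range(min(tile, cols - c0)):
                a_ik = buf[i * tile + k]
                for j in range(n):
                    out[r0 + i][j] += a_ik * b[c0 + k][j]
    return out


def dense_matmul(a, b):
    """Reference dense product used to check the tiled path."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[1, 0, 0], [0, 0, 2], [3, 0, 0]]  # 3x3 weight with 6 of 9 entries pruned
b = [[1, 2], [3, 4], [5, 6]]           # 3x2 activation
tiles = encode_tiles(a)
result = spmm_tiled(tiles, 3, 3, b)
```

The tiled path performs the same multiply-adds as the dense product, zeros included; the saving in this model comes only from how few weight entries have to be stored and read.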

Key Findings

Flash-LLM speeds up SpMM kernels by 2–3.6× over Sputnik and 1.4–1.6× over SparTA on average, depending on sparsity.

Numbers: avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90%

Practical Use: Use Flash-LLM kernels to get much faster sparse MatMul on A100 GPUs at realistic pruning levels (70–90%).

Evidence Ref: Abstract; Sec.6.1; Fig.9

Flash-LLM can outperform dense cuBLAS on skinny LLM MatMuls because memory bandwidth is the bottleneck.

Numbers: up to 2.1× vs cuBLAS at 90% sparsity; tensor-core utilization rose from ~10% to ~24–43%

Practical Use: When batch size (N) is small and MatMuls are skinny, prefer Flash-LLM over dense kernels to increase throughput.

Evidence Ref: Sec.3.2.1; Sec.6.2; Fig.10; Fig.11
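A back-of-the-envelope roofline calculation shows why a skinny MatMul is bandwidth-bound and why reading fewer weight bytes helps. This is a sketch under assumptions that are not taken from the summary: the 4096 × 4096 weight shape and N = 16 are hypothetical, the A100 figures are nominal datasheet values (312 TFLOP/s FP16 tensor-core peak, 1555 GB/s for the 40 GB part), and the sparse case assumes 4 bytes per stored nonzero (value plus position).

```python
# Nominal A100 figures, used here as assumptions for a rough roofline estimate.
A100_PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s
A100_BANDWIDTH = 1555e9    # global-memory bandwidth of the 40 GB part, bytes/s
RIDGE = A100_PEAK_FLOPS / A100_BANDWIDTH  # FLOP/byte needed to become compute-bound


def arithmetic_intensity(m, k, n, sparsity=0.0, bytes_per_weight=2):
    """FLOPs per byte of global-memory traffic for an (m x k) weight times a (k x n) input.

    FLOPs are counted as dense work, because the tile is multiplied with zeros included.
    Only the weight traffic shrinks with sparsity.
    """
    flops = 2 * m * k * n
    weight_bytes = m * k * (1.0 - sparsity) * bytes_per_weight
    activation_bytes = 2 * (k * n + m * n)  # FP16 input and output
    return flops / (weight_bytes + activation_bytes)


# Hypothetical skinny shape: large square weight, small batch dimension.
dense = arithmetic_intensity(4096, 4096, 16)
# 80% sparsity, assuming 4 bytes per stored nonzero (value plus position).
sparse = arithmetic_intensity(4096, 4096, 16, sparsity=0.8, bytes_per_weight=4)
```

Under these assumptions both cases sit well below the ridge point, so runtime tracks bytes moved rather than FLOPs, and the sparse load raises the FLOPs delivered per byte.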

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
kernel throughput vs Sputnik/SparTA | avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90% | Sputnik / SparTA | - | Varied MatMul shapes from OPT models; batch sizes 8/16/32/64 | Sec.6.1; Fig.9 | Fig.9
kernel throughput vs cuBLAS dense | 1.4×–2.1× (70–90% sparsity) on skinny MatMuls | cuBLAS (dense) | - | MatMul shapes from OPT models | Sec.6.1; Fig.9 | Fig.9

What To Try In 7 Days

Prune a production model to ~80% sparsity using retraining-based pruning and validate accuracy on a representative task.

Build Tiled-CSL sparse weights and run Flash-LLM kernel on an A100 to measure tokens per GPU-second vs your current stack.

If throughput improves, swap to Flash-LLM-enabled serving to reduce GPU count and cross-GPU communication.
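For the comparison in the steps above, a small helper keeps the metric consistent across stacks that use different GPU counts. This is a sketch that assumes tokens per GPU-second means generated tokens divided by wall-clock seconds times GPUs used; the function names and the sample measurements are hypothetical and are not results from the paper.

```python
def tokens_per_gpu_second(tokens_generated, wall_seconds, num_gpus):
    """Generated tokens normalised by wall-clock time and by the number of GPUs used."""
    if wall_seconds <= 0 or num_gpus <= 0:
        raise ValueError("wall_seconds and num_gpus must be positive")
    return tokens_generated / (wall_seconds * num_gpus)


def relative_gain(candidate, baseline):
    """Ratio of the candidate stack's metric to the baseline stack's metric."""
    return candidate / baseline


# Hypothetical measurements: the same 2048 generated tokens on two setups.
baseline = tokens_per_gpu_second(2048, wall_seconds=8.0, num_gpus=4)
candidate = tokens_per_gpu_second(2048, wall_seconds=10.0, num_gpus=2)
gain = relative_gain(candidate, baseline)
```

Normalising by GPU count matters here: in this made-up example the candidate run is slower in wall-clock time but still scores higher because it uses half the GPUs.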

Optimization Features

Token Efficiency
Higher tokens per GPU-second by reducing global memory usage and cross-GPU communication
Infra Optimization
Optimized for NVIDIA Ampere A100 tensor cores; uses cp.async and tensor-core intrinsics
Model Optimization
Unstructured pruning (retraining-based, ~60–90% sparsity)
System Optimization
Improved tensor-core utilization by shifting memory pressure on-chip; double-buffering and register-based sparse extraction
Training Optimization
Accuracy
Inference Optimization
Load-as-Sparse and Compute-as-Dense (sparse load, on-chip dense reconstruct); Tiled-CSL sparse format and ahead-of-time data reordering; two-level overlap pipeline (gmem2reg, extract, smem2tc) to hide latency
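The benefit of overlapping loads with compute can be estimated with a simple timing model. This is a generic double-buffering sketch, not the kernel's actual schedule: it collapses the pipeline stages into one "load" cost and one "compute" cost per tile, ignores synchronisation overhead, and uses hypothetical function names and unit costs.

```python
def serial_time(iters, load, compute):
    """Every tile is loaded, then computed, with no overlap."""
    return iters * (load + compute)


def double_buffered_time(iters, load, compute):
    """Load tile i+1 while computing tile i; only the first load and last compute are exposed."""
    if iters == 0:
        return 0
    return load + (iters - 1) * max(load, compute) + compute


# Hypothetical per-tile costs in arbitrary time units.
no_overlap = serial_time(4, load=3, compute=2)
overlapped = double_buffered_time(4, load=3, compute=2)
```

In this model the steady-state cost per tile is the larger of the two stages, so overlap helps most when load and compute costs are similar.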

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Requires retraining-based pruning to reach the targeted 60–90% sparsity with small accuracy loss.

Performance gains shrink when batch size (N) is large (N > ~256).

When Not To Use

When you cannot or will not perform retraining-based pruning.

For very large batch inference where N is large (>256) and dense kernels are faster.

Failure Modes

Over-pruning beyond tested ranges can harm accuracy; Flash-LLM assumes moderate sparsity (~60–90%).

Shared-memory bank conflicts or register exhaustion may negate speedups on some shapes.

Core Entities

Models

OPT-30B, OPT-66B, OPT-175B, GPT-NeoX-20B

Metrics

tokens per GPU-second, kernel TFLOPs, kernel latency, tensor core utilization

Datasets

SuperGLUE (RTE)

Benchmarks

kernel SpMM TFLOPs, tokens per GPU-second (end-to-end)