Flash-LLM: run sparsified LLMs on tensor cores with up to ~3–3.8× real inference speedups and lower GPU cost

Overview

Decision Snapshot: Ready For Pilot

The work is implemented, integrated into FasterTransformer, and evaluated on real OPT models and A100 hardware, making it practically ready for teams that can prune models and run on NVIDIA tensor-core GPUs.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 65%

Authors

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

Links

Abstract / PDF / Code

Why It Matters For Business

Flash-LLM cuts inference GPU cost and increases throughput for production LLM serving by enabling practical unstructured sparsity on tensor cores.

Who Should Care

ML Engineer, Engineering Lead, CTO, Founder, Product Manager

Summary TLDR

Flash-LLM is a GPU software library that makes unstructured-sparse LLM inference faster and cheaper. It loads pruned weights from global memory in a compact sparse format, reconstructs dense tiles on-chip, and computes on tensor cores. For the skinny MatMuls common in LLM inference, this reduces global memory bandwidth pressure and trades some extra on-chip work for much higher end-to-end throughput. Kernel-level tests show 2–3.6× speedups over Sputnik, 1.4–1.6× over SparTA, and 1.4–2.1× over dense cuBLAS at 70–90% sparsity. On full OPT models, Flash-LLM raises tokens per GPU-second by up to 3.8× vs DeepSpeed and 3.6× vs FasterTransformer while using fewer GPUs. Code is available.

Problem Statement

Large generative models consume too much GPU memory and bandwidth for efficient single-GPU inference. Unstructured pruning preserves accuracy better than structured pruning, but existing sparse kernels do not exploit tensor cores, so they often run slower than dense libraries unless sparsity is extreme (>90%). This makes cost-effective, low-latency inference of large LLMs hard in practice.

Main Contribution

Load-as-Sparse / Compute-as-Dense method: read weights in sparse form, expand to dense on-chip, compute on tensor cores to avoid global memory bottleneck.

Tiled-CSL sparse format plus ahead-of-time reordering to reduce shared-memory bank conflicts and enable tile-by-tile sparse-to-dense transforms.
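The Load-as-Sparse / Compute-as-Dense idea can be illustrated with a toy sketch in plain Python. This is not the Flash-LLM kernel and not the real Tiled-CSL layout: the `(position, value)` pair format and the helper names `compress_tile`, `expand_tile`, and `dense_matmul` are assumptions made up for illustration. The point is only the order of operations: store and move the tile as nonzeros, rebuild it as dense, then run an ordinary dense product.

```python
# Illustrative only: a simplified tile format, not the real Tiled-CSL layout.

def compress_tile(tile):
    """Keep only the nonzeros of a tile as (flat_position, value) pairs."""
    cols = len(tile[0])
    return [(r * cols + c, v)
            for r, row in enumerate(tile)
            for c, v in enumerate(row)
            if v != 0.0]

def expand_tile(nonzeros, rows, cols):
    """Rebuild the dense tile (stands in for the on-chip sparse-to-dense step)."""
    dense = [[0.0] * cols for _ in range(rows)]
    for pos, v in nonzeros:
        dense[pos // cols][pos % cols] = v
    return dense

def dense_matmul(a, b):
    """Plain dense product (stands in for the tensor-core computation)."""
    inner, n = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(n)]
            for row in a]

# A 4x4 weight tile with 3 nonzeros out of 16 entries.
weights = [[0.0, 2.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, -1.0],
           [3.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
activations = [[1.0], [2.0], [3.0], [4.0]]  # skinny: a single column

packed = compress_tile(weights)               # what would travel over the memory bus
rebuilt = expand_tile(packed, 4, 4)           # dense again before the multiply
result = dense_matmul(rebuilt, activations)   # same answer as the dense path
```

The multiply still does the full dense amount of arithmetic; the saving in this scheme comes from moving fewer bytes, which matters when the MatMul is limited by memory bandwidth rather than compute.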

Key Findings

Flash-LLM speeds kernel SpMM 2–3.6× over Sputnik and 1.4–1.6× over SparTA depending on sparsity.

Numbers: avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90%

Practical Use: Use Flash-LLM kernels to get much faster sparse MatMul on A100 GPUs at realistic pruning levels (70–90%).

Evidence Ref: Abstract; Sec.6.1; Fig.9

Flash-LLM can outperform dense cuBLAS on skinny LLM MatMuls because memory bandwidth is the bottleneck.

Numbers: up to 2.1× vs cuBLAS at 90% sparsity; tensor-core utilization rose from ~10% to ~24–43%

Practical Use: When batch size (N) is small and MatMuls are skinny, prefer Flash-LLM over dense kernels to increase throughput.

Evidence Ref: Sec.3.2.1; Sec.6.2; Fig.10; Fig.11
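A back-of-the-envelope roofline estimate shows why skinny MatMuls are bandwidth-bound and why loading fewer weight bytes helps. Everything numeric here is an assumption for illustration: the 8192×8192 weight shape is hypothetical, the peak figures are nominal A100 values (312 TFLOPS dense FP16 tensor-core throughput, 1.555 TB/s memory bandwidth), and the 4 bytes per stored nonzero (a 16-bit value plus a 16-bit position) is an assumed encoding cost, not a number taken from this summary.

```python
def arithmetic_intensity(m, k, n, sparsity=0.0, bytes_per_nonzero=2.0):
    """FLOPs per byte of global-memory traffic for an (m x k) @ (k x n) MatMul.

    The compute is counted as fully dense (2*m*k*n FLOPs), while the weight
    matrix is charged only for the nonzeros that are actually loaded.
    """
    flops = 2 * m * k * n
    weight_bytes = m * k * (1.0 - sparsity) * bytes_per_nonzero
    activation_bytes = 2 * (k * n + m * n)  # FP16 input and output matrices
    return flops / (weight_bytes + activation_bytes)

# Assumed nominal A100 peaks; check the datasheet for your exact SKU.
PEAK_FLOPS = 312e12        # dense FP16 tensor-core FLOP/s
PEAK_BANDWIDTH = 1.555e12  # bytes/s
ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity needed to become compute-bound

def is_memory_bound(intensity):
    """True when the kernel cannot reach peak compute at this intensity."""
    return intensity < ridge

# Hypothetical 8192 x 8192 weight matrix.
dense_small_n = arithmetic_intensity(8192, 8192, 16)
sparse_small_n = arithmetic_intensity(8192, 8192, 16,
                                      sparsity=0.8, bytes_per_nonzero=4.0)
dense_large_n = arithmetic_intensity(8192, 8192, 1024)

print(f"ridge point:            {ridge:.1f} FLOPs/byte")
print(f"dense,  N=16:           {dense_small_n:.1f} FLOPs/byte")
print(f"80% sparse load, N=16:  {sparse_small_n:.1f} FLOPs/byte")
print(f"dense,  N=1024:         {dense_large_n:.1f} FLOPs/byte")
```

Under these assumptions the small-N case sits far below the ridge point, so cutting weight traffic raises the achievable throughput, while the large-N case is already compute-bound, which is consistent with the note that gains shrink at large batch sizes.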

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
kernel throughput vs Sputnik/SparTA	avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90%	Sputnik / SparTA	—	Varied MatMul shapes from OPT models; batch sizes 8/16/32/64	Sec.6.1; Fig.9	Fig.9
kernel throughput vs cuBLAS dense	1.4×–2.1× (70–90% sparsity) on skinny MatMuls	cuBLAS (dense)	—	MatMul shapes from OPT models	Sec.6.1; Fig.9	Fig.9

What To Try In 7 Days

Prune a production model to ~80% sparsity using retraining-based pruning and validate accuracy on a representative task.

Build Tiled-CSL sparse weights and run Flash-LLM kernel on an A100 to measure tokens per GPU-second vs your current stack.

If throughput improves, swap to Flash-LLM-enabled serving to reduce GPU count and cross-GPU communication.
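For the comparison step above, a small helper keeps the metric consistent across stacks. This is a minimal sketch that assumes tokens per GPU-second means generated tokens divided by wall-clock seconds times the number of GPUs used; the function names are made up, and the input numbers are placeholders, not measurements.

```python
def tokens_per_gpu_second(tokens_generated, wall_seconds, num_gpus):
    """Generated tokens normalized by total GPU time (assumed definition)."""
    if wall_seconds <= 0 or num_gpus <= 0:
        raise ValueError("wall_seconds and num_gpus must be positive")
    return tokens_generated / (wall_seconds * num_gpus)

def relative_gain(candidate, baseline):
    """How many times higher the candidate metric is than the baseline."""
    return candidate / baseline

# Placeholder inputs: replace with your own benchmark runs.
baseline = tokens_per_gpu_second(tokens_generated=4096, wall_seconds=10.0, num_gpus=4)
candidate = tokens_per_gpu_second(tokens_generated=4096, wall_seconds=8.0, num_gpus=2)

print(f"baseline:  {baseline:.1f} tokens per GPU-second")
print(f"candidate: {candidate:.1f} tokens per GPU-second")
print(f"gain:      {relative_gain(candidate, baseline):.2f}x")
```

Normalizing by GPU count matters because a sparse model may fit on fewer GPUs, so a run can be better per GPU-second even when its wall-clock latency barely changes.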

Optimization Features

Token Efficiency

Higher tokens per GPU-second by reducing global memory usage and cross-GPU comms

Infra Optimization

Optimized for NVIDIA Ampere A100 tensor cores; uses cp.async and tensor-core intrinsics

Model Optimization

Unstructured pruning (retraining-based, ~60–90% sparsity)
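To make "unstructured sparsity" concrete, the sketch below zeroes the smallest-magnitude weights anywhere in a matrix until a target fraction is zero. Note the swap: this is one-shot magnitude pruning with no retraining, which is simpler than the retraining-based pruning named above and would likely lose more accuracy at these sparsity levels. The function names and the tiny example matrix are made up for illustration.

```python
def magnitude_prune(matrix, sparsity):
    """Return a copy with the smallest-magnitude entries set to zero.

    `sparsity` is the target fraction of zeroed entries; positions are chosen
    freely across the whole matrix, which is what makes it unstructured.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    ranked = sorted((abs(v), r, c)
                    for r, row in enumerate(matrix)
                    for c, v in enumerate(row))
    num_pruned = int(len(ranked) * sparsity)
    pruned = [list(row) for row in matrix]
    for _, r, c in ranked[:num_pruned]:
        pruned[r][c] = 0.0
    return pruned

def measured_sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0.0)
    return zeros / total

# Made-up weights, pruned to 75% sparsity.
weights = [[0.9, -0.1, 0.05, 0.4],
           [-0.7, 0.02, 0.3, -0.01]]
pruned = magnitude_prune(weights, 0.75)
```

The surviving nonzeros land wherever the large weights happen to be, with no block or N:M pattern, which is the irregularity a sparse kernel has to cope with.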

System Optimization

Improved tensor-core utilization by shifting memory pressure on-chip

Double-buffering and register-based sparse extraction

Training Optimization

Accuracy

Inference Optimization

Load-as-Sparse and Compute-as-Dense (sparse load, on-chip dense reconstruct)

Tiled-CSL sparse format and ahead-of-time data reordering

Two-level overlap pipeline (gmem2reg, extract, smem2tc) to hide latency

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/AlibabaResearch/flash-llm

Risks & Boundaries

Limitations

Requires retraining-based pruning to reach the targeted 60–90% sparsity with small accuracy loss.

Performance gains shrink when batch size (N) is large (N > ~256).

When Not To Use

When you cannot or will not perform retraining-based pruning.

For very large batch inference where N is large (>256) and dense kernels are faster.

Failure Modes

Over-pruning beyond tested ranges can harm accuracy; Flash-LLM assumes moderate sparsity (~60–90%).

Shared-memory bank conflicts or register exhaustion may negate speedups on some shapes.

Core Entities

Models

OPT-30B, OPT-66B, OPT-175B, GPT-NeoX-20B

Metrics

tokens per GPU-second, kernel TFLOPs, kernel latency, tensor core utilization

Datasets

SuperGLUE (RTE)

Benchmarks

kernel SpMM TFLOPs, tokens per GPU-second (end-to-end)
