Flash-LLM: run sparsified LLMs on tensor cores with up to ~3–3.8× real inference speedups and lower GPU cost

September 19, 2023 · 8 min

Overview

Production Readiness

0.8

Novelty Score

0.65

Cost Impact Score

0.8

Citation Count

1

Authors

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

Links

Abstract / PDF

Why It Matters For Business

Flash-LLM cuts inference GPU cost and increases throughput for production LLM serving by enabling practical unstructured sparsity on tensor cores.

Summary TLDR

Flash-LLM is a GPU software library that makes unstructured-sparse LLM inference faster and cheaper. It reads pruned weights from GPU global memory in a compact sparse format, reconstructs dense tiles on-chip, and computes on tensor cores. For the skinny MatMuls common in LLM inference, this reduces global memory bandwidth pressure and trades some extra on-chip work for much higher end-to-end throughput. Kernel-level tests show 2–3.6× speedups over Sputnik, 1.4–1.6× over SparTA, and roughly 1.4–2.1× over dense cuBLAS at 70–90% sparsity. On full OPT models, Flash-LLM raises tokens per GPU-second by up to 3.8× vs DeepSpeed and 3.6× vs FasterTransformer while using fewer GPUs. Code is available.

Problem Statement

Large generative models consume too much GPU memory and bandwidth for efficient single-GPU inference. Unstructured pruning preserves accuracy better than structured pruning, but existing sparse kernels do not exploit tensor cores, so they often run slower than dense libraries unless sparsity is extreme (>90%). This makes cost-effective, low-latency inference of large LLMs hard in practice.

Main Contribution

Load-as-Sparse / Compute-as-Dense method: read weights in sparse form, expand to dense on-chip, compute on tensor cores to avoid global memory bottleneck.

Tiled-CSL sparse format plus ahead-of-time reordering to reduce shared-memory bank conflicts and enable tile-by-tile sparse-to-dense transforms.

A pipelined GPU kernel that overlaps global-memory loads, sparse-to-dense extraction, and tensor-core MMA to speed up the skinny MatMuls common in LLMs.

Integration into FasterTransformer and public open-source release for practical deployment.
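The Load-as-Sparse / Compute-as-Dense idea can be illustrated on the CPU with a minimal sketch. This is not the Flash-LLM kernel: the function names, the per-tile list of `(position, value)` pairs, and the tile sizes are all illustrative assumptions, and plain Python loops stand in for shared memory and tensor cores. It only shows the data flow: non-zeros are the only weight data read per tile, the tile is expanded to dense, and the multiply runs over the dense tile, zeros included.

```python
def to_sparse_tiles(w, tile_rows, tile_cols):
    """Split a dense matrix (list of rows) into tiles, keeping only non-zeros.

    Each tile becomes a list of (position_in_tile, value) pairs.
    Assumes the matrix dimensions are multiples of the tile dimensions.
    """
    tiles = {}
    for r0 in range(0, len(w), tile_rows):
        for c0 in range(0, len(w[0]), tile_cols):
            nz = []
            for i in range(tile_rows):
                for j in range(tile_cols):
                    v = w[r0 + i][c0 + j]
                    if v != 0.0:
                        nz.append((i * tile_cols + j, v))
            tiles[(r0, c0)] = nz
    return tiles


def matmul_load_sparse_compute_dense(tiles, x, out_rows, tile_rows, tile_cols):
    """Compute y = W @ x where W is given as sparse tiles."""
    n = len(x[0])
    y = [[0.0] * n for _ in range(out_rows)]
    for (r0, c0), nz in tiles.items():
        # "Load as sparse": only the non-zero pairs are read for this tile.
        dense = [[0.0] * tile_cols for _ in range(tile_rows)]
        for pos, v in nz:
            dense[pos // tile_cols][pos % tile_cols] = v
        # "Compute as dense": a regular dense tile multiply, zeros included.
        for i in range(tile_rows):
            for k in range(tile_cols):
                for j in range(n):
                    y[r0 + i][j] += dense[i][k] * x[c0 + k][j]
    return y


# Hypothetical 2x4 weight matrix with 3 non-zeros, and a skinny 4x2 activation.
W = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 0.0, 3.0, 0.0]]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

sparse_tiles = to_sparse_tiles(W, tile_rows=2, tile_cols=2)
Y = matmul_load_sparse_compute_dense(sparse_tiles, X, out_rows=2,
                                     tile_rows=2, tile_cols=2)
```

The dense multiply does redundant work on zeros. The trade pays off only when reading weights, not arithmetic, is the limiting cost.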

Key Findings

Flash-LLM speeds up SpMM kernels by 2–3.6× over Sputnik and 1.4–1.6× over SparTA, depending on sparsity.

Numbers: avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90%

Flash-LLM can outperform dense cuBLAS on skinny LLM MatMuls because memory bandwidth is the bottleneck.

Numbers: up to 2.1× vs cuBLAS at 90% sparsity; tensor-core utilization rose from ~10% to ~24–43%

End-to-end OPT model inference throughput improves by up to ~3.8× vs DeepSpeed and ~3.6× vs FasterTransformer while using fewer GPUs.

Numbers: OPT-30B/66B/175B: up to 3.8× (tokens/GPU-s) vs DeepSpeed and 3.6× vs FasterTransformer

Retraining-based unstructured pruning preserves model quality at 80% sparsity with small accuracy loss.

Numbers: OPT-30B accuracy 85.55% → 84.11% (−1.44%) on SuperGLUE RTE; GPT-NeoX-20B −0.72%

Flash-LLM becomes slower than dense cuBLAS when N (the batch dimension) grows large (>256): the MatMul is then no longer memory-bound, so the bandwidth savings stop offsetting the extra on-chip work.

Numbers: slower than cuBLAS for N > 256 (Fig. 12)
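Why skinny MatMuls are bandwidth-bound, and why the advantage fades as N grows, can be checked with a rough roofline estimate. The sketch below is not from the paper. The 8192 × 8192 weight shape is hypothetical, and the peak figures (312 TFLOP/s dense FP16 tensor-core throughput, 1.555 TB/s HBM bandwidth) are assumed A100-40GB datasheet values. The model counts only ideal memory traffic and ignores caches.

```python
def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) MatMul with FP16 operands.

    Counts one read of each input and one write of the output.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Assumed hardware peaks (A100-40GB datasheet values, not from the source).
PEAK_FLOPS = 312e12          # dense FP16 tensor-core FLOP/s
PEAK_BYTES_PER_S = 1.555e12  # HBM bandwidth in bytes/s
RIDGE = PEAK_FLOPS / PEAK_BYTES_PER_S  # FLOPs per byte needed to be compute-bound


def is_memory_bound(m, k, n):
    """True when the MatMul cannot reach peak compute under this simple model."""
    return arithmetic_intensity(m, k, n) < RIDGE


# Hypothetical 8192 x 8192 weight matrix at several batch sizes N.
for n in (16, 64, 256, 1024):
    ai = arithmetic_intensity(8192, 8192, n)
    print(f"N={n:5d}  intensity={ai:7.1f} FLOP/B  memory_bound={is_memory_bound(8192, 8192, n)}")
```

When M and K are much larger than N, the intensity is close to N itself, so small batches sit far below the ridge point and shrinking the bytes read for weights translates almost directly into speed.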

Results

kernel throughput vs Sputnik/SparTA

Value: avg 3.6× / 1.4× at 70% sparsity; 3.0× / 1.4× at 80%; 2.0× / 1.6× at 90%

Baseline: Sputnik / SparTA

kernel throughput vs cuBLAS dense

Value: 1.4×–2.1× (70–90% sparsity) on skinny MatMuls

Baseline: cuBLAS (dense)

end-to-end tokens per GPU-second (OPT-30B/66B/175B)

Value: up to 3.8× vs DeepSpeed; up to 3.6× vs FasterTransformer

Baseline: DeepSpeed / FasterTransformer

Accuracy

Value: OPT-30B: −1.44% (85.55 → 84.11); GPT-NeoX-20B: −0.72%

Baseline: original dense model

memory footprint enabling fewer GPUs

Value: OPT-175B inference fits on 4 A100 GPUs with Flash-LLM, versus the 8 GPUs required for the dense model

Baseline: FasterTransformer dense multi-GPU
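A back-of-envelope estimate shows how sparsity shrinks weight storage. This is a sketch under stated assumptions, not the paper's accounting: dense weights are taken as 2 bytes each (FP16), and each stored non-zero is assumed to cost 4 bytes (a 16-bit value plus a 16-bit position). Per-tile offsets, activations, and the KV cache are ignored, so it does not by itself reproduce the GPU counts above.

```python
def dense_weight_bytes(num_params, bytes_per_weight=2):
    """Storage for dense FP16 weights."""
    return num_params * bytes_per_weight


def sparse_weight_bytes(num_params, sparsity, bytes_per_nonzero=4):
    """Storage when only non-zeros are kept.

    bytes_per_nonzero=4 is an assumption: value plus index for each entry.
    """
    nonzeros = round(num_params * (1.0 - sparsity))
    return nonzeros * bytes_per_nonzero


params = 175_000_000_000  # OPT-175B parameter count, rounded
dense_gb = dense_weight_bytes(params) / 1e9
sparse_gb = sparse_weight_bytes(params, sparsity=0.8) / 1e9
print(f"dense:  {dense_gb:.0f} GB")
print(f"sparse: {sparse_gb:.0f} GB at 80% sparsity")
```

Under these assumptions the sparse form only breaks even at 50% sparsity, because each non-zero carries its index. That is one reason the useful range starts well above that.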

What To Try In 7 Days

Prune a production model to ~80% sparsity using retraining-based pruning and validate accuracy on a representative task.

Build Tiled-CSL sparse weights and run Flash-LLM kernel on an A100 to measure tokens per GPU-second vs your current stack.

If throughput improves, swap to Flash-LLM-enabled serving to reduce GPU count and cross-GPU communication.
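When comparing stacks in the second step, normalize throughput by GPU count so that a setup using fewer GPUs is credited for it. The helper below is a minimal sketch: the function name and every number in the example are hypothetical placeholders, not measurements.

```python
def tokens_per_gpu_second(tokens_generated, wall_seconds, num_gpus):
    """Throughput normalized by the number of GPUs used."""
    if wall_seconds <= 0 or num_gpus <= 0:
        raise ValueError("wall_seconds and num_gpus must be positive")
    return tokens_generated / (wall_seconds * num_gpus)


# Hypothetical measurements: same workload, different GPU counts.
baseline = tokens_per_gpu_second(tokens_generated=4096, wall_seconds=10.0, num_gpus=8)
candidate = tokens_per_gpu_second(tokens_generated=4096, wall_seconds=8.0, num_gpus=4)

print(f"baseline:  {baseline:.1f} tokens/GPU-s")
print(f"candidate: {candidate:.1f} tokens/GPU-s")
print(f"ratio:     {candidate / baseline:.2f}x")
```

Measure both stacks on the same prompts, output lengths, and batch size; otherwise the ratio mixes workload differences with kernel differences.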

Optimization Features

Token Efficiency

  • Higher tokens per GPU-second by reducing global memory usage and cross-GPU communication

Infra Optimization

  • Optimized for NVIDIA Ampere A100 tensor cores; uses cp.async and tensor-core intrinsics

Model Optimization

  • Unstructured pruning (retraining-based, ~60–90% sparsity)
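The summary does not say which pruning algorithm is used, so the sketch below substitutes plain magnitude pruning to show what "unstructured" means: individual weights are zeroed wherever they are smallest, with no block or row pattern. The function name is illustrative, and the retraining step that recovers accuracy is omitted.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat list of weights.

    A stand-in for one pruning step; retraining between steps is not shown.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights; at 80% sparsity only the two largest magnitudes survive.
w = [0.5, -0.1, 0.05, -2.0, 0.3, 0.01, 1.5, -0.7, 0.2, -0.02]
w_pruned = magnitude_prune(w, sparsity=0.8)
```

Because the surviving positions are arbitrary, the result cannot use hardware paths that require a fixed sparsity pattern. That gap is what the library targets.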

System Optimization

  • Improved tensor-core utilization by shifting memory pressure on-chip
  • Double-buffering and register-based sparse extraction

Training Optimization

  • Accuracy

Inference Optimization

  • Load-as-Sparse and Compute-as-Dense (sparse load, on-chip dense reconstruct)
  • Tiled-CSL sparse format and ahead-of-time data reordering
  • Two-level overlap pipeline (gmem2reg, extract, smem2tc) to hide latency
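A tiled sparse layout of this kind can be sketched as two arrays: one of packed non-zeros and one of per-tile offsets into it. The record layout below (a 16-bit float value plus a 16-bit position inside the tile, 4 bytes per non-zero) is an assumption made for illustration, as are the function names. The ahead-of-time reordering that avoids bank conflicts is not modeled.

```python
import struct

RECORD = "<eH"  # assumed layout: little-endian fp16 value + uint16 position


def encode_tile(tile):
    """Pack the non-zeros of one dense tile into 4-byte records."""
    cols = len(tile[0])
    out = bytearray()
    for r, row in enumerate(tile):
        for c, v in enumerate(row):
            if v != 0.0:
                out += struct.pack(RECORD, v, r * cols + c)
    return bytes(out)


def decode_tile(blob, rows, cols):
    """Expand packed records back into a dense rows x cols tile."""
    dense = [[0.0] * cols for _ in range(rows)]
    for v, pos in struct.iter_unpack(RECORD, blob):
        dense[pos // cols][pos % cols] = v
    return dense


def encode_matrix(w, tile_rows, tile_cols):
    """Return (tile_offsets, nonzeros) for a matrix split into tiles.

    tile_offsets[i] is the index of the first record of tile i; tiles are
    visited row-major. Assumes dimensions are multiples of the tile size.
    """
    offsets, blob = [0], bytearray()
    for r0 in range(0, len(w), tile_rows):
        for c0 in range(0, len(w[0]), tile_cols):
            tile = [row[c0:c0 + tile_cols] for row in w[r0:r0 + tile_rows]]
            blob += encode_tile(tile)
            offsets.append(len(blob) // struct.calcsize(RECORD))
    return offsets, bytes(blob)


# Hypothetical 2x4 matrix split into two 2x2 tiles.
W = [[0.0, 1.5, 0.0, 0.0],
     [0.0, 0.0, -2.0, 0.25]]
tile_offsets, nonzeros = encode_matrix(W, tile_rows=2, tile_cols=2)
second_tile = decode_tile(nonzeros[tile_offsets[1] * 4:tile_offsets[2] * 4], 2, 2)
```

The offsets array lets each thread block find its tile's records without scanning. A 16-bit position caps a tile at 65,536 elements.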

Reproducibility

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Requires retraining-based pruning to reach the targeted 60–90% sparsity with small accuracy loss.
  • Performance gains shrink when batch size (N) is large (N > ~256).
  • Increased shared-memory/register pressure can limit gains on some GPU generations.

When Not To Use

  • When you cannot or will not perform retraining-based pruning.
  • For very large batch inference where N is large (>256) and dense kernels are faster.
  • On hardware without tensor-core support or lacking cp.async semantics.

Failure Modes

  • Over-pruning beyond tested ranges can harm accuracy; Flash-LLM assumes moderate sparsity (~60–90%).
  • Shared-memory bank conflicts or register exhaustion may negate speedups on some shapes.
  • Extra preprocessing (Tiled-CSL conversion) adds offline cost and storage overhead for sparse metadata.

Core Entities

Models

  • OPT-30B
  • OPT-66B
  • OPT-175B
  • GPT-NeoX-20B

Metrics

  • tokens per GPU-second
  • kernel TFLOPs
  • kernel latency
  • tensor core utilization

Datasets

  • SuperGLUE (RTE)

Benchmarks

  • kernel SpMM TFLOPs
  • tokens per GPU-second (end-to-end)