Use activation entropy + channel shuffling to get one-shot N:M sparsity for LLMs with big memory and latency wins

October 24, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

4

Authors

Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang

Why It Matters For Business

E-Sparse cuts LLM GPU memory by ~43% and speeds up matrix multiplication (GEMM) by 1.24–1.53× on Ampere hardware, letting teams host larger models or reduce instance costs with small accuracy trade-offs.

Summary TLDR

E-Sparse is a one-shot, post-training pruning method for LLMs that adds channel-wise information entropy to standard magnitude metrics and reorders channels (global + local shuffle) to reduce information loss from N:M sparsity. Implemented as a Sparse-GEMM in FasterTransformer, it achieves ~1.24–1.53× end-to-end GEMM speedups and ~42.6–43.5% model memory savings on LLaMA/OPT with small accuracy costs on WikiText and zero-shot tasks.

Problem Statement

N:M sparsity can speed up LLM inference on modern GPUs but damages accuracy because informative activation channels are concentrated and standard magnitude metrics miss this. Existing good-accuracy methods either need expensive weight updates or use only feature norms. We need a cheap, one-shot pruning metric and a practical channel reordering to get N:M sparsity on LLMs with low accuracy loss.

Main Contribution

Introduce an entropy-based channel importance metric that augments weight magnitude and activation norm to rank elements for N:M pruning.

Design a two-stage channel shuffle (global naive + local block greedy) that spreads information to reduce N:M pruning damage.

Implement E-Sparse Sparse-GEMM in FasterTransformer using cuSPARSE/cuSPARSELt and measure real latency and memory gains on Ampere GPUs.

Show one-shot pruning (no weight updates) that improves perplexity and zero-shot accuracy over strong post-training baselines (Wanda, SparseGPT) on LLaMA and OPT.
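The first contribution can be illustrated with a minimal sketch. It assumes a score of the form |W_ij| × (entropy_j + alpha × ||X_j||_2), a histogram-based entropy estimate, and an `alpha` weighting; the summary above does not give the exact formula, so all three are assumptions, and the function names are hypothetical.

```python
import math

def channel_entropy(values, bins=8):
    """Shannon entropy (in nats) of a histogram over one channel's activations."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant channel carries no information
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum(c / total * math.log(c / total) for c in counts if c)

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def importance(weights, acts, alpha=1.0):
    """weights: rows are output neurons, columns are input channels.
    acts: rows are calibration tokens, columns are input channels.
    Assumed score: |W_ij| * (entropy_j + alpha * ||X_j||_2)."""
    n_channels = len(weights[0])
    columns = [[row[j] for row in acts] for j in range(n_channels)]
    ch_score = [channel_entropy(c) + alpha * l2_norm(c) for c in columns]
    return [[abs(w) * ch_score[j] for j, w in enumerate(row)] for row in weights]

def nm_mask(scores, n=2, m=4):
    """Keep the n highest-scoring entries in every group of m consecutive columns."""
    mask = []
    for row in scores:
        keep = [0] * len(row)
        for start in range(0, len(row), m):
            group = range(start, min(start + m, len(row)))
            for j in sorted(group, key=lambda j: row[j], reverse=True)[:n]:
                keep[j] = 1
        mask.append(keep)
    return mask
```

Multiplying the weights element-wise by the mask yields the pruned matrix; no weight update follows, which is what makes the procedure one-shot.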

Key Findings

Under 2:4 sparsity, E-Sparse reaches a WikiText perplexity of 8.26 on LLaMA-13B, against 5.09 for the dense FP16 model.

Numbers: LLaMA-13B 2:4 perplexity = 8.26 (FP16 = 5.09)

E-Sparse outperforms Wanda and SparseGPT on average zero-shot accuracy for LLaMA-7B under 2:4 sparsity.

Numbers: LLaMA-7B avg accuracy: E-Sparse 49.00% vs Wanda 47.68% vs SparseGPT 48.37%

Measured GEMM latency drops by up to 34.8%, and end-to-end layer latency drops by 19.6%–34.8%, on Ampere (A100) GPUs.

Numbers: GEMM latency reduction up to 34.8%; layer latency reductions 19.6%–34.8%
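The latency reductions here and the 1.24–1.53× speedups quoted in the summary look like two views of the same measurements. Assuming they refer to the same runs, the conversion is a one-liner (the function name is illustrative):

```python
def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction (0.348 means 34.8%)."""
    return 1.0 / (1.0 - reduction)

for r in (0.196, 0.348):
    print(f"{r * 100:.1f}% lower latency -> {speedup_from_reduction(r):.2f}x speedup")
```

The two endpoints of the reported range map onto the two endpoints of the quoted speedup range after rounding to two decimals.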

E-Sparse saves ~42.6%–43.5% model memory on LLaMA family under 2:4 sparsity.

Numbers: Memory saving = 42.64%–43.52% across LLaMA-7B to LLaMA-65B
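A back-of-the-envelope check shows why the saving sits just under 44% rather than at 50%. The sketch assumes FP16 values and a 2-bit position index stored per kept value, as in NVIDIA's 2:4 compressed layout; the `pruned_share` figure is purely illustrative and not taken from the paper.

```python
def nm_compressed_fraction(n=2, m=4, value_bits=16, index_bits=2):
    """Compressed size divided by dense size for one N:M-pruned matrix.
    index_bits=2 is an assumption that matches the 2:4 pattern only."""
    dense_bits = m * value_bits
    sparse_bits = n * (value_bits + index_bits)  # kept values plus their position indices
    return sparse_bits / dense_bits

def model_saving(pruned_share, **kwargs):
    """Whole-model saving when only `pruned_share` of the parameters is pruned."""
    return pruned_share * (1.0 - nm_compressed_fraction(**kwargs))

print(f"per-matrix saving: {1 - nm_compressed_fraction():.4f}")
print(f"model saving if 98% of parameters are pruned: {model_saving(0.98):.4f}")
```

Under these assumptions each pruned matrix shrinks by 43.75%. Parameters left dense, such as embeddings, would pull the whole-model figure slightly lower, which is consistent with the reported 42.64%–43.52%.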

The ablation shows that the entropy term and each of the two shuffles contribute: entropy improves perplexity by up to 0.76, and the global naive shuffle (GNS) and local block shuffle (LBS) add up to 0.44 and 0.42.

Numbers: Entropy +0.76; GNS +0.44; LBS +0.42 (perplexity improvement)

Results

WikiText perplexity (LLaMA-13B, 2:4)

Value: 8.26

Baseline: FP16 = 5.09

Average zero-shot accuracy (LLaMA-7B, 2:4)

Value: 49.00%

Baseline: Wanda = 47.68%, SparseGPT = 48.37%

GEMM / layer latency reduction

Value: up to 34.8%

Baseline: dense FP16 GEMM

Model memory saving (LLaMA family)

Value: 42.64%–43.52%

Baseline: dense FP16

What To Try In 7 Days

Run E-Sparse one-shot pruning (2:4) on one LLaMA variant using 128 calibration sequences from C4 and measure WikiText perplexity.

Run the pruned model through FasterTransformer's sparse GEMM kernels and benchmark GEMM latency on your Ampere (A100) hardware.

Enable global naive + local block shuffle and compare accuracy vs using only activation norms (ablation).
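For the perplexity measurement in the first step, the metric itself is simple once per-token log-probabilities are available. The sketch below assumes you have already collected natural-log probabilities of each reference token from the dense and pruned models; how they are collected depends on your evaluation harness and is not shown.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy check: a model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))
```

Comparing the dense and pruned models with the same tokenizer, context length, and evaluation split keeps the two numbers comparable.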

Optimization Features

Infra Optimization

  • Optimized for NVIDIA Ampere (A100) sparse tensor cores

Model Optimization

  • N:M structured sparsity (2:4 and 4:8 patterns)
  • one-shot post-training pruning (no weight updates)

System Optimization

  • Algorithm search and caching for optimal sparse matmul per tensor shape

Inference Optimization

  • Entropy-augmented importance metric (weights + activation norm + entropy)
  • Channel shuffle (global naive + local block greedy)
  • Sparse-GEMM kernels selected via cuSPARSE / cuSPARSELt
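To see why channel shuffling helps, consider a toy version of the two stages. The details here are assumptions made for illustration: the global stage deals channels round-robin by importance rank, the local stage accepts pairwise swaps inside a block when they raise the retained score, and `block` is a hypothetical parameter. The paper's actual procedure may differ.

```python
def retained_score(scores, order, n=2, m=4):
    """Total importance kept by N:M pruning when columns are visited in `order`."""
    total = 0.0
    for row in scores:
        for start in range(0, len(order), m):
            group = sorted((row[j] for j in order[start:start + m]), reverse=True)
            total += sum(group[:n])
    return total

def global_shuffle(channel_importance, m=4):
    """Deal channels, ranked by importance, round-robin into groups of m.
    Assumes the channel count is a multiple of m."""
    ranked = sorted(range(len(channel_importance)),
                    key=lambda j: channel_importance[j], reverse=True)
    n_groups = max(1, len(ranked) // m)
    groups = [[] for _ in range(n_groups)]
    for rank, ch in enumerate(ranked):
        groups[rank % n_groups].append(ch)
    return [ch for g in groups for ch in g]

def local_block_shuffle(scores, order, block=8, n=2, m=4):
    """Greedy pass: within each block, keep a pairwise swap only if it raises the score."""
    order = list(order)
    best = retained_score(scores, order, n, m)
    for start in range(0, len(order), block):
        end = min(start + block, len(order))
        for a in range(start, end):
            for b in range(a + 1, end):
                order[a], order[b] = order[b], order[a]
                cand = retained_score(scores, order, n, m)
                if cand > best:
                    best = cand
                else:
                    order[a], order[b] = order[b], order[a]  # revert the swap
    return order

scores = [[9, 8, 7, 6, 1, 1, 1, 1]]  # four informative channels packed into one group
identity = list(range(8))
shuffled = global_shuffle([sum(col) for col in zip(*scores)])
print(retained_score(scores, identity), retained_score(scores, shuffled))
```

In the toy matrix, the unshuffled layout forces 2:4 pruning to drop two of the four informative channels, while the shuffled layout keeps all four. In a real model the same permutation has to be applied to the layer's input activations so that the layer output is unchanged.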

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Tested only on NLP LLMs (LLaMA/OPT/BLOOM); applicability to vision or speech tasks is untested.
  • Experiments use public datasets with limited sentence lengths; longer contexts not fully evaluated.
  • Interaction with quantization or distillation was not studied.

When Not To Use

  • When your deployment GPUs do not support N:M sparse tensor cores (older hardware).
  • When any small perplexity increase is unacceptable for your task.
  • When you plan to combine heavy quantization or distillation without additional validation.

Failure Modes

  • Accuracy degradation increases if sparsity pattern is too aggressive for a given model.
  • Speed/memory wins depend on GPU and cuSPARSE/cuSPARSELt kernel availability and shapes.
  • Channel shuffle may be suboptimal for models or data with different activation statistics.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • LLaMA-30B
  • LLaMA-65B
  • OPT-6.7B
  • OPT-30B
  • BLOOM-176B (mentioned)

Metrics

  • Perplexity
  • Accuracy
  • GEMM / layer latency reduction
  • Memory saving (%)

Datasets

  • WikiText (validation)
  • C4 (128 calibration sequences)
  • EleutherAI LM Harness (zero-shot benchmark)

Benchmarks

  • EleutherAI LM Harness
  • WikiText perplexity