TEAL: thresholding hidden activations to cut memory movement and speed up LLM decoding without extra training

August 26, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

Links

Abstract / PDF

Why It Matters For Business

TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.

Summary TLDR

TEAL is a simple, training-free method that zeros the low-magnitude input activations of every weight matrix in modern LLMs. Using thresholds estimated offline, TEAL reaches ~40–50% model-wide activation sparsity with small accuracy loss on Llama-2/3 and Mistral (7B–70B). With a Triton-based sparse GEMV kernel, TEAL yields practical single-batch decoding speed-ups of ~1.5× at 40% sparsity and up to ~1.8× at 50%, and it works alongside weight quantization. Main limits: the benefits are strongest in single-batch (edge) settings, wall-clock gains require specialized kernels, and sparsifying prefill tokens needs care.
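To make the thresholding idea concrete, here is a minimal pure-Python sketch. It assumes the offline threshold is simply the empirical quantile of activation magnitudes on a calibration sample; the function names `calibrate_threshold` and `sparsify` and the synthetic Gaussian activations are illustrative assumptions, not the authors' code.

```python
import random

def calibrate_threshold(samples, sparsity):
    """Pick the magnitude below which a `sparsity` fraction of samples fall."""
    mags = sorted(abs(v) for v in samples)
    k = int(sparsity * len(mags))
    # k == 0 means nothing is pruned; a negative threshold keeps every entry.
    return mags[k - 1] if k > 0 else -1.0

def sparsify(x, threshold):
    """Zero every entry whose magnitude is at or below the threshold."""
    return [v if abs(v) > threshold else 0.0 for v in x]

# Hypothetical calibration data: synthetic zero-mean activations.
rng = random.Random(0)
calibration = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
threshold = calibrate_threshold(calibration, sparsity=0.40)

# Apply the fixed threshold to a fresh activation vector at "inference" time.
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]
x_sparse = sparsify(x, threshold)
achieved = sum(1 for v in x_sparse if v == 0.0) / len(x_sparse)
print(f"threshold={threshold:.3f}, achieved sparsity={achieved:.2%}")
```

Because the threshold is fixed ahead of time, the sparsity on any single input only approximates the target; it matches the target on average when inference activations follow the calibration distribution.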

Problem Statement

Modern LLM inference is memory-bound: moving large weight tensors from off-chip memory dominates latency. Activation sparsity can avoid transferring unused weight channels, but recent LLMs lack natural ReLU sparsity and prior fixes require heavy retraining. We need a practical, training-free approach that creates activation sparsity across modern models and yields real wall-clock speed-ups.
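The link between activation sparsity and memory movement can be illustrated with a toy matrix-vector product. The sketch below stores the weight matrix as a list of columns and counts how many weight elements each variant reads; the counter stands in for off-chip traffic, and all names are illustrative assumptions rather than anything from the TEAL kernel.

```python
def dense_gemv(W_cols, x):
    """y = W x, reading every column of W regardless of x."""
    rows = len(W_cols[0])
    y = [0.0] * rows
    for j, col in enumerate(W_cols):
        for i in range(rows):
            y[i] += col[i] * x[j]
    return y, len(W_cols) * rows  # (result, weight elements read)

def sparse_gemv(W_cols, x):
    """y = W x, skipping columns whose input activation is exactly zero."""
    rows = len(W_cols[0])
    y = [0.0] * rows
    loaded = 0
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue  # this weight column is never fetched
        col = W_cols[j]
        loaded += rows
        for i in range(rows):
            y[i] += col[i] * xj
    return y, loaded

# A 2x4 matrix stored as four columns, with a 50%-sparse input vector.
W_cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [1.0, 0.0, 0.0, 2.0]

y_dense, reads_dense = dense_gemv(W_cols, x)
y_sparse, reads_sparse = sparse_gemv(W_cols, x)
print(y_dense, reads_dense)    # [15.0, 18.0] 8
print(y_sparse, reads_sparse)  # [15.0, 18.0] 4
```

Both variants return the same output, but the sparse one reads half of the weights, which is the quantity that matters when decoding is bound by memory bandwidth.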

Main Contribution

Introduce TEAL: a training-free, magnitude-based activation sparsity method that thresholds activations for all weight matrices in a Transformer block.

Block-wise greedy algorithm to assign per-layer sparsity under a model-level target, using offline activation statistics.

A Triton-based sparse GEMV kernel with mask fusion, FP16 SplitK accumulation, and cache eviction hints to realize real wall-clock speed-ups.

Empirical results across Llama-2, Llama-3, and Mistral (7B–70B): 40–50% sparsity with minor accuracy loss and up to 1.8× single-batch decoding speed-up. Demonstrated compatibility with common post-training weight quantizers.
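The block-wise greedy allocation can be sketched roughly as follows. This is a simplified illustration, not the authors' exact procedure: `layer_error` is a hypothetical stand-in for the block output error measured from offline activation statistics, and the quadratic error curves, layer sizes, and step size are made up for the demo.

```python
def greedy_allocate(sizes, layer_error, target, step=0.05):
    """Raise per-layer sparsity one step at a time until the size-weighted
    block sparsity reaches `target`, always bumping the layer whose bump
    yields the lowest total error."""
    n = len(sizes)
    total = sum(sizes)
    levels = [0] * n  # sparsity is levels[i] * step; integers avoid drift

    def block_sparsity(lv):
        return sum(s * l * step for s, l in zip(sizes, lv)) / total

    while block_sparsity(levels) < target - 1e-9:
        best, best_err = None, None
        for i in range(n):
            if (levels[i] + 1) * step > 1.0 + 1e-9:
                continue  # this layer is already fully sparse
            trial = levels[:]
            trial[i] += 1
            err = sum(layer_error(k, trial[k] * step) for k in range(n))
            if best_err is None or err < best_err:
                best, best_err = i, err
        if best is None:
            break  # no layer can be sparsified further
        levels[best] += 1
    return [l * step for l in levels]

# Hypothetical block: three layers with different sizes and sensitivities.
sizes = [1, 1, 2]
sensitivity = [1.0, 4.0, 2.0]

def layer_error(k, p):
    # Assumed error model: error grows quadratically with sparsity.
    return sensitivity[k] * p * p

alloc = greedy_allocate(sizes, layer_error, target=0.40)
print([round(p, 2) for p in alloc])
```

In this toy setup the least sensitive layer ends up with more sparsity than the most sensitive one, which is the behavior a non-uniform allocation is meant to capture.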

Key Findings

TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.

Numbers: WikiText perplexity, Llama-3-8B: 5.87 → 6.21 (40% sparsity), 6.67 (50% sparsity)

Single-batch decoding wall-clock speed-ups of up to ~1.5×–1.8× observed with TEAL and a specialized sparse GEMV kernel.

Numbers: Tokens/sec (A6000), Llama-2-7B: 50.54 → 77.30 (1.53× at 40%), 89.91 (1.78× at 50%)

TEAL is compatible with common weight quantization schemes; the errors from sparsity and from quantization add up, but the two sources appear largely independent of each other.

Numbers: Perplexity-versus-sparsity trends are similar across 8-bit RTN, 4-bit AWQ, and 2/3-bit QuIP# on Llama-2-7B

TEAL's block-wise greedy sparsity beats uniform sparsity and the CATS baseline in end metrics.

Numbers: Downstream average (Llama-3-8B): baseline 68.07; TEAL 67.73 (25% sparsity); CATS 64.15 (25% sparsity)

Results

Model-wide activation sparsity

Value: 40–50%

Baseline: 0% (no sparsity)

Perplexity (Llama-3-8B on WikiText)

Value: 5.87 → 6.21 (40%) → 6.67 (50%)

Baseline: 5.87 (0%)

Accuracy

Value: 56.50 → 55.45 (40%) → 54.26 (50%)

Baseline: 56.50 (0%)

End-to-end single-batch throughput (tokens/sec)

Value: Llama-2-7B on A6000: 50.54 → 77.30 (40%) → 89.91 (50%)

Baseline: 50.54 (0%)
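The headline ratios follow directly from the figures above. The short script below recomputes them and, as a rough reference point, compares the measured speed-up with an idealized bound of 1/(1 − s). That bound is an assumption added here for illustration: it supposes decoding time is entirely weight loading and that traffic falls in proportion to sparsity s.

```python
# Throughput figures are for Llama-2-7B on an A6000 (tokens/sec).
baseline_tps = 50.54
sparse_tps = {0.40: 77.30, 0.50: 89.91}

# Perplexity figures are for Llama-3-8B on WikiText (a different model).
baseline_ppl = 5.87
sparse_ppl = {0.40: 6.21, 0.50: 6.67}

speedup = {s: t / baseline_tps for s, t in sparse_tps.items()}
# Idealized bound (an assumption): time scales with the weights actually read.
ideal = {s: 1.0 / (1.0 - s) for s in sparse_tps}
ppl_increase = {s: p / baseline_ppl - 1.0 for s, p in sparse_ppl.items()}

for s in sparse_tps:
    print(
        f"sparsity {s:.0%}: speed-up {speedup[s]:.2f}x "
        f"(idealized {ideal[s]:.2f}x), perplexity +{ppl_increase[s]:.1%}"
    )
```

The measured 1.53× and 1.78× sit below the idealized 1.67× and 2.00×, which is expected since not all decoding time is spent loading sparsifiable weights.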

Who Should Care

What To Try In 7 Days

Calibrate TEAL's thresholds offline on your model's activations using a small generic text sample (C4-like).

Apply TEAL at 25% and 40% model-wide sparsity and measure tokens/sec and task quality on your workloads.

Integrate the provided Triton sparse GEMV kernel or adapt similar column-major masked loads for your runtime to see wall-clock benefits on A100/A6000-class GPUs.
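For the tokens/sec measurement, a small timing harness along these lines may be enough to compare a dense and a sparsified configuration. It is only a sketch: `generate_fn` is a hypothetical callable that decodes the requested number of tokens and returns how many it produced, and `fake_generate` is a placeholder workload. On a GPU the callable would also need to wait for the device to finish before returning, otherwise the timing is meaningless.

```python
import time

def tokens_per_second(generate_fn, n_tokens, warmup=1, repeats=3):
    """Return the best observed decoding rate over several timed runs."""
    for _ in range(warmup):
        generate_fn(n_tokens)  # discard warm-up runs (caches, compilation)
    best_seconds_per_token = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate_fn(n_tokens)
        elapsed = max(time.perf_counter() - start, 1e-12)
        best_seconds_per_token = min(
            best_seconds_per_token, elapsed / max(produced, 1)
        )
    return 1.0 / best_seconds_per_token

def fake_generate(n_tokens):
    """Placeholder workload standing in for a real single-batch decode loop."""
    acc = 0
    for step in range(n_tokens):
        acc += sum(range(1000))  # pretend per-token work
    return n_tokens

rate = tokens_per_second(fake_generate, n_tokens=64)
print(f"{rate:.1f} tokens/sec")
```

Running the same harness at 0%, 25%, and 40% sparsity with a fixed prompt and token budget keeps the comparison like-for-like.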

Optimization Features

Token Efficiency

  • reduces weight transfer per token in single-batch decoding

Infra Optimization

  • reduces transfers from GPU global memory to registers; benefits scale with device memory bandwidth

Model Optimization

  • magnitude-based activation pruning (channel-wise)
  • block-wise greedy per-layer sparsity allocation

System Optimization

  • PTX eviction hints to keep activations in L2 cache
  • mask fusion to avoid extra memory writes

Training Optimization

  • LoRA

Inference Optimization

  • Triton sparse GEMV kernel with mask fusion
  • column-major storage for input sparsity; row-major for output sparsity
  • SplitK decomposition and FP16 outer accumulation
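The storage-format point can be seen from plain address arithmetic. With input sparsity, whole columns of the weight matrix are skipped, so each surviving column should be contiguous in memory. The helper below is an illustrative sketch (the name `column_offsets` is an assumption) that lists the flat offsets one column occupies under each layout.

```python
def column_offsets(rows, cols, j, order):
    """Flat memory offsets of column j in a rows x cols matrix."""
    if order == "col":
        # Column-major: the elements of one column sit next to each other.
        return [j * rows + i for i in range(rows)]
    if order == "row":
        # Row-major: consecutive column elements are `cols` entries apart.
        return [i * cols + j for i in range(rows)]
    raise ValueError("order must be 'col' or 'row'")

print(column_offsets(4, 3, 1, "col"))  # [4, 5, 6, 7]
print(column_offsets(4, 3, 1, "row"))  # [1, 4, 7, 10]
```

Contiguous offsets let a kernel fetch a kept column with coalesced loads, whereas the strided row-major pattern would pull in neighboring elements from columns that were meant to be skipped.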

Reproducibility

Data Urls

  • C4 (public)
  • WikiText (public)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Designed for single-batch/low-batch decoding; benefits shrink at large batch sizes.
  • Sparsifying prefill tokens can break early-token attention (attention sinks); avoid sparsifying initial tokens.
  • Requires specialized sparse GEMV kernel and memory-format changes to realize wall-clock gains.
  • Quality degrades rapidly past ~65% sparsity for most models.
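One way to respect the prefill limitation above is to leave the first few token positions dense and threshold only the rest. The sketch below assumes hidden states arrive as one row per token; the function name and the `n_dense_prefix` parameter are hypothetical, and the right prefix length would need to be found empirically.

```python
def sparsify_prefill(hidden, threshold, n_dense_prefix=4):
    """Threshold activations per token, leaving the first tokens untouched."""
    out = []
    for pos, row in enumerate(hidden):
        if pos < n_dense_prefix:
            out.append(list(row))  # keep initial tokens fully dense
        else:
            out.append([v if abs(v) > threshold else 0.0 for v in row])
    return out

# Three tokens with two hidden channels each; only token 0 stays dense.
hidden = [[0.1, 2.0], [0.1, 2.0], [0.1, 2.0]]
print(sparsify_prefill(hidden, threshold=0.5, n_dense_prefix=1))
# [[0.1, 2.0], [0.0, 2.0], [0.0, 2.0]]
```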

When Not To Use

  • High-batch, throughput-first servers, where weight-sparsity methods such as 2:4 structured sparsity excel.
  • Workloads that require sparsifying the LM head (not yet supported) or cannot accept any perplexity increase.
  • Environments without the ability to deploy custom Triton kernels or adjust the weight storage format.

Failure Modes

  • Sharp quality drop at very high sparsity (≥65%) for many models.
  • Errors compound when combined with aggressive quantization; validate the two jointly.
  • Attention-sink sensitivity if early prefill tokens are sparsified.
  • Batch inference may lose effectiveness when inputs prefer different sparsity masks.

Core Entities

Models

  • Llama-2
  • Llama-3
  • Mistral

Metrics

  • perplexity
  • tokens per second
  • Accuracy

Datasets

  • WikiText
  • EleutherAI LM Harness
  • C4

Benchmarks

  • perplexity
  • downstream aggregate (MMLU, ARC, HellaSwag, GSM8K, PiQA, Winogrande)
  • tokens/sec (single-batch decoding throughput)