Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.
Summary TLDR
TEAL is a simple, training-free method that zeros low-magnitude activations across every matrix in modern LLMs. Using thresholds estimated offline, TEAL reaches ~40–50% model-wide activation sparsity with small accuracy loss on Llama-2/3 and Mistral (7B–70B). With a Triton-based sparse GEMV kernel, TEAL yields practical single-batch decoding speed-ups up to ~1.5×–1.8× and works alongside weight quantization. Main limits: benefits are strongest for single-batch (edge) settings and require specialized kernels and care when sparsifying prefill tokens.
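The thresholding step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `calibrate_threshold` and `sparsify` are hypothetical, and the threshold is assumed to be the empirical quantile of activation magnitudes at the target sparsity level, estimated once on calibration data.

```python
import random

def calibrate_threshold(calib_activations, sparsity):
    """Offline step (assumed form): pick the magnitude cutoff t such that
    roughly `sparsity` of the calibration magnitudes fall at or below t."""
    mags = sorted(abs(a) for a in calib_activations)
    n_zero = int(sparsity * len(mags))
    if n_zero == 0:
        return 0.0
    return mags[n_zero - 1]

def sparsify(activations, threshold):
    """Online step: zero every entry whose magnitude is at or below the cutoff."""
    return [a if abs(a) > threshold else 0.0 for a in activations]

# Toy calibration data standing in for real hidden states (an assumption).
random.seed(0)
calib = [random.gauss(0.0, 1.0) for _ in range(10_000)]
t = calibrate_threshold(calib, sparsity=0.5)
```

Because the cutoff is fixed offline, the sparsity reached on new inputs only approximates the target; it depends on how well the calibration sample matches the deployment distribution.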
Problem Statement
Modern LLM inference is memory-bound: moving large weight tensors from off-chip memory dominates latency. Activation sparsity can avoid transferring unused weight channels, but recent LLMs lack natural ReLU sparsity and prior fixes require heavy retraining. We need a practical, training-free approach that creates activation sparsity across modern models and yields real wall-clock speed-ups.
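To make the memory-bound argument concrete, the back-of-the-envelope sketch below counts the weight bytes read for one matrix-vector product and the resulting upper bound on speed-up. The helper names and the 4096×4096 FP16 shape are illustrative assumptions, and the bound ignores kernel overhead, attention, and any other non-weight traffic.

```python
def gemv_weight_bytes(d_out, d_in, bytes_per_weight=2, activation_sparsity=0.0):
    """Weight bytes read for y = W x when the columns of W that pair with
    zeroed input activations are never loaded."""
    active_cols = round(d_in * (1.0 - activation_sparsity))
    return d_out * active_cols * bytes_per_weight

def ideal_speedup(activation_sparsity):
    """Upper bound if latency were purely proportional to weight bytes moved."""
    return 1.0 / (1.0 - activation_sparsity)

dense_bytes = gemv_weight_bytes(4096, 4096)                            # FP16, dense
sparse_bytes = gemv_weight_bytes(4096, 4096, activation_sparsity=0.5)  # 50% sparse
print(dense_bytes, sparse_bytes, ideal_speedup(0.5))
```

Under these assumptions 50% sparsity halves the weight traffic, so any measured speed-up should sit below the 2× bound.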
Main Contribution
Introduce TEAL: a training-free, magnitude-based activation sparsity method that thresholds activations for all weight matrices in a Transformer block.
Block-wise greedy algorithm to assign per-layer sparsity under a model-level target, using offline activation statistics.
A Triton-based sparse GEMV kernel with mask fusion, FP16 SplitK accumulation, and cache eviction hints to realize real wall-clock speed-ups.
Empirical results across Llama-2, Llama-3, and Mistral (7B–70B): 40–50% sparsity with minor accuracy loss and up to 1.8× single-batch decoding speed-up. Demonstrated compatibility with common post-training weight quantizers.
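One plausible reading of the block-wise greedy allocation is sketched below. It is an assumption-laden illustration rather than the paper's algorithm: `greedy_block_sparsity`, the fixed `step`, and the `block_error` callback (which in practice would measure block output error on calibration activations) are all hypothetical.

```python
def greedy_block_sparsity(layer_sizes, block_error, target, step=0.05,
                          max_sparsity=0.9):
    """Raise per-layer sparsity one step at a time, always choosing the layer
    whose increment yields the lowest block error, until the parameter-weighted
    average sparsity reaches `target`.

    layer_sizes: dict mapping layer name -> parameter count
    block_error: callable taking {name: sparsity} and returning a float
    """
    total = sum(layer_sizes.values())
    sparsity = {name: 0.0 for name in layer_sizes}

    def achieved(cfg):
        return sum(layer_sizes[n] * s for n, s in cfg.items()) / total

    while achieved(sparsity) < target - 1e-9:
        best_name, best_err = None, float("inf")
        for name in layer_sizes:
            if sparsity[name] + step > max_sparsity + 1e-9:
                continue  # this layer is already at its cap
            trial = dict(sparsity)
            trial[name] += step
            err = block_error(trial)
            if err < best_err:
                best_name, best_err = name, err
        if best_name is None:
            break  # every layer is capped; target unreachable
        sparsity[best_name] = round(sparsity[best_name] + step, 10)
    return sparsity
```

With a toy error function in which one layer is four times more sensitive than another, the less sensitive layer ends up with the higher sparsity, which is the behavior a uniform allocation cannot provide.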
Key Findings
TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.
Single-batch decoding shows wall-clock speed-ups of up to ~1.5× to 1.8× with TEAL and a specialized sparse GEMV kernel.
TEAL is compatible with common weight quantization schemes; the errors from sparsity and quantization add up, but the two error sources appear largely independent of each other.
TEAL's block-wise greedy sparsity beats uniform sparsity and the CATS baseline in end metrics.
Results
Model-wide activation sparsity
Perplexity (Llama-3-8B on WikiText)
Accuracy
End-to-end single-batch throughput (tokens/sec)
Who Should Care
What To Try In 7 Days
Estimate TEAL thresholds offline from your model's activations using a small generic text sample (C4-like).
Apply TEAL at 25% and 40% model-wide sparsity and measure tokens/sec and task quality on your workloads.
Integrate the provided Triton sparse GEMV kernel or adapt similar column-major masked loads for your runtime to see wall-clock benefits on A100/A6000-class GPUs.
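For the tokens/sec comparison, a small timing harness along the following lines may help. It is a sketch under assumptions: `measure_tokens_per_sec` is a hypothetical helper, and `generate_fn` stands in for whatever decoding call your runtime exposes, assumed here to return the list of generated tokens.

```python
import time

def measure_tokens_per_sec(generate_fn, prompt, n_tokens, warmup=1, repeats=3):
    """Time `generate_fn(prompt, n_tokens)` and report the best-of-`repeats`
    decoding throughput in tokens per second."""
    for _ in range(warmup):
        generate_fn(prompt, n_tokens)  # warm caches and any JIT compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        out = generate_fn(prompt, n_tokens)
        elapsed = time.perf_counter() - start
        if len(out) != n_tokens:
            raise ValueError("generate_fn returned an unexpected token count")
        best = min(best, elapsed)
    return n_tokens / best
```

Run it once with the dense model and once per sparsity level, using the same prompt and token count. On a GPU, make sure `generate_fn` blocks until the device has finished (for example by synchronizing before it returns), otherwise the timings will be optimistic.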
Optimization Features
Token Efficiency
- reduces weight transfer per token in single-batch decoding
Infra Optimization
- reduces weight transfers from GPU global memory to registers; benefits scale with device memory bandwidth
Model Optimization
- magnitude-based activation pruning (channel-wise)
- block-wise greedy per-layer sparsity allocation
System Optimization
- PTX eviction hints to keep activations in L2 cache
- mask fusion to avoid extra memory writes
Training Optimization
- LoRA
Inference Optimization
- Triton sparse GEMV kernel with mask fusion
- column-major storage for input sparsity; row-major for output sparsity
- SplitK decomposition and FP16 outer accumulation
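The reason column-major storage suits input sparsity can be seen in a plain-Python sketch of a masked GEMV. This is not the Triton kernel; `sparse_gemv` is a hypothetical reference routine that stores the weight matrix as a list of columns so that each skipped activation corresponds to one contiguous column that is never read.

```python
def sparse_gemv(columns, x, threshold):
    """Compute y = W x, where columns[j] is column j of W.

    The magnitude mask is applied inside the loop (a stand-in for mask
    fusion): columns paired with low-magnitude inputs are never touched.
    Returns the output vector and the number of columns actually read.
    """
    d_out = len(columns[0])
    y = [0.0] * d_out
    cols_read = 0
    for j, xj in enumerate(x):
        if abs(xj) <= threshold:
            continue  # activation treated as zero: skip the whole column
        cols_read += 1
        col = columns[j]
        for i in range(d_out):
            y[i] += col[i] * xj
    return y, cols_read

# 2x3 weight matrix stored column by column (toy values).
W_cols = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y, cols_read = sparse_gemv(W_cols, [0.05, -2.0, 0.5], threshold=0.1)
```

In this toy case only two of the three columns are read, and the result equals the dense product computed on the thresholded input.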
Reproducibility
Data Urls
- C4 (public)
- WikiText (public)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Designed for single-batch/low-batch decoding; benefits shrink at large batch sizes.
- Sparsifying prefill tokens can break early-token attention (attention sinks); avoid sparsifying initial tokens.
- Requires a specialized sparse GEMV kernel and memory-format changes to realize wall-clock gains.
- Quality degrades rapidly past ~65% sparsity for most models.
When Not To Use
- High-batch, throughput-first servers where weight-sparsity methods such as 2:4 sparsity excel.
- Workloads that require sparsifying the LM head (not yet supported) or cannot accept any perplexity increase.
- Environments without the ability to deploy custom Triton kernels or change the weight storage format.
Failure Modes
- Sharp quality drop at very high sparsity (≥65%) for many models.
- Compound errors when combined with aggressive quantization; validate jointly.
- Attention-sink sensitivity if early prefill tokens are sparsified.
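- Batched inference may lose effectiveness when the inputs in a batch prefer different sparsity masks, since a weight channel can be skipped only if every input in the batch leaves it inactive.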
Core Entities
Models
- Llama-2
- Llama-3
- Mistral
Metrics
- perplexity
- tokens per second
- accuracy
Datasets
- WikiText
- EleutherAI LM Harness
- C4
Benchmarks
- perplexity
- downstream aggregate (MMLU, ARC, HellaSwag, GSM8K, PiQA, Winogrande)
- tokens/sec throughput

