Overview
The method is simple and training-free, produces measurable single-batch speed-ups on A100/A6000, and is compatible with quantization, but requires specialized kernels and is best for low-batch deployments.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.
Summary TLDR
TEAL is a simple, training-free method that zeros low-magnitude activations feeding every weight matrix in the Transformer blocks of modern LLMs. Using thresholds estimated offline, it reaches ~40–50% model-wide activation sparsity with small accuracy loss on Llama-2/3 and Mistral (7B–70B). With a Triton-based sparse GEMV kernel, TEAL yields single-batch decoding speed-ups of roughly 1.5× to 1.8× and works alongside weight quantization. Main limits: benefits are strongest in single-batch (edge) settings, wall-clock gains depend on specialized kernels, and sparsifying prefill tokens requires care.
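The offline-threshold idea can be illustrated with a minimal sketch. It assumes a single layer, a quantile-style calibration rule, and Gaussian samples standing in for real hidden states; the function names `calibrate_threshold` and `sparsify` are hypothetical and not taken from the TEAL code.

```python
import random
from typing import List


def calibrate_threshold(calibration_acts: List[float], sparsity: float) -> float:
    """Pick t so that about `sparsity` of the calibration magnitudes are <= t."""
    mags = sorted(abs(v) for v in calibration_acts)
    k = int(sparsity * len(mags))  # number of entries we want to zero
    return mags[k - 1] if k > 0 else 0.0


def sparsify(acts: List[float], threshold: float) -> List[float]:
    """Zero every activation whose magnitude does not exceed the threshold."""
    return [v if abs(v) > threshold else 0.0 for v in acts]


random.seed(0)
# Assumption: zero-mean Gaussian samples stand in for real hidden states.
calibration = [random.gauss(0.0, 1.0) for _ in range(10_000)]
threshold = calibrate_threshold(calibration, sparsity=0.5)

# At inference time the threshold is fixed; only the comparison runs.
fresh = [random.gauss(0.0, 1.0) for _ in range(10_000)]
sparse = sparsify(fresh, threshold)
realized = sum(1 for v in sparse if v == 0.0) / len(sparse)
print(f"threshold={threshold:.3f}, realized sparsity={realized:.3f}")
```

Because the threshold is computed once from calibration data, the realized sparsity on new inputs only approximates the target; how closely depends on how well the calibration sample matches the deployment distribution.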
Problem Statement
Modern LLM inference is memory-bound: moving large weight tensors from off-chip memory dominates latency. Activation sparsity can avoid transferring unused weight channels, but recent LLMs lack natural ReLU sparsity and prior fixes require heavy retraining. We need a practical, training-free approach that creates activation sparsity across modern models and yields real wall-clock speed-ups.
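To make the memory argument concrete, the sketch below compares a dense matrix-vector product with one that skips weight columns whose input activation is zero. It is plain Python for readability, not a GPU kernel, and the helper names are illustrative assumptions.

```python
from typing import List, Tuple


def dense_gemv(W: List[List[float]], x: List[float]) -> List[float]:
    """y = W @ x, touching every weight."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sparse_gemv(W_cols: List[List[float]], x: List[float]) -> Tuple[List[float], int]:
    """y = W @ x with W stored column-major; columns with a zero activation are never read."""
    n_rows = len(W_cols[0])
    y = [0.0] * n_rows
    cols_read = 0
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue  # this column's weights are never loaded
        cols_read += 1
        col = W_cols[j]
        for i in range(n_rows):
            y[i] += col[i] * xj
    return y, cols_read


W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
W_cols = [list(col) for col in zip(*W)]  # column-major copy of W
x = [0.5, 0.0, 0.0, -1.0]                # half of the activations are zero

y_dense = dense_gemv(W, x)
y_sparse, cols_read = sparse_gemv(W_cols, x)
print(y_dense, y_sparse, f"columns read: {cols_read}/{len(x)}")
```

The result is identical to the dense product, but only two of the four weight columns are read. In a memory-bound regime that reduction in weight traffic is where the speed-up comes from.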
Main Contribution
Introduces TEAL, a training-free, magnitude-based activation sparsity method that thresholds the activations entering every weight matrix in a Transformer block.
Proposes a block-wise greedy algorithm that assigns per-layer sparsity levels under a model-level target, using offline activation statistics.
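One plausible reading of the greedy allocation is sketched below. The summary does not spell out the error metric or step size, so several choices here are assumptions: the error proxy is the squared magnitude of the activations that get zeroed, sparsity rises in fixed 5% steps, and layers are weighted by parameter count. The names `greedy_allocate` and `zeroed_energy` are hypothetical.

```python
import random
from typing import Dict, List


def zeroed_energy(sorted_mags: List[float], sparsity: float) -> float:
    """Sum of squared magnitudes removed when the smallest `sparsity` share is zeroed."""
    k = int(round(sparsity * len(sorted_mags)))
    return sum(m * m for m in sorted_mags[:k])


def greedy_allocate(acts: Dict[str, List[float]], sizes: Dict[str, int],
                    target: float, step: float = 0.05) -> Dict[str, float]:
    """Raise per-layer sparsity one step at a time until the size-weighted
    block sparsity reaches `target`."""
    mags = {name: sorted(abs(v) for v in vals) for name, vals in acts.items()}
    levels = {name: 0 for name in acts}  # integer step counts avoid float drift
    max_level = int(round(1.0 / step))
    total = sum(sizes.values())

    def block_sparsity() -> float:
        return sum(sizes[n] * levels[n] * step for n in levels) / total

    while block_sparsity() < target - 1e-9:
        best, best_cost = None, float("inf")
        for name in levels:
            if levels[name] >= max_level:
                continue
            cur, nxt = levels[name] * step, (levels[name] + 1) * step
            added = zeroed_energy(mags[name], nxt) - zeroed_energy(mags[name], cur)
            cost = added / sizes[name]  # error added, normalized by weight traffic saved
            if cost < best_cost:
                best, best_cost = name, cost
        if best is None:  # every layer is already fully sparse
            break
        levels[best] += 1
    return {name: levels[name] * step for name in levels}


random.seed(0)
# Assumption: two toy layers, one with activations concentrated near zero.
acts = {
    "peaky": [random.gauss(0.0, 0.1) for _ in range(1000)],
    "flat": [random.gauss(0.0, 1.0) for _ in range(1000)],
}
sizes = {"peaky": 4096 * 4096, "flat": 4096 * 4096}
alloc = greedy_allocate(acts, sizes, target=0.4)
print(alloc)
```

In this toy setup the layer whose activations cluster near zero absorbs more of the sparsity budget, which is the intended effect of allocating per layer rather than uniformly.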
Key Findings
TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.
TEAL with a specialized sparse GEMV kernel delivers single-batch decoding wall-clock speed-ups of roughly 1.5× to 1.8×.
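A back-of-the-envelope bound helps put the speed-up figures in context. The sketch assumes an idealized memory-bound model in which decode time is proportional to the weight bytes read and every matrix is sparsified equally; real kernels add overhead, so measured gains should fall below this bound.

```python
def ideal_speedup(sparsity: float) -> float:
    """Upper bound on speed-up when runtime is proportional to weight bytes read."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


bounds = {s: ideal_speedup(s) for s in (0.25, 0.40, 0.50)}
for s, bound in bounds.items():
    print(f"{s:.0%} sparsity -> at most {bound:.2f}x")
```

Under this model 50% sparsity caps out at 2×, so a reported gain near 1.8× would already be close to what skipping half of the weight traffic can deliver.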
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Model-wide activation sparsity | 40–50% | 0% (no sparsity) | ↑ sparsity to 40–50% | evaluated across Llama-2/3 and Mistral families | Abstract; Section 5.1; Table 1 | Table 1 |
| Perplexity (Llama-3-8B, WikiText) | 6.21 at 40% sparsity; 6.67 at 50% sparsity | 5.87 (0% sparsity) | +0.34 (40%); +0.80 (50%) | WikiText validation | Table 1; Section 5.1 | Table 1 |
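The perplexity deltas in the table can be restated as relative increases, which are easier to compare across models. The short calculation below uses only the three perplexity values from the table.

```python
baseline_ppl = 5.87
sparse_ppl = {0.40: 6.21, 0.50: 6.67}  # sparsity level -> perplexity

relative_increase = {}
for sparsity, ppl in sparse_ppl.items():
    delta = ppl - baseline_ppl
    relative_increase[sparsity] = 100.0 * delta / baseline_ppl
    print(f"{sparsity:.0%} sparsity: +{delta:.2f} PPL "
          f"({relative_increase[sparsity]:.1f}% over baseline)")
```

This works out to roughly a 5.8% perplexity increase at 40% sparsity and 13.6% at 50%.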
What To Try In 7 Days
Estimate TEAL thresholds offline from your model's activations using a small generic text sample (C4-like).
Apply TEAL at 25% and 40% model-wide sparsity and measure tokens/sec and task quality on your workloads.
Integrate the provided Triton sparse GEMV kernel or adapt similar column-major masked loads for your runtime to see wall-clock benefits on A100/A6000-class GPUs.
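For the tokens/sec comparison, a small timing harness along the following lines may be enough to start. It assumes you can wrap one decoding step of each configuration in a zero-argument callable; `decode_step`, `tokens_per_sec`, and `compare` are hypothetical names, and the stand-in workloads exist only so the sketch runs on its own.

```python
import time
from typing import Callable, Dict


def tokens_per_sec(decode_step: Callable[[], None], n_tokens: int, warmup: int = 5) -> float:
    """Time `n_tokens` single-token decode steps after a short warm-up."""
    for _ in range(warmup):
        decode_step()
    start = time.perf_counter()
    for _ in range(n_tokens):
        decode_step()
    # Note: on a GPU, synchronize the device before reading the clock,
    # otherwise only kernel launch time is measured.
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def compare(configs: Dict[str, Callable[[], None]], n_tokens: int = 200) -> Dict[str, float]:
    """Return each configuration's speed-up relative to the 'dense' entry."""
    rates = {name: tokens_per_sec(step, n_tokens) for name, step in configs.items()}
    return {name: rate / rates["dense"] for name, rate in rates.items()}


# Stand-in workloads (assumption): replace with real decode steps at 0%, 25%, 40% sparsity.
configs = {
    "dense": lambda: sum(range(4000)),
    "teal_25": lambda: sum(range(3000)),
    "teal_40": lambda: sum(range(2400)),
}
speedups = compare(configs)
print({name: round(s, 2) for name, s in speedups.items()})
```

Pair each speed-up with a task-quality number measured on the same configuration, so the sparsity level is chosen on both axes rather than on throughput alone.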
Risks & Boundaries
Limitations
Designed for single-batch or low-batch decoding; benefits diminish at large batch sizes.
Sparsifying prefill tokens can break early-token attention (attention sinks); avoid sparsifying initial tokens.
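The prefill caveat can be handled by exempting the first few positions from thresholding. The sketch below assumes activations arrive as one row per token position; the default of four protected positions is an illustrative assumption, not a value taken from the source.

```python
from typing import List


def sparsify_prefill(acts: List[List[float]], threshold: float,
                     n_protected: int = 4) -> List[List[float]]:
    """Threshold activations per token, leaving the first `n_protected` positions dense."""
    out = []
    for pos, row in enumerate(acts):
        if pos < n_protected:
            out.append(list(row))  # initial tokens stay untouched
        else:
            out.append([v if abs(v) > threshold else 0.0 for v in row])
    return out


# Three token positions with identical activations, first two protected.
acts = [[0.1, -2.0], [0.1, -2.0], [0.1, -2.0]]
result = sparsify_prefill(acts, threshold=1.0, n_protected=2)
print(result)
```

How many initial positions to protect is worth tuning per model, since the text only indicates that the earliest tokens are the sensitive ones.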
When Not To Use
High-batch, throughput-first servers, where weight-sparsity methods such as 2:4 structured sparsity excel.
Workloads that require sparsifying the LM head (not yet supported) or that cannot accept any perplexity increase.
Failure Modes
Sharp quality drop at very high sparsity (≥65%) for many models.
Errors can compound when TEAL is combined with aggressive quantization; validate the two jointly.
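A joint check can start from a toy layer before moving to full task metrics. The sketch below measures relative output error under activation sparsity alone, weight quantization alone, and both together. Everything in it is an assumption for illustration: the random weights, the 4-bit symmetric per-row quantizer, and the per-input threshold, which differs from the offline thresholds described in the summary.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matvec(W: Matrix, x: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fake_quantize(W: Matrix, bits: int) -> Matrix:
    """Symmetric per-row round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in W:
        scale = max(abs(w) for w in row) / qmax
        if scale == 0.0:
            scale = 1.0
        out.append([round(w / scale) * scale for w in row])
    return out


def sparsify(x: List[float], sparsity: float) -> List[float]:
    """Zero the smallest-magnitude share of this one input vector."""
    mags = sorted(abs(v) for v in x)
    k = int(sparsity * len(mags))
    t = mags[k - 1] if k > 0 else 0.0
    return [v if abs(v) > t else 0.0 for v in x]


def rel_error(y: List[float], y_ref: List[float]) -> float:
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_ref)))
    den = math.sqrt(sum(b * b for b in y_ref))
    return num / den


random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(16)]
x = [random.gauss(0.0, 1.0) for _ in range(64)]
y_ref = matvec(W, x)

W_q, x_s = fake_quantize(W, bits=4), sparsify(x, sparsity=0.5)
errors = {
    "sparsity only": rel_error(matvec(W, x_s), y_ref),
    "quantization only": rel_error(matvec(W_q, x), y_ref),
    "both": rel_error(matvec(W_q, x_s), y_ref),
}
for name, err in errors.items():
    print(f"{name}: relative output error {err:.3f}")
```

The three numbers are meant to be compared side by side on your own layers and data; the toy values printed here say nothing about any particular model.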

