TEAL: thresholding hidden activations to cut memory movement and speed up LLM decoding without extra training

August 26, 2024 | 7 min

Overview

Decision Snapshot: Needs Validation

The method is simple and training-free, produces measurable single-batch speed-ups on A100/A6000, and is compatible with quantization, but requires specialized kernels and is best for low-batch deployments.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.

Who Should Care

Summary TLDR

TEAL is a simple, training-free method that zeros low-magnitude activations across every matrix in modern LLMs. Using thresholds estimated offline, TEAL reaches ~40–50% model-wide activation sparsity with small accuracy loss on Llama-2/3 and Mistral (7B–70B). With a Triton-based sparse GEMV kernel, TEAL yields practical single-batch decoding speed-ups up to ~1.5×–1.8× and works alongside weight quantization. Main limits: benefits are strongest for single-batch (edge) settings and require specialized kernels and care when sparsifying prefill tokens.
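The offline thresholding step can be illustrated with a minimal sketch. It assumes the threshold for a tensor is taken as an empirical quantile of activation magnitudes over a calibration sample. The helper names (`calibrate_threshold`, `sparsify`, `zero_fraction`) and the Gaussian stand-in data are hypothetical and are not taken from the TEAL code.

```python
import random

def calibrate_threshold(calib_values, sparsity):
    """Pick t so that about `sparsity` of |calib_values| lie strictly below t."""
    mags = sorted(abs(v) for v in calib_values)
    k = min(int(sparsity * len(mags)), len(mags) - 1)
    return mags[k]

def sparsify(x, threshold):
    """Zero every entry whose magnitude is below the threshold."""
    return [v if abs(v) >= threshold else 0.0 for v in x]

def zero_fraction(x):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in x if v == 0.0) / len(x)

rng = random.Random(0)

# Assumption: Gaussian samples stand in for real calibration activations.
calib = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
t40 = calibrate_threshold(calib, 0.40)

# At inference time the fixed threshold is applied to unseen activations.
fresh = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
realized = zero_fraction(sparsify(fresh, t40))
```

Because the threshold is fixed offline, the sparsity realized on new inputs only approximates the target. It tracks the target closely when the calibration sample resembles the deployment distribution.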

Problem Statement

Modern LLM inference is memory-bound: moving large weight tensors from off-chip memory dominates latency. Activation sparsity can avoid transferring unused weight channels, but recent LLMs lack natural ReLU sparsity and prior fixes require heavy retraining. We need a practical, training-free approach that creates activation sparsity across modern models and yields real wall-clock speed-ups.
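To see why zeroed activations reduce weight traffic, consider a matrix-vector product with the weights stored column-major: any column whose matching activation is zero contributes nothing and never needs to be read. The pure-Python sketch below only illustrates that idea and does not represent the Triton kernel. The function names and sizes are illustrative assumptions.

```python
import random

def dense_gemv(W, x):
    """y = W @ x, touching every weight."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def sparse_gemv(W_cols, x):
    """y = W @ x with W stored as a list of columns; zero activations skip their column."""
    n_rows = len(W_cols[0])
    y = [0.0] * n_rows
    cols_read = 0
    for j, x_j in enumerate(x):
        if x_j == 0.0:
            continue  # this weight column would never be loaded from memory
        cols_read += 1
        col = W_cols[j]
        for i in range(n_rows):
            y[i] += col[i] * x_j
    return y, cols_read

rng = random.Random(0)
n_rows, n_cols = 4, 8
W = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
W_cols = [list(col) for col in zip(*W)]  # column-major view of the same weights

x = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
for j in (1, 3, 4, 6):
    x[j] = 0.0  # 50% activation sparsity

y_dense = dense_gemv(W, x)
y_sparse, cols_read = sparse_gemv(W_cols, x)
```

Here only 4 of the 8 columns are read, and the output matches the dense product. In a memory-bound decode step, the skipped reads are where the time is saved.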

Main Contribution

Introduce TEAL: a training-free, magnitude-based activation sparsity method that thresholds activations for all weight matrices in a Transformer block.

Block-wise greedy algorithm to assign per-layer sparsity under a model-level target, using offline activation statistics.
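The summary does not spell out the greedy procedure, so the following is only a simplified stand-in for how a greedy allocation under a global target might look. It assumes equally sized layers, a fixed step size, and an error proxy equal to the squared magnitude of the activations that would be zeroed. All names (`truncation_error`, `greedy_allocate`, the layer labels) are hypothetical.

```python
import random

def truncation_error(mags_sorted, sparsity):
    """Assumed error proxy: sum of squared magnitudes zeroed at this sparsity."""
    k = int(round(sparsity * len(mags_sorted)))
    return sum(m * m for m in mags_sorted[:k])

def greedy_allocate(layer_samples, target, step=0.05, max_sparsity=0.9):
    """Repeatedly raise sparsity by `step` on the layer where doing so costs least."""
    mags = {name: sorted(abs(v) for v in vals) for name, vals in layer_samples.items()}
    alloc = {name: 0.0 for name in mags}
    n_steps = round(target * len(mags) / step)  # steps needed to hit the mean target
    for _ in range(n_steps):
        best = None
        for name, m in mags.items():
            nxt = alloc[name] + step
            if nxt > max_sparsity + 1e-9:
                continue  # this layer is already at its cap
            cost = truncation_error(m, nxt) - truncation_error(m, alloc[name])
            if best is None or cost < best[0]:
                best = (cost, name)
        if best is None:
            break
        alloc[best[1]] = round(alloc[best[1]] + step, 10)
    return alloc

rng = random.Random(0)
# Assumption: three equally sized layers whose activations differ only in scale.
layers = {
    "small": [rng.gauss(0.0, 0.5) for _ in range(2000)],
    "mid": [rng.gauss(0.0, 1.0) for _ in range(2000)],
    "large": [rng.gauss(0.0, 2.0) for _ in range(2000)],
}
alloc = greedy_allocate(layers, target=0.40)
mean_sparsity = sum(alloc.values()) / len(alloc)
```

Under this proxy, layers with smaller activations absorb more sparsity while the average still meets the target. A real implementation would measure error on layer outputs rather than on raw activation magnitudes.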

Key Findings

TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.

Numbers: Perplexity: Llama-3-8B PPL 5.87 → 6.21 (40%), → 6.67 (50%) on WikiText

Practical Use: You can apply TEAL at ~40% sparsity to cut memory movement while keeping near-baseline language-model quality for many modern LLMs.

Evidence Ref: Table 1; Section 5.1

Single-batch decoding wall-clock speed-ups of up to ~1.5×–1.8× observed with TEAL and a specialized sparse GEMV kernel.

Numbers: Tokens/sec (A6000): Llama-2-7B 50.54 → 77.30 (1.53× at 40%), → 89.91 (1.78× at 50%)

Practical Use: If your workload is low-batch/edge inference, TEAL plus the sparse kernel can materially increase throughput without retraining.

Evidence Ref: Table 3; Section 5.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Model-wide activation sparsity | 40–50% | 0% (no sparsity) | ↑ sparsity to 40–50% | evaluated across Llama-2/3 and Mistral families | Abstract; Section 5.1; Table 1 | Table 1
Perplexity (Llama-3-8B on WikiText) | PPL 5.87 → 6.21 (40%) → 6.67 (50%) | 5.87 (0%) | +0.34 (40%), +0.80 (50%) | WikiText validation | Table 1; Section 5.1 | Table 1
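The reported deltas and speed-up factors follow directly from the raw figures quoted in this summary (Llama-2-7B tokens/sec on A6000 and Llama-3-8B perplexity on WikiText). A short check, using only those numbers:

```python
# Tokens/sec for Llama-2-7B on A6000, as quoted in Key Findings.
baseline_tps = 50.54
tps = {"40%": 77.30, "50%": 89.91}
speedup = {s: round(v / baseline_tps, 2) for s, v in tps.items()}

# WikiText perplexity for Llama-3-8B, as quoted in Results.
baseline_ppl = 5.87
ppl = {"40%": 6.21, "50%": 6.67}
ppl_delta = {s: round(v - baseline_ppl, 2) for s, v in ppl.items()}

print(speedup)    # speed-up factor per sparsity level
print(ppl_delta)  # perplexity increase per sparsity level
```

The same two ratios are the ones to recompute on your own hardware and evaluation set, since both depend on the device and the workload.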

What To Try In 7 Days

Run TEAL offline thresholds on your model's activations using a small generic text sample (C4-like).

Apply TEAL at 25% and 40% model-wide sparsity and measure tokens/sec and task quality on your workloads.

Integrate the provided Triton sparse GEMV kernel or adapt similar column-major masked loads for your runtime to see wall-clock benefits on A100/A6000-class GPUs.

Optimization Features

Token Efficiency
reduces weight transfer per token in single-batch decoding
Infra Optimization
reduces GPU register/global memory transfers; benefits scale with device memory bandwidth
Model Optimization
magnitude-based activation pruning (channel-wise); block-wise greedy per-layer sparsity allocation
System Optimization
PTX eviction hints to keep activations in L2 cache; mask fusion to avoid extra memory writes
Training Optimization
LoRA
Inference Optimization
Triton sparse GEMV kernel with mask fusion; column-major storage for input sparsity, row-major for output sparsity; SplitK decomposition and FP16 outer accumulation

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

C4 (public), WikiText (public)

Risks & Boundaries

Limitations

Designed for single-batch/low-batch decoding; benefits diminish at large batch sizes.

Sparsifying prefill tokens can break early-token attention (attention sinks); avoid sparsifying initial tokens.
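One way to respect the prefill caveat is to leave the first few token positions dense and threshold only the rest. The sketch below assumes per-token activation vectors and a hypothetical `n_protected` parameter. How many initial tokens need protecting is not stated here and would have to be validated per model.

```python
def sparsify_prefill(token_acts, threshold, n_protected=1):
    """Threshold every token's activations except the first `n_protected` positions."""
    out = []
    for pos, x in enumerate(token_acts):
        if pos < n_protected:
            out.append(list(x))  # keep early (attention-sink) tokens dense
        else:
            out.append([v if abs(v) >= threshold else 0.0 for v in x])
    return out

# Toy prefill of three tokens with two channels each (illustrative values).
acts = [[0.1, -2.0], [0.1, -2.0], [0.05, 0.3]]
masked = sparsify_prefill(acts, threshold=0.2, n_protected=1)
```

The first token keeps its small activation untouched, while the same value at later positions is zeroed.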

When Not To Use

High-batch, throughput-first servers where weight sparsity/2:4 methods excel.

Workloads that require sparsifying the LM head (not yet supported) or that cannot accept any perplexity increase.

Failure Modes

Sharp quality drop at very high sparsity (≥65%) for many models.

Compounding errors when combined with aggressive quantization; validate the two jointly.

Core Entities

Models

Llama-2, Llama-3, Mistral

Metrics

perplexity, tokens per second, accuracy

Datasets

WikiText, EleutherAI LM Harness, C4

Benchmarks

perplexity; downstream aggregate (MMLU, ARC, HellaSwag, GSM8K, PiQA, Winogrande); tokens/sec latency