TEAL: thresholding hidden activations to cut memory movement and speed up LLM decoding without extra training

Overview

Decision Snapshot: Needs Validation

The method is simple and training-free, produces measurable single-batch speed-ups on A100/A6000, and is compatible with quantization, but requires specialized kernels and is best for low-batch deployments.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

TEAL cuts memory movement during single-batch decoding and delivers up to ~1.8× throughput gain without retraining, lowering inference cost for edge or low-latency deployments.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Founder

Summary TLDR

TEAL is a simple, training-free method that zeros low-magnitude activations feeding every weight matrix in modern LLMs. Using thresholds estimated offline, TEAL reaches ~40–50% model-wide activation sparsity with small accuracy loss on Llama-2/3 and Mistral (7B–70B). With a Triton-based sparse GEMV kernel, TEAL yields practical single-batch decoding speed-ups of up to ~1.5× at 40% sparsity and ~1.8× at 50%, and it works alongside weight quantization. Main limits: benefits are strongest for single-batch (edge) settings, specialized kernels are required, and prefill tokens must be sparsified with care.

Problem Statement

Modern LLM inference is memory-bound: moving large weight tensors from off-chip memory dominates latency. Activation sparsity can avoid transferring unused weight channels, but recent LLMs lack natural ReLU sparsity and prior fixes require heavy retraining. We need a practical, training-free approach that creates activation sparsity across modern models and yields real wall-clock speed-ups.
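To make the memory-movement argument concrete, here is a minimal pure-Python sketch (not the paper's kernel) of a matrix-vector product that skips every weight column whose input activation is exactly zero. The names `dense_gemv`, `sparse_gemv`, and `W_cols` are illustrative assumptions. On real hardware the saving comes from never loading the skipped columns from off-chip memory.

```python
def dense_gemv(W, x):
    """y = W @ x, reading every column of W (W is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sparse_gemv(W_cols, x):
    """Same product with W stored column-major (a list of columns).

    Columns paired with a zero activation are never read, which is the
    weight traffic that activation sparsity aims to avoid.
    """
    n_rows = len(W_cols[0])
    y = [0.0] * n_rows
    cols_read = 0
    for j, x_j in enumerate(x):
        if x_j == 0.0:
            continue  # this weight column would not be loaded at all
        cols_read += 1
        col = W_cols[j]
        for i in range(n_rows):
            y[i] += col[i] * x_j
    return y, cols_read


W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
W_cols = [list(c) for c in zip(*W)]   # column-major copy of W
x = [0.0, 3.0, 0.0, -1.0]             # half of the activations are zero

y_dense = dense_gemv(W, x)
y_sparse, cols_read = sparse_gemv(W_cols, x)
```

In this toy case both functions return the same vector, while the sparse version touches only two of the four weight columns.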

Main Contribution

Introduce TEAL: a training-free, magnitude-based activation sparsity method that thresholds activations for all weight matrices in a Transformer block.

Block-wise greedy algorithm to assign per-layer sparsity under a model-level target, using offline activation statistics.
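As a rough illustration of magnitude-based thresholding with an offline-estimated cutoff, the sketch below picks a threshold as an empirical quantile of calibration magnitudes and then zeros activations below it. Using a plain empirical quantile is an assumption made here for illustration; the helper names `calibrate_threshold` and `sparsify` are hypothetical and are not taken from the TEAL code base.

```python
import random


def calibrate_threshold(samples, sparsity):
    """Return t such that roughly `sparsity` of |samples| fall below t."""
    mags = sorted(abs(v) for v in samples)
    k = int(sparsity * len(mags))
    k = min(max(k, 0), len(mags) - 1)
    return mags[k]


def sparsify(x, t):
    """Zero every entry whose magnitude is below the threshold t."""
    return [v if abs(v) >= t else 0.0 for v in x]


rng = random.Random(0)

# Offline step: activations collected on a small calibration sample
# (synthetic zero-mean Gaussian values stand in for real activations).
calib = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
t = calibrate_threshold(calib, sparsity=0.4)

# Online step: apply the fixed threshold to unseen activations.
x = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
x_sparse = sparsify(x, t)
achieved = sum(v == 0.0 for v in x_sparse) / len(x_sparse)
```

Because the threshold is fixed ahead of time, the achieved sparsity on new inputs is only approximately the target and should be checked on representative data.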

Key Findings

TEAL achieves 40–50% model-wide activation sparsity with small accuracy degradation on evaluated Llama-2/3 and Mistral models.

Numbers: Llama-3-8B perplexity on WikiText rises from 5.87 to 6.21 (40% sparsity) and 6.67 (50% sparsity)

Practical Use: You can apply TEAL at ~40% sparsity to cut memory movement while keeping near-baseline language-model quality for many modern LLMs.

Evidence Ref: Table 1; Section 5.1

Single-batch decoding shows wall-clock speed-ups of up to ~1.5× (40% sparsity) and ~1.8× (50% sparsity) with TEAL and a specialized sparse GEMV kernel.

Numbers: Tokens/sec (A6000): Llama-2-7B 50.54→77.30 (1.53× at 40%), →89.91 (1.78× at 50%)

Practical Use: If your workload is low-batch/edge inference, TEAL plus the sparse kernel can materially increase throughput without retraining.

Evidence Ref: Table 3; Section 5.2
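The speed-up factors quoted for Llama-2-7B on the A6000 follow directly from the tokens/sec figures. A short check, using only the numbers reported above (variable names are illustrative):

```python
baseline_tps = 50.54                      # dense Llama-2-7B, tokens/sec
sparse_tps = {"40%": 77.30, "50%": 89.91}  # TEAL at each sparsity level

# Speed-up is the ratio of sparse to dense throughput.
speedups = {level: round(tps / baseline_tps, 2)
            for level, tps in sparse_tps.items()}

for level, factor in speedups.items():
    print(f"{level} sparsity: {factor:.2f}x")
```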

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Model-wide activation sparsity	40–50%	0% (no sparsity)	↑ sparsity to 40–50%	evaluated across Llama-2/3 and Mistral families	Abstract; Section 5.1; Table 1	Table 1
Perplexity (Llama-3-8B on WikiText)	6.21 (40%), 6.67 (50%)	5.87 (0%)	+0.34 (40%), +0.80 (50%)	WikiText validation	Table 1; Section 5.1	Table 1

What To Try In 7 Days

Run TEAL offline thresholds on your model's activations using a small generic text sample (C4-like).

Apply TEAL at 25% and 40% model-wide sparsity and measure tokens/sec and task quality on your workloads.

Integrate the provided Triton sparse GEMV kernel or adapt similar column-major masked loads for your runtime to see wall-clock benefits on A100/A6000-class GPUs.
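For the measurement step, a small timing harness along the following lines may be enough to compare dense and sparsified runs. This is a sketch under assumptions: `generate_fn` is a hypothetical callable that decodes `n_tokens` tokens, and `fake_generate` is a CPU stand-in so the example runs without a model.

```python
import time


def tokens_per_second(generate_fn, n_tokens, warmup=1, repeats=3):
    """Best-of-`repeats` decoding throughput for `generate_fn`.

    On a GPU, make sure the device has finished its work before reading
    the clock (for PyTorch, call torch.cuda.synchronize() inside
    generate_fn), otherwise the timing only covers kernel launches.
    """
    for _ in range(warmup):
        generate_fn(n_tokens)          # warm caches and compile kernels
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(n_tokens)
        best = min(best, time.perf_counter() - start)
    return n_tokens / best


def fake_generate(n_tokens):
    """Stand-in workload; replace with a real decoding loop."""
    total = 0
    for i in range(n_tokens * 1000):
        total += i
    return total


rate = tokens_per_second(fake_generate, n_tokens=64)
print(f"{rate:.1f} tokens/sec")
```

Running the same harness at 0%, 25%, and 40% sparsity, alongside a task-quality metric, gives the throughput versus quality trade-off for a given workload.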

Optimization Features

Token Efficiency

reduces weight transfer per token in single-batch decoding

Infra Optimization

reduces GPU register/global memory transfers; benefits scale with device memory bandwidth

Model Optimization

magnitude-based activation pruning (channel-wise)

block-wise greedy per-layer sparsity allocation
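One way to picture a greedy per-layer allocation inside a single block is sketched below. This is an illustrative reconstruction, not the paper's exact procedure: the step size, the per-size cost normalization, and the toy `error_fn` (a quadratic proxy standing in for errors measured from offline activation statistics) are all assumptions.

```python
def greedy_allocate(sizes, error_fn, target, step=0.05, max_sparsity=1.0):
    """Raise per-layer sparsity in small steps until the size-weighted
    average reaches `target`, always choosing the layer whose next step
    adds the least error per unit of weight it sparsifies."""
    n = len(sizes)
    total = sum(sizes)
    levels = [0] * n                      # integer step counts per layer
    max_level = round(max_sparsity / step)

    def average():
        return sum(sz * lv * step for sz, lv in zip(sizes, levels)) / total

    while average() < target - 1e-9:
        best, best_cost = None, float("inf")
        for i in range(n):
            if levels[i] >= max_level:
                continue
            now = error_fn(i, levels[i] * step)
            nxt = error_fn(i, (levels[i] + 1) * step)
            cost = (nxt - now) / sizes[i]
            if cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            break                         # every layer is at its cap
        levels[best] += 1
    return [lv * step for lv in levels]


# Toy setup: four equally sized layers with assumed sensitivities.
sizes = [1.0, 1.0, 1.0, 1.0]
sensitivity = [1.0, 2.0, 1.0, 0.5]


def error_fn(layer, sparsity):
    """Hypothetical error proxy; real values would be measured offline."""
    return sensitivity[layer] * sparsity ** 2


alloc = greedy_allocate(sizes, error_fn, target=0.4)
avg = sum(s * a for s, a in zip(sizes, alloc)) / sum(sizes)
```

In this toy run the most sensitive layer ends up with the lowest sparsity and the least sensitive layer with the highest, while the block average lands on the 40% target.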

System Optimization

PTX eviction hints to keep activations in L2 cache

mask fusion to avoid extra memory writes

Training Optimization

LoRA

Inference Optimization

Triton sparse GEMV kernel with mask fusion

column-major storage for input sparsity; row-major for output sparsity

SplitK decomposition and FP16 outer accumulation
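The SplitK idea can be sketched on the CPU: the reduction dimension (the columns of the weight matrix) is cut into chunks, each chunk produces a partial output vector, and the partials are summed at the end. The sketch below is plain Python, not the Triton kernel; `splitk_gemv` and `num_splits` are illustrative names, and on a GPU each chunk would typically be handled by a separate program instance.

```python
def splitk_gemv(W_cols, x, num_splits):
    """y = W @ x with the reduction (K) dimension split into chunks.

    W_cols is W stored column-major; columns whose activation is zero
    are skipped inside every chunk.
    """
    k = len(x)
    n_rows = len(W_cols[0])
    chunk = (k + num_splits - 1) // num_splits   # ceil(k / num_splits)
    partials = []
    for s in range(num_splits):
        acc = [0.0] * n_rows                     # per-split accumulator
        for j in range(s * chunk, min((s + 1) * chunk, k)):
            if x[j] == 0.0:
                continue                         # masked-out column
            col = W_cols[j]
            for i in range(n_rows):
                acc[i] += col[i] * x[j]
        partials.append(acc)
    # Outer accumulation: combine the per-split partial results.
    return [sum(p[i] for p in partials) for i in range(n_rows)]


W = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]]
W_cols = [list(c) for c in zip(*W)]
x = [1.0, 0.0, 2.0, 0.0, 0.0, -1.0]

y = splitk_gemv(W_cols, x, num_splits=3)
y_ref = [sum(w * v for w, v in zip(row, x)) for row in W]
```

Splitting the reduction gives more independent work items when the output dimension alone is too small to keep the device busy, at the cost of a final combine step.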

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FasterDecoding/TEAL

Data URLs

C4 (public)

WikiText (public)

Risks & Boundaries

Limitations

Designed for single-batch/low-batch decoding; benefits fall for large batch sizes.

Sparsifying prefill tokens can break early-token attention (attention sinks); avoid sparsifying initial tokens.
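A simple way to respect the attention-sink caveat is to leave the first few positions dense and threshold only the rest. The sketch below assumes a per-position list of hidden vectors; `sparsify_sequence` and `n_dense_prefix` are hypothetical names, and the right prefix length is something to tune, not a value taken from the paper.

```python
def sparsify_sequence(hidden, t, n_dense_prefix=4):
    """Threshold activations position by position, keeping the first
    `n_dense_prefix` positions untouched."""
    out = []
    for pos, vec in enumerate(hidden):
        if pos < n_dense_prefix:
            out.append(list(vec))    # initial tokens stay dense
        else:
            out.append([v if abs(v) >= t else 0.0 for v in vec])
    return out


hidden = [[0.1, -2.0],
          [0.05, 0.3],
          [0.2, -0.9],
          [0.01, 1.5]]
result = sparsify_sequence(hidden, t=0.25, n_dense_prefix=2)
```

Here the first two positions pass through unchanged, while small-magnitude entries at later positions are zeroed.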

When Not To Use

High-batch, throughput-first servers where weight sparsity/2:4 methods excel.

Workloads that require sparsifying the LM head (not yet supported) or that cannot accept any perplexity increase.

Failure Modes

Sharp quality drop at very high sparsity (≥65%) for many models.

Compound errors when combined with aggressive quantization; validate the two jointly.

Core Entities

Models

Llama-2

Llama-3

Mistral

Metrics

perplexity

tokens per second

accuracy

Datasets

WikiText

EleutherAI LM Harness

C4

Benchmarks

perplexity

downstream aggregate (MMLU, ARC, HellaSwag, GSM8K, PiQA, Winogrande)

tokens/sec (decoding throughput)

