Combine sparse neuron activity with weight pruning to cut RNN inference work up to ~20× while keeping language-model quality

November 13, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

1

Authors

Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney

Links

Abstract / PDF

Why It Matters For Business

If you can deploy on event-driven or neuromorphic hardware, combining sparse activations with weight pruning can cut inference work dramatically without large quality loss, lowering energy and latency for low-power or real-time apps.

Summary TLDR

The paper adapts an event-driven GRU (EGRU) that produces sparse activations and shows that combining that activity sparsity with standard unstructured weight pruning multiplies efficiency gains. On Penn Treebank they report up to ~20× lower multiply-accumulate (MAC) work with test perplexity still under 60. Activity sparsity is controllable via weight decay. The approach is compelling for event-driven neuromorphic hardware but hard to accelerate on today’s GPUs because the sparsity is unstructured and dynamic.

Problem Statement

Neural networks are costly to run, especially for single-sample (batch=1) inference where weight fetches dominate energy and latency. Prior work focused mostly on pruning weights or quantizing them. Dynamic sparse neuron activations (activity sparsity) are less used but could reduce memory fetches and arithmetic if combined with weight pruning. The interaction and practical gains of combining both sparsities for RNN inference are unclear.

Main Contribution

Show that activity sparsity (sparse neuron outputs) combines multiplicatively with unstructured weight sparsity, reducing the required MACs by approximately a factor of λ_a × λ_w (activity × weight).

Use an event-driven GRU (EGRU) that thresholds cell states to produce sparse activations and tune its sparsity via weight decay.

Demonstrate up to ≈20× theoretical reduction in MACs on Penn Treebank while keeping test perplexity under 60 at evaluated settings.
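To make the thresholding idea in the contributions above concrete, here is a minimal sketch in plain Python. It assumes a unit passes its cell-state value on only when that value exceeds a threshold and emits zero otherwise. The function names, the threshold of 0.5, and the sample values are illustrative assumptions, and the full EGRU update rule is not reproduced here.

```python
def threshold_events(cell_state, threshold):
    """Emit a unit's value only if it exceeds the threshold, else emit 0.0.

    This mimics a Heaviside gate: output = value * H(value - threshold).
    """
    return [c if c > threshold else 0.0 for c in cell_state]


def activity_density(outputs):
    """Fraction of units whose output is non-zero (1.0 means fully dense)."""
    return sum(1 for y in outputs if y != 0.0) / len(outputs)


# Hypothetical cell-state values for a layer with 8 units.
cell_state = [0.9, -0.2, 0.05, 1.4, 0.3, -0.7, 0.0, 0.6]
events = threshold_events(cell_state, threshold=0.5)

print(events)                    # only units above the threshold are non-zero
print(activity_density(events))  # fraction of units that emitted an event
```

In this toy setting, a downstream layer only needs to fetch the weight columns that belong to the non-zero entries of `events`, which is where the activity-driven savings would come from.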

Key Findings

Activity sparsity and weight sparsity multiply to reduce operations.

Numbers: Effective operations scale ≈ λ_a × λ_w (analytic relation)
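The multiplicative relation can be checked on a toy matrix-vector product. The sketch below is an illustration under stated assumptions: it treats the two factors as densities (the fraction of non-zero inputs and the fraction of non-zero weights), assumes zeros are spread independently, and uses hypothetical helper names. The 0.15 weight density matches the 85% weight sparsity reported here, while the 1/3 activity density is an assumed value, not a reported one.

```python
def count_macs(weights, inputs):
    """Count the MACs a matrix-vector product needs when every
    multiplication with a zero weight or a zero input is skipped."""
    return sum(
        1
        for row in weights
        for w, x in zip(row, inputs)
        if w != 0.0 and x != 0.0
    )


def expected_sparse_macs(n_in, n_out, activity_density, weight_density):
    """Expected MACs if zeros in inputs and weights are independent."""
    return n_in * n_out * activity_density * weight_density


def reduction_factor(activity_density, weight_density):
    """How many times fewer MACs than the dense product."""
    return 1.0 / (activity_density * weight_density)


# Toy 4x4 layer: half of the weights and half of the inputs are zero.
weights = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
inputs = [1.0, 0.0, 2.0, 0.0]

print(count_macs(weights, inputs))              # MACs actually needed
print(expected_sparse_macs(4, 4, 0.5, 0.5))     # analytic estimate
print(reduction_factor(1.0 / 3.0, 0.15))        # assumed densities
```

With the assumed densities, the last line comes out at roughly 20, which is in line with the order of magnitude quoted in this summary. The real figure depends on how activity varies across layers and time steps.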

Up to ~20× reduction in theoretical MACs on Penn Treebank with small perplexity loss.

Numbers: 20× MAC reduction; test PPL ≈ 58.9 at 85% weight sparsity (EGRU)

Weight pruning up to 85% compresses EGRU with small task loss on PTB.

Numbers: EGRU test PPL 57.06 → 58.88 at 85% weight sparsity (mean)

Weight decay controls activity sparsity.

Numbers: Validation performance optimum near weight decay ≈ 0.14; weight decay shifts weight distributions toward zero and alters

Results

MAC reduction (theoretical)

Value: ≈20× reduction vs dense LSTM baseline on Penn Treebank

Baseline: Merity LSTM dense: 20.2M MAC

Test perplexity (EGRU)

Value: ≈58.9 (mean) at 85% weight sparsity

Baseline: EGRU dense test PPL 57.06

Test perplexity (LSTM baseline)

Value: ≈57.1 (the authors' LSTM reimplementation) and 57.3 (reported by Merity et al.)

Baseline: Merity et al. reported 57.3 with 20.2M MAC

Sensitivity to pruning (WikiText-2)

Value: Performance degrades earlier on WikiText-2 than on PTB; e.g., EGRU test PPL ≈ 70.85 at 85% sparsity

Baseline: EGRU dense PPL ≈ 67.21

Who Should Care

What To Try In 7 Days

Re-implement an RNN (GRU/LSTM) baseline and measure MACs as a budget metric.

Train an event-driven GRU (EGRU) or add a Heaviside-thresholded activation to one layer to observe activity sparsity.

Apply global magnitude pruning iteratively (train→prune→fine-tune) to weights except embeddings and track perplexity vs MACs.
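The pruning step in the list above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the layer names, the dict-of-lists model representation, and the sparsity schedule are hypothetical, and the fine-tuning step is left as a comment because it depends on the training setup.

```python
def global_magnitude_prune(layers, sparsity, skip=("embedding",)):
    """Zero the smallest-magnitude weights across all non-skipped layers.

    layers:   dict mapping a layer name to a flat list of weights
    sparsity: target fraction of prunable weights to set to zero
    skip:     layer names left untouched (embeddings, as in the text)
    Ties at the cutoff are all pruned, so the result can slightly
    exceed the target when several weights share the cutoff magnitude.
    """
    magnitudes = sorted(
        abs(w) for name, ws in layers.items() if name not in skip for w in ws
    )
    n_prune = int(sparsity * len(magnitudes))
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    cutoff = magnitudes[n_prune - 1]  # one cutoff shared by all layers
    return {
        name: list(ws) if name in skip
        else [0.0 if abs(w) <= cutoff else w for w in ws]
        for name, ws in layers.items()
    }


def weight_sparsity(layers, skip=("embedding",)):
    """Fraction of zero weights among the prunable layers."""
    ws = [w for name, vals in layers.items() if name not in skip for w in vals]
    return sum(1 for w in ws if w == 0.0) / len(ws)


# Hypothetical toy model with three named weight groups.
layers = {
    "embedding": [0.01, -0.02, 0.03],
    "recurrent": [0.5, -0.05, 0.2, -0.9],
    "output": [0.1, -0.3, 0.02, 0.7],
}

one_shot = global_magnitude_prune(layers, sparsity=0.5)

# Iterative variant: raise the target step by step.
model = layers
for target in (0.25, 0.5, 0.75):
    model = global_magnitude_prune(model, target)
    # Fine-tuning of the surviving weights would go here.

print(one_shot["recurrent"], one_shot["output"])
print(weight_sparsity(model))
```

Because the cutoff is computed over all prunable layers together, layers with many small weights end up sparser than others, which is the difference between global and per-layer magnitude pruning.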

Optimization Features

Infra Optimization

  • Not GPU-friendly due to irregular sparsity

Model Optimization

  • Pruning
  • Activity sparsity

System Optimization

  • Target neuromorphic accelerators (Loihi, SpiNNaker2)

Training Optimization

  • Iterative magnitude pruning (train→prune→fine-tune)
  • Weight decay tuning

Inference Optimization

  • MAC counting for efficiency
  • Leverage event-driven execution
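As a starting point for MAC counting, the sketch below estimates the dense per-time-step cost of one recurrent layer. It relies on the standard gate counts (three weight-matrix pairs for a GRU, four for an LSTM) and ignores biases and element-wise operations. The layer sizes are hypothetical and are not taken from the paper.

```python
def rnn_macs_per_step(n_in, n_hidden, n_gates):
    """Dense MACs for one time step of one recurrent layer.

    Each gate multiplies the input by an (n_hidden x n_in) matrix and the
    previous hidden state by an (n_hidden x n_hidden) matrix.
    """
    return n_gates * (n_in * n_hidden + n_hidden * n_hidden)


def gru_macs_per_step(n_in, n_hidden):
    """GRU: update gate, reset gate, candidate state."""
    return rnn_macs_per_step(n_in, n_hidden, n_gates=3)


def lstm_macs_per_step(n_in, n_hidden):
    """LSTM: input, forget, and output gates plus the cell candidate."""
    return rnn_macs_per_step(n_in, n_hidden, n_gates=4)


# Hypothetical layer sizes, chosen only for illustration.
n_in, n_hidden = 256, 512
print(gru_macs_per_step(n_in, n_hidden))
print(lstm_macs_per_step(n_in, n_hidden))
```

Summing this figure over layers gives a dense budget against which a sparse estimate can be compared.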

Reproducibility

Data Urls

  • Penn Treebank (standard)
  • WikiText-2 (standard)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Sparsity is unstructured and dynamic; mainstream GPUs cannot realize the theoretical MAC savings easily
  • EGRU requires larger word embeddings, which reduces net MAC savings (effective activity savings limited to ≈3× in parts)
  • Reported gains are on small language datasets (Penn Treebank, WikiText-2); generalization to large models not shown

When Not To Use

  • If you must run on standard GPU servers without sparse-dynamic support
  • When embeddings dominate compute or memory and cannot be reduced
  • When deterministic, regular memory access patterns are required for latency or compiler optimizations

Failure Modes

  • Quality drops quickly when weight sparsity exceeds ~85% or when combined with high activity loss on larger datasets
  • Dynamic activation sparsity can make memory fetch scheduling unpredictable, hurting latency on non-event hardware
  • Imbalanced layer activities (final layer high activity) can reduce end-to-end savings

Core Entities

Models

  • Event-based GRU (EGRU)
  • LSTM baseline

Metrics

  • perplexity
  • multiply-accumulate operations (MACs)

Datasets

  • Penn Treebank
  • WikiText-2

Context Entities

Models

  • Spiking neural networks (SNNs)
  • GRU
  • AWD-LSTM (reference)

Metrics

  • perplexity (language modeling)

Datasets

  • Penn Treebank (reference)
  • WikiText-2 (reference)