Halve embedding size and prune heads to cut transformer memory and latency for edge devices

December 25, 2024 · 5 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Krisvarish V, Priyadarshini T, K P Abhishek Sri Saai, Vaidehi Vijayakumar

Links

Abstract / PDF

Why It Matters For Business

Cutting model size and latency lets teams run transformers on phones and edge devices, reducing server costs and improving responsiveness.

Summary TLDR

This paper presents a lightweight transformer variant that halves the embedding dimensionality, reduces the number of attention heads, and applies pruning and quantization. On a small CPU-based prototype, the authors report ~52% lower memory (1,122,304 → 536,576 bytes), ~34% lower execution time (0.024081 s → 0.015955 s), and ~52% fewer parameters (140,288 → 67,072) than a baseline transformer. The claim of preserved accuracy is qualitative only; no task accuracy is reported. The implementation and tests use NumPy, short sequences (maximum length 10), and no GPU.

Problem Statement

Standard transformers use significant memory and compute, blocking deployment on mobile and edge devices. The paper aims to cut the model footprint and runtime while keeping performance close to the original.

Main Contribution

Proposes a transformer variant that halves embedding dimension and reduces attention heads to lower resource needs.

Applies parameter pruning and quantization as complementary techniques to shrink memory and compute.

Provides a CPU-based NumPy prototype and compares its memory, execution time, and parameter count against the original transformer. MobileBERT and DistilBERT are mentioned but not benchmarked quantitatively.

Key Findings

Memory footprint roughly halved versus the original transformer.

Numbers: 1,122,304 → 536,576 bytes (−52%)

Execution time decreased by about one-third on the CPU prototype.

Numbers: 0.024081 s → 0.015955 s (−34%)

Parameter count reduced by ~52% through embedding and head reduction.

Numbers: 140,288 → 67,072 params (−52%)

Authors claim performance remains close to standard transformers but provide no task accuracy numbers.

Results

Memory Usage

Value: 536,576 bytes

Baseline: 1,122,304 bytes

Execution Time

Value: 0.015955 s

Baseline: 0.024081 s

Parameter Count

Value: 67,072

Baseline: 140,288
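
The percentages quoted for these results can be re-derived from the raw figures. The short check below uses only the numbers reported above; the variable and function names are illustrative. It also divides memory by parameter count, which comes out to exactly 8 bytes per parameter for both models. That is consistent with float64 storage (NumPy's default floating-point dtype), although the dtype is not stated here, so treat that reading as an inference.

```python
# Figures reported in the Results section.
baseline = {"memory_bytes": 1_122_304, "time_s": 0.024081, "params": 140_288}
efficient = {"memory_bytes": 536_576, "time_s": 0.015955, "params": 67_072}

def reduction_pct(before, after):
    """Relative reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before

reductions = {key: reduction_pct(baseline[key], efficient[key]) for key in baseline}
for name, pct in reductions.items():
    print(f"{name}: -{pct:.1f}%")

# Bytes per parameter implied by the reported memory and parameter counts.
bytes_per_param_baseline = baseline["memory_bytes"] / baseline["params"]
bytes_per_param_efficient = efficient["memory_bytes"] / efficient["params"]
print(bytes_per_param_baseline, bytes_per_param_efficient)
```

If memory is simply parameter count times 8 bytes, the memory and parameter reductions are the same measurement reported twice, not two independent findings.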

What To Try In 7 Days

Prototype a halved-embedding transformer in NumPy on a CPU and check whether the reported memory and latency gains reproduce.

Reduce attention heads (e.g., 8→4) and measure latency and memory trade-offs on your tasks.

Add simple weight quantization and compare inference latency and accuracy on target data.
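
As a starting point for the first two experiments, here is a minimal NumPy sketch of a single multi-head self-attention layer with a configurable embedding size and head count. It is not the authors' implementation: the dimensions (64 vs. 32), the bias-free projections, and the helper names `init_attention`, `multi_head_attention`, and `measure` are all assumptions chosen for illustration. Only the sequence length of 10 comes from the paper's setup.

```python
import time
import numpy as np

def init_attention(d_model, rng):
    """Q, K, V and output projections (no biases) for one attention layer."""
    scale = 1.0 / np.sqrt(d_model)
    return {name: rng.standard_normal((d_model, d_model)) * scale
            for name in ("Wq", "Wk", "Wv", "Wo")}

def multi_head_attention(x, params, n_heads):
    """Scaled dot-product self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split(x @ params[w]) for w in ("Wq", "Wk", "Wv"))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    merged = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ params["Wo"]

def measure(d_model, n_heads, seq_len=10, repeats=200, seed=0):
    """Parameter count, weight bytes, and mean wall-clock time per forward pass."""
    rng = np.random.default_rng(seed)
    params = init_attention(d_model, rng)
    x = rng.standard_normal((seq_len, d_model))
    start = time.perf_counter()
    for _ in range(repeats):
        out = multi_head_attention(x, params, n_heads)
    elapsed = (time.perf_counter() - start) / repeats
    return {
        "shape": out.shape,
        "params": sum(w.size for w in params.values()),
        "bytes": sum(w.nbytes for w in params.values()),
        "sec_per_call": elapsed,
    }

baseline = measure(d_model=64, n_heads=8)  # hypothetical baseline size
reduced = measure(d_model=32, n_heads=4)   # halved embedding, halved heads
print(baseline)
print(reduced)
```

Two caveats when reading the output. In this standard formulation each head has width d_model / n_heads, so changing the head count alone leaves the projection parameter count unchanged; the savings in this layer come from the embedding dimension. Halving d_model also quarters these projection matrices, so an isolated attention layer should not be expected to match the paper's whole-model figure of ~52%. Timings at this scale are noisy, so repeat the measurement several times before drawing conclusions.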

Optimization Features

Infra Optimization

  • designed to run without GPU

Model Optimization

  • embedding dimension reduction
  • attention head reduction
  • parameter pruning
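
The pruning method is not specified here. One common choice is unstructured magnitude pruning, sketched below as an assumption and not as the paper's technique: the weights with the smallest absolute values are set to zero. The function name `magnitude_prune` and the 50% sparsity level are illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of `weights`.

    Returns (pruned_copy, keep_mask); keep_mask is True where a weight survives.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of entries to zero
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    flat = np.abs(weights).ravel()
    drop = np.argpartition(flat, k - 1)[:k]  # indices of the k smallest magnitudes
    mask = np.ones(weights.size, dtype=bool)
    mask[drop] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))  # stand-in for one weight matrix
pruned, mask = magnitude_prune(w, sparsity=0.5)
print("zeroed:", int((~mask).sum()), "of", w.size)
print("dense bytes before/after:", w.nbytes, pruned.nbytes)
```

The dense array occupies the same number of bytes after pruning. Zeroed weights only reduce memory or compute once they are stored in a sparse format or whole rows, columns, or heads are removed, so measure the footprint after that step.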

System Optimization

  • reduced memory footprint for edge devices

Inference Optimization

  • weight quantization
  • smaller model for faster CPU inference
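
The quantization scheme is likewise not detailed here. As a hedged illustration, the sketch below applies symmetric per-tensor int8 quantization to a float64 weight matrix in NumPy. The scheme, the matrix size, and the helper names `quantize_int8` and `dequantize` are assumptions, not the authors' method.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float weights -> (int8 values, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float64 weights."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))  # float64 stand-in for one weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = float(np.max(np.abs(w - w_hat)))
print("bytes:", w.nbytes, "->", q.nbytes)  # 8 bytes per weight -> 1 byte per weight
print("max abs error:", max_err, "bound:", scale / 2)
```

Storage drops by 8x relative to float64, and the per-weight rounding error is bounded by half the scale. Whether latency improves depends on the runtime: if weights are dequantized to float before every matrix multiply, as plain NumPy would require, the saving is in memory and not necessarily in speed. Accuracy on target data still has to be measured separately.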

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • No task accuracy or standard benchmark numbers provided to verify preserved performance.
  • Experiments run with max sequence length 10 and on CPU only, limiting generality for real applications.
  • Comparisons to MobileBERT/DistilBERT lack quantitative benchmark results in the paper.

When Not To Use

  • When you need verified state-of-the-art accuracy on standard benchmarks.
  • For long-context tasks (sequence lengths >> 10) without re-evaluation.
  • When your production pipeline is GPU-optimized and built around larger pre-trained models.

Failure Modes

  • Unmeasured accuracy drop on real-world datasets after embedding reduction.
  • Pretrained weights may not transfer to the halved-embedding architecture.
  • Performance gains may vanish for longer sequences or larger vocabularies.

Core Entities

Models

  • Original Transformer
  • Resource-Efficient Transformer
  • MobileBERT
  • DistilBERT

Metrics

  • memory usage
  • execution time
  • parameter count