ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

February 7, 2023 · 9 min

Overview

Production Readiness

0.8

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

7

Authors

Eldar Kurtic, Elias Frantar, Dan Alistarh

Links

Abstract / PDF

Why It Matters For Business

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Summary TLDR

ZipLM is a practical structured-pruning method for Transformer language models. It decides which attention heads, feed‑forward columns, or entire modules to remove by weighing the accuracy loss against runtime measured on the target device. It builds a per-layer latency lookup table, prunes one structure at a time using second-order (Hessian) information, and adds a simple token-level distillation loss. Results: a better accuracy-vs-speedup trade-off than CoFi and TinyBERT on BERT tasks, parity with MobileBERT when pruning BERTlarge, and a compressed GPT2 reported as up to 60% smaller and ~30% faster than DistilGPT2 on the evaluated benchmarks. ZipLM also produces a family of models for multiple target speedups in one run and keeps the deviation between measured and target speedup within 5.28%.

Problem Statement

Large Transformer models deliver strong accuracy at a high inference cost. Structured pruning (removing heads, columns, or whole modules) is attractive because it yields real speedups, but it is fragile: prior methods need expensive retraining, ignore runtime differences across hardware, or require manual layer mapping for distillation. The goal is a pruning method that is accurate, fast to run, works post‑training or gradually, and guarantees real runtime speedups in a target environment.

Main Contribution

Inference‑aware structured pruning algorithm that jointly uses weight magnitude, activation influence, and redundancy (Hessian) and prunes one structure at a time to respect correlations

Per‑layer latency table and search to meet explicit speedup/latency targets in a given inference environment

Layer‑wise token distillation (match token vectors) that avoids manual layer matching and helps low-data tasks

Applies to both encoder (BERT) and decoder (GPT2) models, in both post‑training (one‑shot) and gradual pruning settings

Practical pipeline that yields a family of compressed models (multiple speedups) in one run and is compatible with unstructured pruning and quantization
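The "one structure at a time, using Hessian information" idea from the first contribution can be illustrated with a classic Optimal Brain Surgeon step on a single linear layer. This is a minimal sketch, not the authors' implementation: it assumes a squared-error layer objective with Hessian H = X Xᵀ built from calibration inputs, prunes single input columns rather than whole heads, and ignores latency. The function name `prune_one_column` and all variable names are hypothetical.

```python
import numpy as np

def prune_one_column(W, Hinv, active):
    """Remove the cheapest still-active input column of W, OBS-style.

    W      : (d_out, d_in) weight matrix, modified in place
    Hinv   : (d_in, d_in) inverse of the layer Hessian, modified in place
    active : boolean mask of columns that have not been pruned yet
    Returns the index of the pruned column and its saliency score.
    """
    diag = np.diag(Hinv)
    # Saliency of column j: sum_i W[i, j]^2 / Hinv[j, j].
    scores = np.full(W.shape[1], np.inf)
    scores[active] = (W[:, active] ** 2).sum(axis=0) / diag[active]
    j = int(np.argmin(scores))
    score = float(scores[j])

    # Update the remaining weights to compensate for removing column j.
    W -= np.outer(W[:, j], Hinv[j, :]) / Hinv[j, j]
    W[:, j] = 0.0  # already zero up to rounding; make it exact

    # Eliminate column j from the inverse Hessian so the next step
    # sees the correlations among the columns that are left.
    Hinv -= np.outer(Hinv[:, j], Hinv[j, :]) / Hinv[j, j]
    Hinv[j, :] = 0.0
    Hinv[:, j] = 0.0
    active[j] = False
    return j, score

# Toy usage with random data (assumed shapes, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 256))          # calibration inputs: (d_in, n_samples)
W = rng.normal(size=(4, 8))            # layer weights: (d_out, d_in)
H = X @ X.T + 1e-6 * np.eye(8)         # small damping keeps H invertible
Hinv = np.linalg.inv(H)
active = np.ones(8, dtype=bool)

W_dense = W.copy()
j, score = prune_one_column(W, Hinv, active)
err = np.linalg.norm((W - W_dense) @ X) ** 2   # change in layer output
```

In this setting the saliency score equals the squared change in the layer output after the compensating update, which is why pruning the lowest-scoring structure first and then refreshing the inverse Hessian respects correlations between structures.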

Key Findings

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at the same speedup (SQuAD dev)

ZipLM reaches the industry-standard 99% accuracy-recovery threshold at larger speedups than prior methods.

Numbers: BERTbase at 99% recovery: 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP)

ZipLM generates a full family of compressed models far more cheaply than CoFi.

Numbers: 115 vs 560 epochs to produce models for 2x–15x (4.87× fewer epochs)

ZipGPT2 can be much smaller and faster while improving quality vs DistilGPT2.

Numbers: DistilGPT2 → ZipGPT2: decoder 42.5M → 26.5M parameters (reported as ≈60% smaller; DistilGPT2's decoder is ≈1.6× the size) and speedup 1.6x → 2.1x (~30% faster), with lower perplexity

ZipLM achieves reliable speedup targets in practice.

Numbers: target vs achieved speedup deviation ≤ 5.28% on measured devices

ZipLM is robust to small calibration sets.

Numbers: outperforms prior post‑training pruning with as few as 32 calibration samples; only small gains from increasing to 4096 samples

ZipLM improves CPU compound compression pipelines.

Numbers: compound pipeline speedup improved from 3x to 13x at full recovery, and from 30x to 50x at the largest compression
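The "99% recovery" findings above amount to a simple selection rule over a model family: keep the fastest model whose score is still at least 99% of the dense model's. The sketch below shows that rule only. The function name is hypothetical and the numbers in the usage example are made up for illustration; they are not results from the paper.

```python
def max_speedup_at_recovery(dense_score, candidates, recovery=0.99):
    """Return the largest speedup whose score is >= recovery * dense_score.

    candidates: iterable of (speedup, score) pairs.
    Returns None if no candidate meets the threshold.
    """
    threshold = recovery * dense_score
    ok = [speedup for speedup, score in candidates if score >= threshold]
    return max(ok) if ok else None

# Made-up (speedup, F1) pairs for illustration only, not from the paper.
family = [(2.0, 88.1), (4.0, 87.6), (6.0, 86.9), (8.0, 85.2)]
best = max_speedup_at_recovery(88.0, family)
```

Because one ZipLM run yields the whole family, this selection can be repeated for any recovery level without re-pruning.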

Results

SQuAD F1 vs CoFi/TinyBERT

Value: +3 F1 points at the same speedup (ZipLM > CoFi/TinyBERT)

Baseline: CoFi / TinyBERT

Speedup at 99% accuracy recovery

Value: 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP)

Baseline: dense BERTbase

Epochs to produce 2x–15x family (efficiency)

Value: 115 epochs total

Baseline: CoFi, 560 epochs

ZipGPT2 size and speed vs DistilGPT2

Value: decoder 42.5M → 26.5M parameters (reported as ≈60% smaller; DistilGPT2's decoder is ≈1.6× the size); speedup 1.6x → 2.1x (~30% faster); lower perplexity

Baseline: DistilGPT2

Target vs achieved speedup deviation

Value: ≤ 5.28% deviation

Baseline: target speedup

CPU compound compression improvement

Value: 3x → 13x at full recovery; 30x → 50x at max compression

Baseline: layer-dropping structured stage in prior pipeline
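The relative figures in the results can be re-derived from the raw numbers quoted in this summary. The short check below uses only those numbers; the variable names are illustrative. It also shows two ways of reading the size comparison: the ratio of DistilGPT2's decoder to ZipGPT2's, and the fraction of decoder parameters removed.

```python
# All inputs are figures quoted in this summary.
cofi_epochs, ziplm_epochs = 560, 115
distil_params, zip_params = 42.5e6, 26.5e6     # decoder parameters
distil_speedup, zip_speedup = 1.6, 2.1         # speedup over dense GPT2

epoch_ratio = cofi_epochs / ziplm_epochs            # times fewer epochs
size_ratio = distil_params / zip_params             # DistilGPT2 relative to ZipGPT2
size_reduction = 1 - zip_params / distil_params     # fraction of parameters removed
speed_gain = zip_speedup / distil_speedup - 1       # relative speed improvement

print(f"epochs: {epoch_ratio:.2f}x fewer")
print(f"decoder size ratio: {size_ratio:.2f}x")
print(f"decoder parameters removed: {size_reduction:.1%}")
print(f"speed gain over DistilGPT2: {speed_gain:.1%}")
```

The "60%" figure lines up with the size ratio (DistilGPT2's decoder is about 1.6 times ZipGPT2's), while the fraction of parameters removed is closer to 38%, so it is worth stating which reading is meant when quoting the number.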

Who Should Care

What To Try In 7 Days

Run ZipLM on a production checkpoint with a short calibration set and your target latency table to get 2x–5x models quickly

Generate a per-layer latency table (few runs) on your target device and use ZipLM to meet hard latency or throughput SLAs

For edge builds, chain ZipLM + unstructured pruning + INT8 quantization and measure end-to-end CPU latency
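Building the per-layer latency table mentioned above mostly means timing each candidate layer shape on the target device. The sketch below times a feed-forward block at several intermediate sizes using NumPy as a stand-in; a real table would be measured with the actual inference engine, batch size, and sequence length. The function names, the ReLU activation, and the default sizes are assumptions for illustration.

```python
import time
import numpy as np

def time_ffn(hidden, inter, tokens=128, repeats=20):
    """Median wall-clock seconds for one FFN forward pass at a given intermediate size."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal((tokens, hidden), dtype=np.float32)
    w1 = rng.standard_normal((hidden, inter), dtype=np.float32)
    w2 = rng.standard_normal((inter, hidden), dtype=np.float32)
    np.maximum(x @ w1, 0.0) @ w2  # warm-up pass, not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        np.maximum(x @ w1, 0.0) @ w2  # ReLU used as a simple stand-in activation
        samples.append(time.perf_counter() - start)
    return float(np.median(samples))

def build_ffn_latency_table(hidden=768, sizes=(3072, 1536, 768, 384)):
    """Map each candidate intermediate size to its measured latency in seconds."""
    return {inter: time_ffn(hidden, inter) for inter in sizes}

table = build_ffn_latency_table()
for inter, seconds in table.items():
    print(f"intermediate={inter:5d}  latency={seconds * 1e3:.3f} ms")
```

Using the median over several repeats keeps one-off scheduling noise out of the table. The same loop can be repeated for attention blocks with different numbers of remaining heads.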

Optimization Features

Infra Optimization

  • optimizes for device specifics (V100, A100, CPU)
  • reduces end-to-end GPU hours for families of models

Model Optimization

  • structured_pruning
  • attention_head_pruning
  • feedforward_intermediate_shrinking
  • module_removal (depth reduction)

System Optimization

  • compatible with GPU and CPU inference engines
  • works with DeepSparse for CPU deployment

Training Optimization

  • layerwise_token_distillation
  • post-training one-shot pruning
  • gradual pruning with in-between finetuning
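The layer-wise token distillation listed above can be sketched as a mean squared error between student and teacher token vectors, computed layer by layer over non-padding tokens. This is a minimal sketch under stated assumptions: the student and teacher expose the same number of layers with the same hidden size (so no layer mapping is needed), and the per-layer losses are averaged. The function name and the averaging choice are assumptions, not taken from the paper.

```python
import numpy as np

def token_distillation_loss(student_states, teacher_states, attention_mask):
    """Mean squared error between student and teacher token vectors, per layer.

    student_states, teacher_states : lists of (batch, seq, hidden) arrays, one per layer
    attention_mask                 : (batch, seq) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask.astype(bool)
    n_tokens = mask.sum()
    total = 0.0
    for s, t in zip(student_states, teacher_states):
        per_token = ((s - t) ** 2).mean(axis=-1)    # (batch, seq)
        total += per_token[mask].sum() / n_tokens   # average over non-padding tokens
    return total / len(student_states)

# Toy usage: two layers, batch of 2, sequence length 3, hidden size 4.
student = [np.zeros((2, 3, 4)), np.zeros((2, 3, 4))]
teacher = [np.ones((2, 3, 4)), np.ones((2, 3, 4))]
mask = np.array([[1, 1, 0], [1, 0, 0]])
loss = token_distillation_loss(student, teacher, mask)
```

Masking out padding keeps short sequences from diluting the signal, which matters most on the low-data tasks where this loss is said to help.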

Inference Optimization

  • inference-aware pruning (latency table)
  • speedup-targeted search
  • pruning that preserves measured latency/throughput
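The latency-table and speedup-targeted search items above can be made concrete with a toy example: estimate a configuration's speedup by summing per-layer latencies from the table, then pick the lowest-error configuration that meets the target. The exhaustive search here is a stand-in that only works for tiny option sets; it is not the search procedure used in the paper. All function names, latencies, and error values are hypothetical.

```python
from itertools import product

def estimated_speedup(config, table, dense_config):
    """Speedup predicted by summing per-layer latencies from the lookup table."""
    dense = sum(table[c] for c in dense_config)
    pruned = sum(table[c] for c in config)
    return dense / pruned

def cheapest_config_meeting_target(options, table, errors, dense_config, target):
    """Pick one option per layer: lowest total error with estimated speedup >= target."""
    best, best_err = None, float("inf")
    for config in product(*options):
        if estimated_speedup(config, table, dense_config) < target:
            continue
        err = sum(errors[layer][c] for layer, c in enumerate(config))
        if err < best_err:
            best, best_err = config, err
    return best

# Hypothetical latency table: FFN intermediate size -> milliseconds.
table = {3072: 1.0, 1536: 0.55, 768: 0.3}
# Hypothetical per-layer accuracy cost of shrinking to each size.
errors = [
    {3072: 0.0, 1536: 0.2, 768: 0.9},
    {3072: 0.0, 1536: 0.1, 768: 0.4},
]
options = [list(table), list(table)]
dense = (3072, 3072)

best = cheapest_config_meeting_target(options, table, errors, dense, target=1.5)
```

Because the table is measured on the deployment device, the same search run against a different table can return a different architecture for the same nominal target.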

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks are English-only; performance in low-resource languages is untested
  • Relies on calibration data and a correct latency table for the target device
  • Extremely aggressive compression may still need finetuning for final accuracy

When Not To Use

  • When absolute top accuracy matters and any small drop is unacceptable
  • If you cannot run short latency benchmarks on the target device (no latency table)
  • For non-Transformer model families not evaluated in the paper

Failure Modes

  • Speedup estimates may be off if the inference stack changes; observed deviations up to 5.28%
  • Very small calibration sets can reduce final accuracy unless token distillation is enabled
  • Pruning can produce degenerate architectures if hardware-specific runtimes are not measured

Core Entities

Models

  • BERTbase
  • BERTlarge
  • GPT2 (124M)
  • ZipBERT
  • ZipGPT2
  • CoFi
  • TinyBERT
  • DistilGPT2
  • MobileBERT
  • Optimal Brain Surgeon
  • SPDY

Metrics

  • F1
  • Accuracy
  • Perplexity
  • Speedup
  • Latency
  • Throughput

Datasets

  • SQuADv1.1
  • GLUE (subset: SST-2, QNLI, MNLI, QQP, CoLA, MRPC, STS-B, RTE)
  • OpenWebTextCorpus
  • WikiText

Benchmarks

  • SQuAD
  • GLUE
  • MLPerf