ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

February 7, 2023 · 9 min

Overview

Decision Snapshot: Ready For Pilot

Experiments cover encoder and decoder models, multiple GPUs and CPUs, and show consistent gains and runtime guarantees; code is public for reproduction.

Citations: 7

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 40%

Authors

Eldar Kurtic, Elias Frantar, Dan Alistarh

Links

Abstract / PDF / Code

Why It Matters For Business

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Who Should Care

Summary TLDR

ZipLM is a structured-pruning method for Transformer language models. It chooses which attention heads, feed-forward columns, or entire modules to remove based on both the accuracy loss and the runtime actually measured on the target device. It builds a per-layer latency lookup table, prunes one structure at a time using second-order (Hessian) information, and adds a simple token-level distillation loss. Results: a better accuracy-vs-speedup trade-off than CoFi and TinyBERT on BERT tasks; a pruned BERTlarge that matches MobileBERT; and a compressed GPT2 that is up to 60% smaller and ~30% faster than DistilGPT2 on the evaluated benchmarks. ZipLM also produces a family of models for multiple target speedups in one run and keeps measured speedups close to the targets (observed deviations up to 5.28%).
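The "one structure at a time, with second-order information" idea can be illustrated with the classic Optimal Brain Surgeon (OBS) update. What follows is a minimal sketch on individual weights of a single output unit, not the ZipLM implementation: ZipLM works on whole structures (heads, columns, modules) and also accounts for latency. The function name `obs_prune_one` and the example numbers are hypothetical.

```python
def obs_prune_one(w, h_inv, pruned):
    """Remove the remaining weight with the lowest OBS saliency.

    w      : list of floats, updated in place
    h_inv  : inverse Hessian (list of lists, symmetric), updated in place
    pruned : set of indices already removed, updated in place
    Returns the index removed in this step.
    """
    n = len(w)
    candidates = [i for i in range(n) if i not in pruned]
    # OBS saliency (up to a constant factor): estimated loss increase
    # when weight i is removed and the others are optimally adjusted.
    best = min(candidates, key=lambda i: w[i] ** 2 / h_inv[i][i])
    col = [h_inv[j][best] for j in range(n)]
    d = col[best]
    scale = w[best] / d
    # Adjust the remaining weights to compensate for the removal.
    for j in range(n):
        w[j] -= scale * col[j]
    w[best] = 0.0
    # Eliminate `best` from the inverse Hessian so the next step sees the
    # correlations among the weights that are still present.
    for a in range(n):
        for b in range(n):
            h_inv[a][b] -= col[a] * col[b] / d
    pruned.add(best)
    return best


# Toy example: weights 0 and 1 are correlated, weight 2 is independent.
# h_inv is the inverse of H = [[2, 1, 0], [1, 2, 0], [0, 0, 1]].
w = [1.0, 0.1, -2.0]
h_inv = [[2 / 3, -1 / 3, 0.0], [-1 / 3, 2 / 3, 0.0], [0.0, 0.0, 1.0]]
pruned = set()
first = obs_prune_one(w, h_inv, pruned)
```

In this toy case the small weight at index 1 is removed first, and its correlated neighbour at index 0 is nudged upward to compensate. Re-deriving the saliencies after every removal is what lets a one-at-a-time scheme respect correlations between structures.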

Problem Statement

Large Transformer models give strong accuracy but high inference cost. Structured pruning (remove heads/columns/modules) is attractive for real speedups but is fragile: prior methods either need expensive retraining, ignore runtime differences across hardware, or require manual distillation layer mapping. We need a pruning method that is accurate, fast to run, works post‑training or gradually, and guarantees real runtime speedups in a target environment.

Main Contribution

Inference‑aware structured pruning algorithm that jointly uses weight magnitude, activation influence, and redundancy (Hessian) and prunes one structure at a time to respect correlations

Per‑layer latency table and search to meet explicit speedup/latency targets in a given inference environment
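As a rough illustration of how a latency table can drive a speedup-targeted search, the sketch below sums per-layer table entries to estimate the speedup of a configuration, then picks a configuration that meets the target. Everything here is an assumption for illustration: the table values are invented, "fraction of structures kept" stands in for a real accuracy estimate, and the exhaustive search is only practical for tiny examples. It is not the search procedure used in the paper.

```python
from itertools import product

# Hypothetical per-layer latencies (milliseconds) on one target device.
# Keys are the number of attention heads / FFN columns kept; 0 means the
# whole module is removed.
ATTN_MS = {12: 1.00, 8: 0.70, 4: 0.40, 0: 0.00}
FFN_MS = {3072: 2.00, 1536: 1.10, 768: 0.60, 0: 0.00}


def estimated_latency(config):
    """config is a sequence of (heads_kept, ffn_cols_kept), one per layer."""
    return sum(ATTN_MS[h] + FFN_MS[f] for h, f in config)


def estimated_speedup(config, n_layers):
    dense = n_layers * (ATTN_MS[12] + FFN_MS[3072])
    pruned = estimated_latency(config)
    return float("inf") if pruned == 0 else dense / pruned


def search(n_layers, target_speedup):
    """Return the config that keeps the most structures while meeting the target."""
    per_layer = list(product(ATTN_MS, FFN_MS))
    best, best_kept = None, -1.0
    for config in product(per_layer, repeat=n_layers):
        if estimated_speedup(config, n_layers) < target_speedup:
            continue
        # Stand-in for accuracy: total fraction of structures kept.
        kept = sum(h / 12 + f / 3072 for h, f in config)
        if kept > best_kept:
            best, best_kept = config, kept
    return best


best = search(n_layers=2, target_speedup=2.0)
```

Because the estimate is read from measurements taken on the target device, the same search run against a different device's table can return a different configuration for the same speedup target.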

Key Findings

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at same speedup (SQuAD dev)

Practical Use: If you need a faster BERT at a fixed latency budget, try ZipLM first to get higher accuracy at the same runtime.

Evidence Ref: Section 4.1, Figure 2; Table 5

ZipLM reaches the industry-standard 99% accuracy recovery at larger speedups than prior methods.

Numbers: BERTbase 99% recovery: 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP)

Practical Use: For production thresholds like MLPerf's 99% accuracy, ZipLM often yields substantially faster models than alternatives.

Evidence Ref: Section 4.1 (99% recovery paragraph)
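A recovery threshold of this kind is simple to check in a deployment gate. The sketch below assumes "recovery" means the pruned model's metric divided by the dense model's metric; the scores are made-up placeholders, not results from the paper.

```python
def recovery(pruned_score, dense_score):
    """Fraction of the dense model's metric retained by the pruned model."""
    return pruned_score / dense_score


def meets_recovery(pruned_score, dense_score, threshold=0.99):
    return recovery(pruned_score, dense_score) >= threshold


# Hypothetical F1 scores, for illustration only.
dense_f1 = 88.0
ok = meets_recovery(87.3, dense_f1)       # 87.3 / 88.0 is about 0.992
too_low = meets_recovery(86.0, dense_f1)  # 86.0 / 88.0 is about 0.977
```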

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
SQuAD F1 vs CoFi/TinyBERT | +3 F1 points at same speedup (ZipLM > CoFi/TinyBERT) | CoFi / TinyBERT | +3 F1 | SQuAD dev | Figure 2; Table 5 | Section 4.1
Accuracy | 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP) | dense BERTbase | meets 99% recovery at these speedups | various dev sets | 99% recovery paragraph in Section 4.1 | Section 4.1

What To Try In 7 Days

Run ZipLM on a production checkpoint with a short calibration set and your target latency table to get 2x–5x models quickly

Generate a per-layer latency table (few runs) on your target device and use ZipLM to meet hard latency or throughput SLAs

For edge builds, chain ZipLM + unstructured pruning + INT8 quantization and measure end-to-end CPU latency
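Building the latency table mostly comes down to timing each candidate layer size a few times on the target device. The sketch below shows one way to do that with the Python standard library. `make_dummy_layer` is a hypothetical stand-in for a real layer at a given size, and the warmup and repeat counts are arbitrary choices.

```python
import time
from statistics import median


def time_call(fn, warmup=3, repeats=10):
    """Median wall-clock time of fn() in seconds, after a few warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def build_latency_table(make_layer, sizes, warmup=3, repeats=10):
    """Map each candidate size to the median latency of make_layer(size)."""
    return {
        size: time_call(make_layer(size), warmup=warmup, repeats=repeats)
        for size in sizes
    }


def make_dummy_layer(size):
    """Stand-in workload whose cost grows with `size` (not a real layer)."""
    data = list(range(size * 1000))
    return lambda: sum(x * x for x in data)


table = build_latency_table(make_dummy_layer, sizes=[1, 2, 4])
```

For a real model, `make_layer` would return a callable that runs the layer at the given width with the production batch size and sequence length. On a GPU the timed region also has to wait for the device to finish (for example with `torch.cuda.synchronize()` in PyTorch), otherwise only the kernel launch is measured.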

Optimization Features

Infra Optimization
optimizes for device specifics (V100, A100, CPU); reduces end-to-end GPU hours for families of models
Model Optimization
structured_pruning; attention_head_pruning; feedforward_intermediate_shrinking; module_removal (depth reduction)
System Optimization
compatible with GPU and CPU inference engines; works with DeepSparse for CPU deployment
Training Optimization
layerwise_token_distillation; post-training one-shot pruning; gradual pruning with in-between finetuning
Inference Optimization
inference-aware pruning (latency table); speedup-targeted search; pruning that preserves measured latency/throughput
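To make the layerwise token distillation item concrete, here is one plausible form of such a loss: for every layer, the squared error between student and teacher hidden states is computed per token, normalized by the teacher token's squared norm, and averaged over non-padding tokens. The exact formulation is an assumption and is not taken from the paper. The name `token_distill_loss` and the nested-list input layout are hypothetical.

```python
def token_distill_loss(student_layers, teacher_layers, mask, eps=1e-8):
    """Layerwise, per-token distillation loss (assumed form).

    student_layers, teacher_layers : [layer][token][hidden] nested lists
    mask                           : [token], 1 for real tokens, 0 for padding
    """
    n_real = sum(mask)
    total = 0.0
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        layer_loss = 0.0
        for s_tok, t_tok, m in zip(s_layer, t_layer, mask):
            if not m:
                continue  # padding tokens do not contribute
            err = sum((s - t) ** 2 for s, t in zip(s_tok, t_tok))
            norm = sum(t * t for t in t_tok)
            layer_loss += err / (norm + eps)
        total += layer_loss / n_real
    return total


# One layer, two tokens; the second token is padding and is ignored.
teacher = [[[3.0, 4.0], [0.0, 0.0]]]
student = [[[3.0, 3.0], [9.0, 9.0]]]
loss = token_distill_loss(student, teacher, mask=[1, 0])
```

Matching every student layer to the same-index teacher layer is possible here because pruning keeps the layer layout of the original model, which is one way to avoid a hand-built layer mapping.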

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Benchmarks are English-only; performance in low-resource languages is untested

Relies on calibration data and a correct latency table for the target device

When Not To Use

When absolute top accuracy matters and any small drop is unacceptable

If you cannot run short latency benchmarks on the target device (no latency table)

Failure Modes

Speedup estimates may be off if the inference stack changes; observed deviations up to 5.28%

Very small calibration sets can reduce final accuracy unless token distillation is enabled
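Since the first failure mode depends on the inference stack staying unchanged, it can help to re-measure speedup after any stack upgrade and compare it with the target. The sketch below assumes deviation is defined as the relative gap between measured and target speedup; the paper's exact definition may differ, and the example numbers are invented. Using 5.28% as a tolerance simply reuses the largest deviation reported above.

```python
def speedup_deviation(target, measured):
    """Relative gap between measured and target speedup, as a fraction."""
    return abs(measured - target) / target


def within_tolerance(target, measured, tolerance=0.0528):
    return speedup_deviation(target, measured) <= tolerance


# Hypothetical check after an inference-engine upgrade.
dev = speedup_deviation(target=4.0, measured=3.85)
still_ok = within_tolerance(target=4.0, measured=3.85)
needs_rebuild = not within_tolerance(target=4.0, measured=3.4)
```

A failed check is a signal to regenerate the latency table on the new stack and rerun the search.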

Core Entities

Models

BERTbase, BERTlarge, GPT2 (124M), ZipBERT, ZipGPT2, CoFi, TinyBERT, DistilGPT2, MobileBERT, Optimal Brain Surgeon, SPDY

Metrics

F1, Accuracy, Perplexity, Speedup, Latency, Throughput

Datasets

SQuAD v1.1, GLUE (subset: SST-2, QNLI, MNLI, QQP, CoLA, MRPC, STS-B, RTE), OpenWebTextCorpus, WikiText

Benchmarks

SQuAD, GLUE, MLPerf