Overview
Production Readiness
0.8
Novelty Score
0.4
Cost Impact Score
0.8
Citation Count
7
Why It Matters For Business
ZipLM cuts inference cost and risk: it produces models for many speedup targets in one run and ties each one to speedups measured on the target hardware, reducing GPU/CPU time and deployment surprises.
Summary TLDR
ZipLM is a practical structured-pruning method for Transformer language models. It picks which attention heads, feed‑forward columns, or entire modules to remove based on both the accuracy loss and the measured runtime. It builds a per-layer latency lookup table, prunes one structure at a time with second-order (Hessian) information, and adds a simple token-level distillation loss. Results: better accuracy-vs-speedup than CoFi/TinyBERT on BERT tasks, a pruned BERTlarge that matches MobileBERT, and a compressed GPT2 that is up to 60% smaller and ~30% faster than DistilGPT2 on the evaluated benchmarks. ZipLM also produces a family of models for multiple target speedups in one run and keeps measured speedups close to their targets.
Problem Statement
Large Transformer models deliver strong accuracy but incur high inference cost. Structured pruning (removing heads, columns, or modules) is attractive because it yields real speedups, but it is fragile: prior methods either need expensive retraining, ignore runtime differences across hardware, or require manual layer mapping for distillation. What is needed is a pruning method that is accurate, fast to run, works post‑training or gradually, and guarantees real runtime speedups in a target environment.
Main Contribution
Inference‑aware structured pruning algorithm that jointly uses weight magnitude, activation influence, and redundancy (Hessian) and prunes one structure at a time to respect correlations
Per‑layer latency table and search to meet explicit speedup/latency targets in a given inference environment
Layer‑wise token distillation (match token vectors) that avoids manual layer matching and helps low-data tasks
Applies to both encoder (BERT) and decoder (GPT2) models, in both post‑training (one‑shot) and gradual pruning settings
Practical pipeline that yields a family of compressed models (multiple speedups) in one run and is compatible with unstructured pruning and quantization
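The first contribution, removing one structure at a time with second-order information, can be illustrated with an Optimal Brain Surgeon style update on a single linear layer. The sketch below is a simplification under stated assumptions, not the paper's implementation: each "structure" is a single input column, the Hessian is taken to be X X^T plus a small damping term, and `invert`, `prune_columns`, and the toy data are hypothetical names and values chosen for illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def prune_columns(W, X, n_remove, damping=1e-6):
    """Remove n_remove input columns of W, one at a time.

    W: weights as rows (d_out x d_in). X: calibration inputs (d_in x n_samples).
    Returns a pruned copy of W and the removed column indices, in order.
    """
    d = len(W[0])
    # Assumed layer-wise Hessian: H = X X^T + damping * I
    H = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(d)]
         for i in range(d)]
    for i in range(d):
        H[i][i] += damping
    Hinv = invert(H)
    W = [list(row) for row in W]
    alive, removed = set(range(d)), []
    for _ in range(n_remove):
        # Saliency: output error left after deleting column c and optimally
        # adjusting the surviving weights.
        j = min(alive, key=lambda c: sum(row[c] ** 2 for row in W) / Hinv[c][c])
        hjj, h_row = Hinv[j][j], list(Hinv[j])
        # Compensate the surviving weights for the removed column.
        for row in W:
            coef = row[j] / hjj
            for k in alive:
                row[k] -= coef * h_row[k]
            row[j] = 0.0
        # Eliminate column j from the inverse Hessian so that later saliency
        # scores account for what has already been removed.
        for i in range(d):
            f = Hinv[i][j] / hjj
            Hinv[i] = [v - f * h for v, h in zip(Hinv[i], h_row)]
        for i in range(d):
            Hinv[i][j] = 0.0
        Hinv[j] = [0.0] * d
        alive.discard(j)
        removed.append(j)
    return W, removed


# Toy data: input features 0 and 1 are exact duplicates, so one is redundant.
X = [[1.0, 2.0, 3.0, 4.0],
     [1.0, 2.0, 3.0, 4.0],
     [1.0, -1.0, 1.0, -1.0]]
W = [[1.0, 1.0, 0.5]]
W_pruned, removed = prune_columns(W, X, n_remove=1)
print(removed, W_pruned)
```

In this toy case a magnitude-only criterion would remove the small third weight, whereas the Hessian-aware score removes one of the duplicated columns and shifts its weight onto the surviving duplicate, leaving the layer output almost unchanged. Updating the inverse Hessian after every removal is what lets each later choice respect correlations with structures that are already gone.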
Key Findings
ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.
ZipLM reaches the industry-standard 99% accuracy-recovery threshold at larger speedups than prior methods.
ZipLM generates a full family of compressed models at much lower cost than CoFi.
ZipGPT2 can be much smaller and faster than DistilGPT2 while improving quality.
ZipLM reliably hits its speedup targets in practice.
ZipLM is robust to small calibration sets.
ZipLM improves CPU compound compression pipelines.
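The 99% accuracy-recovery finding can be made concrete with a small check. This sketch assumes that recovery means the compressed model's metric divided by the dense model's metric; the function names and the F1 scores are hypothetical and are not results from the paper.

```python
def accuracy_recovery(compressed_score, dense_score):
    """Fraction of the dense model's metric retained by the compressed model."""
    if dense_score <= 0:
        raise ValueError("dense_score must be positive")
    return compressed_score / dense_score


def meets_recovery(compressed_score, dense_score, threshold=0.99):
    """True if the compressed model retains at least `threshold` of the metric."""
    return accuracy_recovery(compressed_score, dense_score) >= threshold


# Hypothetical F1 scores, for illustration only.
dense_f1 = 88.5
for candidate_f1 in (87.8, 87.0):
    ratio = accuracy_recovery(candidate_f1, dense_f1)
    print(candidate_f1, round(ratio, 4), meets_recovery(candidate_f1, dense_f1))
```

Applying the same check to every model in a compressed family gives the largest speedup that still clears the threshold.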
Results
SQuAD F1 vs CoFi/TinyBERT
Accuracy
Epochs to produce 2x–15x family (efficiency)
ZipGPT2 size and speed vs DistilGPT2
Target vs achieved speedup deviation
CPU compound compression improvement
Who Should Care
What To Try In 7 Days
Run ZipLM on a production checkpoint with a small calibration set and a latency table for your target device to get 2x–5x models quickly
Generate a per-layer latency table (a few runs) on your target device and use ZipLM to meet hard latency or throughput SLAs
For edge builds, chain ZipLM + unstructured pruning + INT8 quantization and measure end-to-end CPU latency
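Generating the latency table amounts to timing a layer at each candidate size on the target device. The sketch below shows one way to do that with the standard library; `build_latency_table` and `fake_attention` are hypothetical names, and the stand-in workload would be replaced by a real forward pass through the layer inside the actual inference engine.

```python
import statistics
import time


def build_latency_table(run_layer, sizes, repeats=5, warmup=1):
    """Map each candidate size to the median wall-clock time of run_layer(size)."""
    table = {}
    for size in sizes:
        for _ in range(warmup):
            run_layer(size)  # warm caches before timing
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_layer(size)
            samples.append(time.perf_counter() - start)
        table[size] = statistics.median(samples)
    return table


def fake_attention(num_heads):
    """Stand-in workload whose cost grows with the number of remaining heads."""
    total = 0
    for i in range(num_heads * 2000):
        total += i * i
    return total


# Size 0 represents removing the module entirely.
table = build_latency_table(fake_attention, sizes=[0, 4, 8, 12])
for heads, seconds in sorted(table.items()):
    print(heads, f"{seconds * 1e3:.3f} ms")
```

Using the median of several runs reduces the effect of timing noise, and measuring on the deployment device rather than estimating from parameter counts is the point of the exercise, since the same size can cost different amounts on different hardware.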
Optimization Features
Infra Optimization
- optimizes for device specifics (V100, A100, CPU)
- reduces end-to-end GPU hours for families of models
Model Optimization
- structured_pruning
- attention_head_pruning
- feedforward_intermediate_shrinking
- module_removal (depth reduction)
System Optimization
- compatible with GPU and CPU inference engines
- works with DeepSparse for CPU deployment
Training Optimization
- layerwise_token_distillation
- post-training one-shot pruning
- gradual pruning with in-between finetuning
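Layer-wise token distillation can be sketched as follows. As long as pruning keeps the hidden size and the layer order of the original model, the student's token vectors can be compared with the teacher's at the same layer index, which is presumably what removes the need for a manual layer map. The normalisation by the teacher norm, the averaging over non-padding tokens, and the name `token_distillation_loss` are assumptions made for illustration; a real training loop would compute this on framework tensors rather than Python lists.

```python
import math


def token_distillation_loss(student_layers, teacher_layers, padding_mask, eps=1e-12):
    """Mean normalised distance between student and teacher token vectors.

    student_layers[l][t] and teacher_layers[l][t] are the hidden vectors of
    token t at layer l. padding_mask[t] is True for padding tokens, which are
    skipped.
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        for s_vec, t_vec, is_pad in zip(s_layer, t_layer, padding_mask):
            if is_pad:
                continue
            diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(s_vec, t_vec)))
            norm = math.sqrt(sum(b * b for b in t_vec))
            total += diff / (norm + eps)
            count += 1
    return total / count if count else 0.0


# One layer, two tokens, hidden size 2 (toy values).
teacher = [[[3.0, 4.0], [1.0, 0.0]]]
student = [[[3.0, 4.0], [0.0, 0.0]]]
print(token_distillation_loss(student, teacher, padding_mask=[False, False]))
print(token_distillation_loss(student, teacher, padding_mask=[False, True]))
```

The first call averages a perfect match and a fully missed token; the second masks the missed token as padding, so only the matching token contributes.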
Inference Optimization
- inference-aware pruning (latency table)
- speedup-targeted search
- pruning validated against measured latency/throughput
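A speedup-targeted search over a latency table can be sketched as a small constrained selection problem: choose one size per layer so that the summed latency fits the budget implied by the target speedup while the summed error is as low as possible. Everything in this example is an assumption for illustration: the latencies and error values are made up, `pick_configuration` is a hypothetical name, and brute-force enumeration stands in for the method's own search and only scales to toy sizes.

```python
from itertools import product

# Each layer lists (label, latency_ms, error) options; the first is the dense layer.
ATTENTION = [("12 heads", 1.00, 0.0), ("8 heads", 0.72, 0.1),
             ("4 heads", 0.40, 0.4), ("0 heads", 0.00, 2.0)]
FEED_FORWARD = [("3072 cols", 2.00, 0.0), ("1536 cols", 1.05, 0.2),
                ("768 cols", 0.60, 0.6), ("0 cols", 0.00, 3.0)]


def pick_configuration(layers, target_speedup):
    """Return (labels, latency_ms, error) of the lowest-error config within budget."""
    dense_ms = sum(options[0][1] for options in layers)
    budget_ms = dense_ms / target_speedup
    best = None
    for combo in product(*layers):
        latency = sum(option[1] for option in combo)
        error = sum(option[2] for option in combo)
        if latency <= budget_ms and (best is None or error < best[2]):
            best = ([option[0] for option in combo], latency, error)
    return best


labels, latency_ms, error = pick_configuration([ATTENTION, FEED_FORWARD], 2.0)
dense_ms = ATTENTION[0][1] + FEED_FORWARD[0][1]
print(labels, round(dense_ms / latency_ms, 3), round(error, 3))
```

Because the budget is expressed in measured milliseconds rather than in parameters removed, rerunning the selection with a table from a different device can yield a different architecture for the same target speedup.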
Reproducibility
Code URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Benchmarks are English-only; performance in low-resource languages is untested
- Relies on calibration data and a correct latency table for the target device
- Extremely aggressive compression may still need finetuning for final accuracy
When Not To Use
- When absolute top accuracy matters and any small drop is unacceptable
- If you cannot run short latency benchmarks on the target device (no latency table)
- For non-Transformer model families not evaluated in the paper
Failure Modes
- Speedup estimates may be off if the inference stack changes; observed deviations up to 5.28%
- Very small calibration sets can reduce final accuracy unless token distillation is enabled
- Pruning can produce degenerate architectures if hardware-specific runtimes are not measured
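The first failure mode suggests re-measuring latency whenever the inference stack changes and comparing the achieved speedup with the target. The check below is a minimal sketch: it assumes deviation is measured relative to the target speedup, reuses the 5.28% figure quoted above purely as an illustrative tolerance, and uses hypothetical function names and latencies.

```python
def achieved_speedup(dense_latency_ms, pruned_latency_ms):
    """Speedup of the pruned model over the dense model on the same device."""
    return dense_latency_ms / pruned_latency_ms


def deviation_percent(target_speedup, achieved):
    """Relative gap between achieved and target speedup, in percent."""
    return abs(achieved - target_speedup) / target_speedup * 100.0


def check_speedup(target_speedup, dense_ms, pruned_ms, tolerance_percent=5.28):
    """Return (achieved speedup, deviation in percent, within tolerance?)."""
    achieved = achieved_speedup(dense_ms, pruned_ms)
    deviation = deviation_percent(target_speedup, achieved)
    return achieved, deviation, deviation <= tolerance_percent


# Hypothetical measurements for a model pruned towards a 2x target.
for pruned_ms in (14.5, 17.0):
    achieved, deviation, ok = check_speedup(2.0, dense_ms=30.0, pruned_ms=pruned_ms)
    print(round(achieved, 3), round(deviation, 2), ok)
```

A failed check is a signal to rebuild the latency table on the new stack and rerun the pruning search rather than to ship the existing model.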
Core Entities
Models
- BERTbase
- BERTlarge
- GPT2 (124M)
- ZipBERT
- ZipGPT2
- CoFi
- TinyBERT
- DistilGPT2
- MobileBERT
- Optimal Brain Surgeon
- SPDY
Metrics
- F1
- Accuracy
- Perplexity
- Speedup
- Latency
- Throughput
Datasets
- SQuADv1.1
- GLUE (subset: SST-2, QNLI, MNLI, QQP, CoLA, MRPC, STS-B, RTE)
- OpenWebTextCorpus
- WikiText
Benchmarks
- SQuAD
- GLUE
- MLPerf

