Overview
Experiments cover encoder and decoder models, multiple GPUs and CPUs, and show consistent gains and runtime guarantees; code is public for reproduction.
Citations: 7
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 7/7
Findings with evidence refs: 7/7
Results with explicit delta: 6/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 40%
Why It Matters For Business
ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.
Summary TLDR
ZipLM is a practical structured-pruning method for Transformer language models that picks which attention heads, feed‑forward columns, or entire modules to remove based on both the accuracy loss and real measured runtime. It builds a per-layer latency lookup table, prunes one structure at a time with second-order (Hessian) information, and adds a simple token-level distillation loss. Results: better accuracy-vs-speedup than CoFi/TinyBERT on BERT tasks, matches MobileBERT by pruning BERTlarge, and compresses GPT2 to be up to 60% smaller and ~30% faster than DistilGPT2 on evaluated benchmarks. ZipLM also produces a family of models for multiple target speedups in one run and keeps measured vs.
Problem Statement
Large Transformer models give strong accuracy but high inference cost. Structured pruning (remove heads/columns/modules) is attractive for real speedups but is fragile: prior methods either need expensive retraining, ignore runtime differences across hardware, or require manual distillation layer mapping. We need a pruning method that is accurate, fast to run, works post‑training or gradually, and guarantees real runtime speedups in a target environment.
Main Contribution
Inference‑aware structured pruning algorithm that jointly uses weight magnitude, activation influence, and redundancy (Hessian) and prunes one structure at a time to respect correlations
Per‑layer latency table and search to meet explicit speedup/latency targets in a given inference environment
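The latency-table search in the second contribution can be illustrated with a small sketch. This is not ZipLM's actual search procedure: the table entries, the per-option accuracy costs, and the names `LAYER_OPTIONS` and `pick_config` are hypothetical, and exhaustive enumeration stands in for whatever search the authors use. It only shows the idea of picking one option per layer so that the summed latency fits an explicit speedup target.

```python
from itertools import product

# Hypothetical per-layer options: (label, latency in ms, estimated accuracy cost).
# In practice the latencies would be measured on the target device.
LAYER_OPTIONS = [
    [("dense", 4.0, 0.00), ("half", 2.2, 0.02), ("dropped", 0.0, 0.10)],
    [("dense", 4.0, 0.00), ("half", 2.2, 0.01), ("dropped", 0.0, 0.08)],
    [("dense", 4.0, 0.00), ("half", 2.2, 0.03), ("dropped", 0.0, 0.20)],
]


def pick_config(layer_options, target_speedup):
    """Return (labels, latency, cost) with the lowest cost that meets the target.

    The first option of every layer is assumed to be the unpruned one, so the
    latency budget is the dense latency divided by the target speedup.
    Returns None if no combination fits the budget.
    """
    dense_latency = sum(options[0][1] for options in layer_options)
    budget = dense_latency / target_speedup
    best = None
    for combo in product(*layer_options):
        latency = sum(opt[1] for opt in combo)
        cost = sum(opt[2] for opt in combo)
        if latency <= budget and (best is None or cost < best[2]):
            best = ([opt[0] for opt in combo], latency, cost)
    return best


labels, latency, cost = pick_config(LAYER_OPTIONS, target_speedup=2.0)
print(labels, latency, round(cost, 2))
```

Exhaustive enumeration grows exponentially with the number of layers, so a real model would need dynamic programming or a similar structured search. The sketch keeps brute force only to make the constraint easy to read.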
Key Findings
ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.
ZipLM reaches the industry-standard 99% accuracy recovery at larger speedups than prior methods.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| SQuAD F1 vs CoFi/TinyBERT | +3 F1 points at same speedup (ZipLM > CoFi/TinyBERT) | CoFi / TinyBERT | +3 F1 | SQuAD dev | Figure 2; Table 5 | Section 4.1 |
| Speedup at 99% accuracy recovery | 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP) | dense BERTbase | meets 99% recovery at these speedups | various dev sets | 99% recovery paragraph in Section 4.1 | Section 4.1 |
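
The 99% recovery criterion in the table is a simple ratio of the pruned model's score to the dense model's score. A minimal sketch of that check follows; the function names and the example scores are hypothetical and chosen only to show the arithmetic, not taken from the paper.

```python
def recovery(pruned_score, dense_score):
    """Fraction of the dense model's score retained by the pruned model."""
    if dense_score <= 0:
        raise ValueError("dense_score must be positive")
    return pruned_score / dense_score


def meets_recovery(pruned_score, dense_score, threshold=0.99):
    """True if the pruned model retains at least `threshold` of the dense score."""
    return recovery(pruned_score, dense_score) >= threshold


# Hypothetical F1 scores, for illustration only.
dense_f1 = 88.0
print(meets_recovery(87.2, dense_f1))  # 87.2 / 88.0 is about 0.9909
print(meets_recovery(86.9, dense_f1))  # 86.9 / 88.0 is about 0.9875
```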
What To Try In 7 Days
Run ZipLM on a production checkpoint with a short calibration set and your target latency table to get 2x–5x models quickly
Generate a per-layer latency table (few runs) on your target device and use ZipLM to meet hard latency or throughput SLAs
For edge builds, chain ZipLM + unstructured pruning + INT8 quantization and measure end-to-end CPU latency
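Building the per-layer latency table mentioned above comes down to timing each candidate layer size on the target device. The harness below is a hedged sketch of that step using only the Python standard library. `measure_latency_ms`, `build_latency_table`, and `fake_layer` are hypothetical names, and the workload is a stand-in rather than a real Transformer layer.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, repeats=10):
    """Median wall-clock time of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def build_latency_table(candidates, make_workload):
    """Map each candidate size to its measured latency in milliseconds."""
    return {size: measure_latency_ms(make_workload(size)) for size in candidates}


def fake_layer(width):
    # Stand-in workload. A real table would time the actual layer forward pass
    # at this width, on the target device and inference stack.
    data = list(range(width * 1000))
    return lambda: sum(x * x for x in data)


table = build_latency_table([0, 4, 8, 12], fake_layer)
for heads, ms in table.items():
    print(f"{heads:>2} heads: {ms:.3f} ms")
```

The median is used because single timings are noisy. On a GPU the timed call would also need to wait for the device to finish, since kernels launch asynchronously.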
Risks & Boundaries
Limitations
Benchmarks are English-only; performance in low-resource languages is untested
Relies on calibration data and a correct latency table for the target device
When Not To Use
When absolute top accuracy matters and any small drop is unacceptable
If you cannot run short latency benchmarks on the target device (no latency table)
Failure Modes
Speedup estimates may be off if the inference stack changes; observed deviations up to 5.28%
Very small calibration sets can reduce final accuracy unless token distillation is enabled
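One way to guard against the first failure mode is to re-measure the speedup after any change to the inference stack and compare it with the target. The sketch below assumes the 5.28% figure reported above is reused as an alert threshold, which is an illustrative choice rather than something the source prescribes. The function names and latency values are hypothetical.

```python
def speedup_deviation(target_speedup, dense_latency_ms, pruned_latency_ms):
    """Signed relative gap between the measured and the targeted speedup."""
    measured = dense_latency_ms / pruned_latency_ms
    return (measured - target_speedup) / target_speedup


def within_tolerance(target_speedup, dense_latency_ms, pruned_latency_ms, tol=0.0528):
    """True if the measured speedup is within `tol` of the target (assumed threshold)."""
    deviation = speedup_deviation(target_speedup, dense_latency_ms, pruned_latency_ms)
    return abs(deviation) <= tol


# Hypothetical measurements for a model pruned to a 2x target.
print(within_tolerance(2.0, dense_latency_ms=12.0, pruned_latency_ms=6.3))
print(within_tolerance(2.0, dense_latency_ms=12.0, pruned_latency_ms=6.5))
```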

