ZipLM: inference-aware structured pruning that gives runtime speedup guarantees across devices

Overview

Decision Snapshot: Ready For Pilot

Experiments cover encoder and decoder models, multiple GPUs and CPUs, and show consistent gains and runtime guarantees; code is public for reproduction.

Citations: 7

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 6/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 40%

Authors

Eldar Kurtic, Elias Frantar, Dan Alistarh

Why It Matters For Business

ZipLM cuts inference cost and risk: it produces many valid speedup targets in one run and guarantees measured speedups on target hardware, reducing GPU/CPU time and deployment surprises.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Founder

Summary TLDR

ZipLM is a practical structured-pruning method for Transformer language models that picks which attention heads, feed‑forward columns, or entire modules to remove based on both the accuracy loss and real measured runtime. It builds a per-layer latency lookup table, prunes one structure at a time with second-order (Hessian) information, and adds a simple token-level distillation loss. Results: better accuracy-vs-speedup than CoFi/TinyBERT on BERT tasks, matches MobileBERT by pruning BERTlarge, and compresses GPT2 to be up to 60% smaller and ~30% faster than DistilGPT2 on evaluated benchmarks. ZipLM also produces a family of models for multiple target speedups in one run and keeps measured vs.

Problem Statement

Large Transformer models deliver strong accuracy at a high inference cost. Structured pruning (removing heads, columns, or whole modules) is attractive because it yields real speedups, but it is fragile: prior methods either need expensive retraining, ignore runtime differences across hardware, or require manual layer mapping for distillation. The goal is a pruning method that is accurate, fast to run, works post‑training or gradually, and guarantees real runtime speedups in a target environment.

Main Contribution

Inference‑aware structured pruning algorithm that jointly uses weight magnitude, activation influence, and redundancy (Hessian) and prunes one structure at a time to respect correlations

Per‑layer latency table and search to meet explicit speedup/latency targets in a given inference environment
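The latency-table contribution can be illustrated with a small sketch. This is not the authors' search procedure: the table layout, every number in `LATENCY_TABLE`, and the brute-force `pick_config` helper are hypothetical, chosen only to show how a speedup target turns into a latency budget that a per-layer choice must satisfy.

```python
from itertools import product

# Hypothetical latency table: one list of options per prunable block.
# Each option is (structures_kept, latency_ms, estimated_loss).
# All numbers are invented for illustration.
LATENCY_TABLE = [
    [(12, 1.00, 0.00), (6, 0.60, 0.02), (0, 0.00, 0.10)],       # attention heads
    [(3072, 2.00, 0.00), (1536, 1.10, 0.03), (0, 0.00, 0.20)],  # FFN columns
]

def pick_config(table, target_speedup):
    """Return the lowest-loss option per block that meets the speedup target."""
    dense_latency = sum(options[0][1] for options in table)
    budget = dense_latency / target_speedup
    best = None
    # Brute force over all combinations; fine for a toy table only.
    for choice in product(*table):
        latency = sum(opt[1] for opt in choice)
        loss = sum(opt[2] for opt in choice)
        if latency <= budget and (best is None or loss < best["loss"]):
            best = {
                "kept": [opt[0] for opt in choice],
                "latency_ms": latency,
                "loss": loss,
            }
    return best

config_2x = pick_config(LATENCY_TABLE, target_speedup=2.0)
```

With these made-up values, the 2x target leaves a 1.5 ms budget, so the cheapest feasible choice drops the attention block and halves the FFN. A real model has far too many combinations for brute force, which is why a dedicated search is needed.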

Key Findings

ZipLM beats CoFi and TinyBERT on SQuAD at the same speedup.

Numbers: ≈ +3 F1 points vs CoFi at same speedup (SQuAD dev)

Practical Use: If you need a faster BERT under a fixed latency budget, try ZipLM first to get higher accuracy at the same runtime.

Evidence Ref: Section 4.1, Figure 2; Table 5

ZipLM reaches the industry-standard 99% accuracy recovery at larger speedups than prior methods.

Numbers: BERTbase 99% recovery: 5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP)

Practical Use: For production thresholds like MLPerf's 99% accuracy, ZipLM often yields substantially faster models than alternatives.

Evidence Ref: Section 4.1 (99% recovery paragraph)
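The 99% recovery criterion is simple arithmetic: a pruned model qualifies when its score is at least 0.99 times the dense model's score. The sketch below picks the largest qualifying speedup from a family of pruned models. The helper name and all scores are hypothetical and are not taken from the paper.

```python
def max_speedup_at_recovery(dense_score, candidates, threshold=0.99):
    """Largest speedup whose score recovers `threshold` of the dense score.

    candidates: list of (speedup, score) pairs for a family of pruned models.
    Returns None if no candidate meets the threshold.
    """
    qualifying = [
        speedup
        for speedup, score in candidates
        if score / dense_score >= threshold
    ]
    return max(qualifying) if qualifying else None

# Hypothetical F1 scores for illustration only.
dense_f1 = 88.0
family = [(2, 88.1), (5, 87.3), (8, 86.0)]
best_speedup = max_speedup_at_recovery(dense_f1, family)
```

Here 87.3 / 88.0 is about 0.992, so the 5x model qualifies, while 86.0 / 88.0 is about 0.977 and the 8x model does not.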

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SQuAD F1 vs CoFi/TinyBERT	+3 F1 points at same speedup (ZipLM > CoFi/TinyBERT)	CoFi / TinyBERT	+3 F1	SQuAD dev	Figure 2; Table 5	Section 4.1
Speedup at 99% accuracy recovery	5x (SQuAD), 6x (QNLI/MNLI), 13x (SST-2), 15x (QQP)	dense BERTbase	meets 99% recovery at these speedups	various dev sets	99% recovery paragraph in Section 4.1	Section 4.1

What To Try In 7 Days

Run ZipLM on a production checkpoint with a short calibration set and your target latency table to get 2x–5x models quickly

Generate a per-layer latency table (few runs) on your target device and use ZipLM to meet hard latency or throughput SLAs

For edge builds, chain ZipLM + unstructured pruning + INT8 quantization and measure end-to-end CPU latency
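Building a latency table amounts to timing each candidate layer size a few times on the target device. The sketch below shows one way to do that with the standard library. `time_ms`, `build_latency_table`, and the `fake_layer` workload are hypothetical stand-ins; a real table would time actual attention and feed-forward blocks in the deployment inference engine.

```python
import statistics
import time

def time_ms(fn, warmup=3, runs=10):
    """Median wall-clock time of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def build_latency_table(make_layer, sizes, **timing):
    """Map each candidate size to the median latency of make_layer(size)."""
    return {size: time_ms(make_layer(size), **timing) for size in sizes}

def fake_layer(size):
    """Stand-in workload whose cost grows with `size` (illustration only)."""
    def run():
        total = 0
        for i in range(size * 1000):
            total += i * i
        return total
    return run

table = build_latency_table(fake_layer, sizes=[0, 4, 8], warmup=1, runs=5)
```

The median is used instead of the mean to damp outliers from background load. On a GPU, the device generally has to be synchronized before the clock is read, otherwise asynchronous kernel launches make the timings meaningless.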

Optimization Features

Infra Optimization

optimizes for device specifics (V100, A100, CPU)

reduces end-to-end GPU hours for families of models

Model Optimization

structured_pruning

attention_head_pruning

feedforward_intermediate_shrinking

module_removal (depth reduction)
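The summary describes pruning one structure at a time with second-order (Hessian) information. As a rough illustration, the sketch below applies the textbook Optimal Brain Surgeon step to a single input column of a linear layer: score every column, remove the cheapest one, and update the remaining weights and the inverse Hessian so the next step sees the correlations. ZipLM's own formulation for multi-column structures such as whole heads is not reproduced here. The function name and the toy matrices are hypothetical.

```python
def obs_prune_one_column(W, Hinv, removed=()):
    """One textbook OBS step on a linear layer (illustrative sketch).

    W:       weights as rows of floats, shape [out][in]
    Hinv:    inverse Hessian of the layer inputs, shape [in][in]
    removed: input columns already pruned (skipped here)
    Returns (pruned_column, updated_W, updated_Hinv).
    """
    n = len(Hinv)
    candidates = [i for i in range(n) if i not in removed]
    # Score: loss increase (up to a constant factor) from zeroing column i
    # once the remaining weights have been optimally adjusted.
    scores = {i: sum(row[i] ** 2 for row in W) / Hinv[i][i] for i in candidates}
    i = min(scores, key=scores.get)
    d = Hinv[i][i]
    # Compensate the remaining weights for the removed column.
    W_new = [[w - row[i] / d * Hinv[i][j] for j, w in enumerate(row)] for row in W]
    for row in W_new:
        row[i] = 0.0  # force an exact zero, avoiding round-off residue
    # Eliminate column i from the inverse Hessian (rank-one downdate).
    Hinv_new = [
        [Hinv[a][b] - Hinv[a][i] * Hinv[i][b] / d for b in range(n)]
        for a in range(n)
    ]
    return i, W_new, Hinv_new

# Toy example with two correlated inputs (values invented for illustration).
W = [[1.0, 0.1]]
Hinv = [[2.0, 1.0], [1.0, 2.0]]
col, W2, Hinv2 = obs_prune_one_column(W, Hinv)
```

In this toy case column 1 is removed, and the surviving weight moves from 1.0 to 0.95 because the two inputs are correlated. Re-scoring after every removal is what lets the method respect such correlations instead of ranking all structures once.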

System Optimization

compatible with GPU and CPU inference engines

works with DeepSparse for CPU deployment

Training Optimization

layerwise_token_distillation

post-training one-shot pruning

gradual pruning with in-between finetuning
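A layer-wise token distillation loss can be sketched as a per-token distance between student and teacher hidden states, averaged over real (non-padding) tokens and over layers. The exact form used by ZipLM is not given in this summary, so the mean-squared-error choice, the function name, and the assumption that student and teacher share layer count and hidden size are all illustrative.

```python
def token_distillation_loss(student_layers, teacher_layers, mask):
    """Mean squared error between hidden states, per token and per layer.

    student_layers, teacher_layers: nested lists shaped [layer][token][hidden]
    mask: list of 0/1 flags per token; 0 marks padding and is ignored
    """
    kept = [t for t, flag in enumerate(mask) if flag]
    if not kept or not student_layers:
        return 0.0
    total = 0.0
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        layer_sum = 0.0
        for t in kept:
            sq = [(s - r) ** 2 for s, r in zip(s_layer[t], t_layer[t])]
            layer_sum += sum(sq) / len(sq)  # average over hidden dimension
        total += layer_sum / len(kept)      # average over real tokens
    return total / len(student_layers)      # average over layers

# One layer, three tokens (the last is padding), hidden size 2.
student = [[[1.0, 1.0], [0.0, 0.0], [9.0, 9.0]]]
teacher = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]
loss = token_distillation_loss(student, teacher, mask=[1, 1, 0])
```

Because the layers are compared one-to-one, this form needs no hand-written mapping between student and teacher layers. The padded token contributes nothing even though its hidden state is far from the teacher's.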

Inference Optimization

inference-aware pruning (latency table)

speedup-targeted search

pruning that preserves measured latency/throughput

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/IST-DASLab/ZipLM

Risks & Boundaries

Limitations

Benchmarks are English-only; performance in low-resource languages is untested

Relies on calibration data and a correct latency table for the target device

When Not To Use

When absolute top accuracy matters and any small drop is unacceptable

If you cannot run short latency benchmarks on the target device (no latency table)

Failure Modes

Speedup estimates may be off if the inference stack changes; observed deviations up to 5.28%

Very small calibration sets can reduce final accuracy unless token distillation is enabled
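A cheap guard against the first failure mode is to re-measure latency after any change to the inference stack and compare the achieved speedup with the target. The sketch below computes the signed deviation in percent. The helper names are hypothetical, and using the 5.28% figure quoted above as a tolerance is an assumption for illustration, not a recommendation from the paper.

```python
def speedup_deviation_pct(target_speedup, dense_latency_ms, pruned_latency_ms):
    """Signed deviation of the measured speedup from the target, in percent."""
    measured = dense_latency_ms / pruned_latency_ms
    return (measured - target_speedup) / target_speedup * 100.0

def within_tolerance(deviation_pct, tolerance_pct=5.28):
    """True if the deviation magnitude is inside the chosen tolerance."""
    return abs(deviation_pct) <= tolerance_pct

# Hypothetical measurements: a 2x target model that runs in 5.2 ms
# against a 10.0 ms dense baseline.
deviation = speedup_deviation_pct(2.0, dense_latency_ms=10.0, pruned_latency_ms=5.2)
ok = within_tolerance(deviation)
```

A negative value means the model is slower than promised. In this example the measured speedup is about 1.92x, roughly 3.8% short of the 2x target.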

Core Entities

Models

BERTbase, BERTlarge, GPT2 (124M), ZipBERT, ZipGPT2, CoFi, TinyBERT, DistilGPT2, MobileBERT, Optimal Brain Surgeon, SPDY

Metrics

F1, Accuracy, Perplexity, Speedup, Latency, Throughput

Datasets

SQuADv1.1; GLUE (subset: SST-2, QNLI, MNLI, QQP, CoLA, MRPC, STS-B, RTE); OpenWebTextCorpus; WikiText

Benchmarks

SQuAD, GLUE, MLPerf
