Prune 50–60% of GPT-scale weights in one pass, no retraining, with minor accuracy loss
SparseGPT can remove roughly half the weights of massive GPT models without retraining. This cuts model memory and, on runtimes that exploit unstructured sparsity, inference compute, enabling cheaper hosting and faster inference. Combined with quantization, a sparse model can match the storage footprint of a lower-bit quantized model with better accuracy than quantization alone.
Key finding
Large GPT models can be pruned to 50–60% unstructured sparsity in one shot with little accuracy loss.
Numbers: 50–60% unstructured sparsity; ≈100B weights removed from each of OPT-175B and BLOOM-176B
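As a rough sanity check on the ≈100B figure, the arithmetic can be sketched as below. This is a back-of-the-envelope illustration, not part of SparseGPT itself: the parameter counts are nominal values read off the model names, the helper `weights_removed` is hypothetical, and the sketch assumes the sparsity level applies to every weight in the model, which may overstate the count if some layers are left dense.

```python
# Nominal parameter counts taken from the model names (assumption).
MODELS = {"OPT-175B": 175e9, "BLOOM-176B": 176e9}
SPARSITIES = (0.50, 0.60)


def weights_removed(total_params: float, sparsity: float) -> float:
    """Number of weights zeroed out at the given unstructured sparsity.

    Assumes the sparsity level applies uniformly to all weights.
    """
    return total_params * sparsity


for name, total in MODELS.items():
    for s in SPARSITIES:
        removed = weights_removed(total, s)
        kept = total - removed
        print(f"{name} @ {s:.0%}: ~{removed / 1e9:.1f}B removed, ~{kept / 1e9:.1f}B kept")
```

Under these assumptions the removed count falls between about 88B (50%) and about 106B (60%) per model, which is consistent with the ≈100B figure above.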

