Task‑agnostic structured pruning for LLMs that cuts params and memory, then recovers in hours with 50K samples
You can cut ~20% of an LLM's parameters to reduce memory use and speed up inference without losing most zero-shot ability, then re-tune the pruned model in hours on one GPU with modest public data, which is much cheaper than retraining or distillation.
Key finding
A model with 20% of its parameters pruned can recover near-original zero-shot accuracy with limited tuning
Numbers: the 20%-pruned model reaches an average accuracy of 60.07 against a baseline of 63.25, so it retains 60.07 / 63.25 ≈ 94.97% of baseline accuracy
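
The retention figure is the pruned model's average accuracy divided by the baseline's. A minimal sketch of that arithmetic, using only the two scores reported above; `retained_fraction` is a hypothetical helper name introduced here for illustration, not part of any released code:

```python
def retained_fraction(pruned_acc: float, baseline_acc: float) -> float:
    """Return the share of baseline accuracy kept by the pruned model."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return pruned_acc / baseline_acc


# Scores taken from the summary: 20%-pruned model vs. unpruned baseline.
pruned_avg = 60.07
baseline_avg = 63.25

retained = retained_fraction(pruned_avg, baseline_avg)
absolute_drop = baseline_avg - pruned_avg

print(f"retained: {retained:.2%}")                   # retained: 94.97%
print(f"absolute drop: {absolute_drop:.2f} points")  # absolute drop: 3.18 points
```

Reporting the absolute drop alongside the ratio helps, since a 94.97% retention here corresponds to a loss of about 3.18 accuracy points.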

