Practical recipes to shrink large LLMs 5–20× and serve them with major latency wins

February 20, 2025 | 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides deployed A/B results, hardware benchmarks, and ablations across distillation, pruning, and quantization. The methods are therefore practical, but they rely on internal data and heavy infrastructure.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 45%

Authors

Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder

Why It Matters For Business

You can run near-FM quality models in production by distilling then pruning; this cuts serving cost and latency so ranking and generative features scale to real traffic.

Summary TLDR

This paper shows how to turn very large recommendation LLMs into compact, fast models for production. The team uses white-box knowledge distillation, structured pruning (OSSCAR), and targeted quantization to produce small language models (SLMs) that keep nearly the original accuracy while cutting model size (5–20×) and reducing latency (prefill speedups >28%). They share training recipes (two-stage distillation with forward-KL and on-policy steps), ablations on pruning schedules, quantization trade-offs (FP8 and INT4/INT8), and real deployment numbers from large-scale A/B tests and GPU benchmarks. Code and pipelines are available.

Problem Statement

Large foundation LLMs give better accuracy but are too big and slow for latency-sensitive recommendation workloads. The paper asks how such models can be compressed into smaller ones that retain most of the quality while meeting throughput and tight latency constraints.

Main Contribution

End-to-end recipe to create SLMs for ranking and reasoning: distill → structured prune → re-distill → (optional) quantize.

Empirical ablations showing two-stage distillation (supervised then on-policy FKL) improves generative quality versus single-stage.

Key Findings

You can reduce a 100B+ foundation model to a compressed SLM for online serving with modest quality loss.

Numbers: model size reduced >20× (Abstract)

Practical Use: When serving ranking tasks, aim to distill then prune so you can run models online instead of offline-heavy FMs.

Evidence Ref: Abstract; Introduction
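
To make the size reduction concrete, the back-of-the-envelope sketch below estimates weight memory from parameter count and bits per weight. The 100B and 5B parameter counts are illustrative assumptions chosen to match a 20× reduction, not figures reported by the paper, and the helper name `weight_memory_gb` is hypothetical. The estimate covers weights only and ignores KV cache and activation memory.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative numbers only (assumptions, not taken from the paper).
teacher_fp16 = weight_memory_gb(100e9, 16)  # 100B parameters stored in 16-bit
student_fp16 = weight_memory_gb(5e9, 16)    # 20x fewer parameters, same precision
student_fp8 = weight_memory_gb(5e9, 8)      # same student stored in 8-bit

print(f"teacher fp16: {teacher_fp16:.0f} GB")
print(f"student fp16: {student_fp16:.0f} GB")
print(f"student fp8:  {student_fp8:.0f} GB")
```

Under these assumptions, parameter reduction and lower-precision storage multiply, which is why quantization is applied as the final optional step.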

Two-stage distillation (SFT then on-policy FKL) yields lower validation loss than single-stage approaches.

Numbers: best val loss 0.1863 (FKL 14B→1.5B, Table 1)

Practical Use: Use a two-stage pipeline (supervised fine-tune then on-policy forward-KL) to get the best student generative quality.

Evidence Ref: Table 1
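
The forward-KL objective named in this finding can be illustrated with a minimal sketch. Forward KL measures KL(teacher || student) over the vocabulary at each token position. The function names and the `temperature` parameter are assumptions for illustration, not the paper's code. In an on-policy stage the sequences being scored would typically be sampled from the student; this sketch covers only the loss term.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_forward_kl(teacher_seq, student_seq, temperature=1.0):
    """Mean forward KL across the token positions of one sequence."""
    per_token = [forward_kl(t, s, temperature) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy example: 2 token positions, vocabulary of 3.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
print(sequence_forward_kl(teacher, student))
```

The loss is zero when the student matches the teacher distribution exactly and positive otherwise, so minimizing it pulls the student toward the teacher's full output distribution rather than only the top label.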

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Model size reduction | >20× | foundation FM (>100B) | not given | overall RecSys pipeline | Paper reports >20× compression enabling online serving | Abstract; Introduction
AUC drop after pruning + KD | -0.06% | 8B distilled model | vs SFT -0.47% | in-domain ranking tasks | 6.4B pruned + distillation nearly recovers AUC | Table 2
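
For readers who want to reproduce an AUC comparison like the one reported above on their own data, here is a minimal sketch. It computes AUC by pairwise comparison and expresses the change as a relative percentage. Whether the paper's deltas are relative or absolute is not stated in this summary, so the relative form is an assumption, and both function names are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_delta_pct(candidate, baseline):
    """Relative change of candidate versus baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# Toy data (made up for illustration).
labels = [1, 0, 1, 0]
baseline_scores = [0.9, 0.1, 0.6, 0.5]   # e.g. the larger model
candidate_scores = [0.9, 0.1, 0.4, 0.5]  # e.g. the compressed model
base_auc = auc(labels, baseline_scores)
cand_auc = auc(labels, candidate_scores)
print(base_auc, cand_auc, relative_delta_pct(cand_auc, base_auc))
```

The pairwise loop is quadratic in the number of examples, so a rank-based implementation from a standard library is preferable for real evaluation sets.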

What To Try In 7 Days

Run white-box KD from your best model to a smaller student and compare AUC.

Profile inference to find attention or MLP bottlenecks and target them for structured pruning.

Benchmark FP8 on H100 and INT8/INT4 on available GPUs; measure accuracy with task-specific calibration data.
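
As a starting point for the structured-pruning step in the list above, the sketch below removes whole hidden neurons from a two-layer MLP. It ranks neurons with a simple weight-norm heuristic, which is a stand-in chosen for brevity and is not the OSSCAR algorithm used in the paper. Matrices are plain nested lists and all names are hypothetical.

```python
import math


def neuron_scores(w_up, w_down):
    """Score each hidden neuron by (norm of its input row) * (norm of its output column).

    w_up:   hidden x d_in  (one row per hidden neuron)
    w_down: d_out x hidden (one column per hidden neuron)
    """
    scores = []
    for j, row in enumerate(w_up):
        up_norm = math.sqrt(sum(v * v for v in row))
        down_norm = math.sqrt(sum(r[j] * r[j] for r in w_down))
        scores.append(up_norm * down_norm)
    return scores


def prune_mlp(w_up, w_down, keep):
    """Keep the `keep` highest-scoring hidden neurons and drop the rest from both matrices."""
    scores = neuron_scores(w_up, w_down)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_up = [w_up[j] for j in kept]
    new_down = [[row[j] for j in kept] for row in w_down]
    return new_up, new_down, kept


# Toy MLP: 3 hidden neurons, 2 inputs, 2 outputs; neuron 1 carries almost no weight.
w_up = [[1.0, 0.0], [0.01, 0.01], [0.0, 2.0]]
w_down = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
new_up, new_down, kept = prune_mlp(w_up, w_down, keep=2)
print(kept)
```

Because entire rows and columns are removed, the pruned layer stays dense and needs no sparse kernels. The accuracy lost at this step is what the re-distillation stage is meant to recover.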

Optimization Features

Token Efficiency:
prefill optimization via prefix caching

Infra Optimization:
ZeRO/ZeRO++ for distributed training
benchmarking across H100 and A100 for quantization tradeoffs

Model Optimization:
knowledge distillation (white-box, forward-KL, on-policy)
structured pruning (OSSCAR; MLP and attention head removal)
quantization (FP8, W8A8, W4A16 with QuantEase/GPTQ)

System Optimization:
FlashInfer attention kernels
SGLang radix-tree caching
Liger Triton kernels for training

Training Optimization:
SFT
teacher-guided re-distillation after pruning

Inference Optimization:
FP8 serving on H100
prefix KV caching (Radix caching in SGLang)
attention pruning to cut prefill latency
tensor parallelism across GPUs
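
The prefix caching idea listed above can be illustrated with a toy token trie. When a new request shares a leading run of tokens with an earlier one, only the unmatched suffix needs prefill. This is a conceptual sketch with hypothetical names. It is not SGLang's radix-tree implementation, and it stores no KV tensors and performs no eviction.

```python
class PrefixCache:
    """Toy trie over token ids; each node is a dict mapping token -> child node."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence so later requests can reuse its prefixes."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_length(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


def tokens_to_prefill(cache, tokens):
    """Number of tokens that still need prefill after reusing the cached prefix."""
    reused = cache.match_length(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


cache = PrefixCache()
first = tokens_to_prefill(cache, [1, 2, 3, 4])      # cold cache: all tokens are new
second = tokens_to_prefill(cache, [1, 2, 3, 9, 9])  # shares the prefix [1, 2, 3]
print(first, second)
```

Recommendation prompts often share a long common preamble (instructions, member profile), which is the situation where this kind of reuse reduces prefill work.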

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments focus only on recommendation and reasoning workloads; results may not generalize to other domains.

Sparse-attention and unstructured pruning techniques were not integrated due to serving-engine limits.

When Not To Use

If you cannot run teacher-student training or lack the teacher model checkpoints.

When your workload cannot afford any small accuracy drop or you require exact behavior of the original FM.

Failure Modes

Aggressive one-shot pruning can produce large AUC drops unless followed by KD or gradual pruning.

INT4 quantization via naive GPTQ can degrade task accuracy substantially without QuantEase-style tuning.
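
To see why lower bit widths are riskier, the sketch below applies plain round-to-nearest symmetric quantization to one weight row and compares the reconstruction error at 8 and 4 bits. Round-to-nearest is used here as the simplest baseline. It is neither GPTQ nor QuantEase, and the function names and sample weights are made up for illustration.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one weight row to signed integers."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integer codes back to floating-point weights."""
    return [v * scale for v in q]


def mean_abs_error(row, bits):
    """Average absolute difference between original and dequantized weights."""
    q, scale = quantize_row(row, bits)
    restored = dequantize_row(q, scale)
    return sum(abs(a - b) for a, b in zip(row, restored)) / len(row)


weights = [0.9, -0.31, 0.05, 0.62, -0.77]  # made-up example row
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"int8 error: {err8:.5f}, int4 error: {err4:.5f}")
```

Weight reconstruction error is only a proxy. As the failure mode above notes, the decision should rest on task accuracy measured with task-specific calibration data.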

Core Entities

Models

internal FM (Mixtral-like MoE >100B)
Llama-3.1-8B-Instruct
Llama-3.2-3B-Instruct
Qwen-2.5 1.5B student
Qwen3 4B/8B/32B (used in distillation experiments)

Metrics

AUC (predictive tasks)
validation loss (generative/reasoning)
TTFT (time to first token)
TPOT (time per output token)
IQM (internal quality metric)
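
The two latency metrics can be computed from per-token timestamps, as in the minimal sketch below. It assumes the common definitions: TTFT is the gap between request arrival and the first generated token, and TPOT is the average gap between consecutive output tokens after the first. The paper's exact measurement method is not given in this summary, and the function names are hypothetical.

```python
def ttft(request_start, token_times):
    """Time to first token: first token timestamp minus request start (seconds)."""
    return token_times[0] - request_start


def tpot(token_times):
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        return 0.0  # a single token has no inter-token gap
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Made-up timestamps in seconds.
start = 10.0
stamps = [10.25, 10.30, 10.35]
print(ttft(start, stamps), tpot(stamps))
```

TTFT is dominated by prefill, so it responds to prefix caching and attention pruning, while TPOT reflects decode speed and is more sensitive to weight precision and memory bandwidth.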

Datasets

internal RecSys data (in-domain)
C4 (calibration)
OpenThoughts (reasoning)
AIME 2024/2025 (benchmarks)
PIQA, ARC easy/challenge (quantization eval)

Benchmarks

AIME 2024/2025
PIQA
ARC easy/challenge