Practical recipes to shrink large LLMs 5–20× and serve them with major latency wins

Overview

Decision Snapshot: Ready For Pilot

The paper provides deployed A/B results, hardware benchmarks, and ablations across distillation, pruning, and quantization, so the methods are practical but rely on internal data and heavy infra.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 4/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 85%

Novelty: 45%

Authors

Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder

Links

Abstract / PDF / Code

Why It Matters For Business

You can run near-FM quality models in production by distilling then pruning; this cuts serving cost and latency so ranking and generative features scale to real traffic.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper shows how to turn very large recommendation LLMs into compact, fast models for production. The team uses white-box knowledge distillation, structured pruning (OSSCAR), and targeted quantization to produce small language models (SLMs) that keep nearly the original accuracy while cutting model size (5–20×) and reducing latency (prefill speedups >28%). They share training recipes (two-stage distillation with forward-KL and on-policy steps), ablations on pruning schedules, quantization trade-offs (FP8 and INT4/INT8), and real deployment numbers from large-scale A/B tests and GPU benchmarks. Code and pipelines are available; the recommendation training data is internal and not released.

Problem Statement

Large foundation LLMs give better accuracy but are too big and slow for latency-sensitive recommendation workloads. The paper asks: how to compress and deploy smaller models that keep most quality while meeting throughput and tight latency constraints?

Main Contribution

End-to-end recipe to create SLMs for ranking and reasoning: distill → structured prune → re-distill → (optional) quantize.

Empirical ablations showing two-stage distillation (supervised then on-policy FKL) improves generative quality versus single-stage.

Key Findings

You can reduce a 100B+ foundation model to a compressed SLM for online serving with modest quality loss.

Numbers: model size reduced >20× (Abstract)

Practical Use: When serving ranking tasks, aim to distill then prune so you can run models online instead of offline-heavy FMs.

Evidence Ref: Abstract; Introduction

Two-stage distillation (SFT then on-policy FKL) yields lower validation loss than single-stage approaches.

Numbers: best val loss 0.1863 (FKL 14B→1.5B, Table 1)

Practical Use: Use a two-stage pipeline (supervised fine-tune then on-policy forward-KL) to get the best student generative quality.

Evidence Ref: Table 1
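The forward-KL objective named above can be sketched for a single token position as follows. This is a minimal illustration, not the paper's training code: the function names (`log_softmax`, `forward_kl`, `sequence_fkl`) and the `temperature` parameter are assumptions introduced here. In the on-policy stage, the positions being scored would presumably come from sequences sampled from the student rather than from a fixed dataset.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position (hypothetical helper)."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher distribution
    log_q = log_softmax(student_logits, temperature)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sequence_fkl(teacher_seq, student_seq, temperature=1.0):
    """Average forward KL over all token positions of one sequence."""
    per_token = [forward_kl(t, s, temperature)
                 for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)

if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    student = [[1.5, 0.7, -0.5], [0.0, 0.0, 2.0]]
    print(f"mean forward KL: {sequence_fkl(teacher, student):.4f}")
```

Forward KL weights each vocabulary entry by the teacher's probability, so the student is penalized most where it under-covers tokens the teacher considers likely.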

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Model size reduction	>20×	foundation FM (>100B)	—	overall RecSys pipeline	Paper reports >20× compression enabling online serving	Abstract; Introduction
AUC drop after pruning + KD	-0.06%	8B distilled model	vs SFT -0.47%	in-domain ranking tasks	6.4B pruned + distillation nearly recovers AUC	Table 2

What To Try In 7 Days

Run white-box KD from your best model to a smaller student and compare AUC.

Profile inference to find attention or MLP bottlenecks and target them for structured pruning.

Benchmark FP8 on H100 and INT8/INT4 on available GPUs; measure accuracy with task-specific calibration data.
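Before benchmarking on GPUs, a quick way to get a feel for quantization error is a round-trip test on a weight matrix. The sketch below uses plain symmetric round-to-nearest INT8 with one scale per row; it is not GPTQ or QuantEase, which the paper uses, and every function name here is hypothetical. Weight error is only a proxy, so task accuracy still has to be measured with task-specific calibration and evaluation data.

```python
def quantize_row_int8(row):
    """Symmetric round-to-nearest INT8 quantization with one scale per row."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    """Map INT8 codes back to floating point."""
    return [v * scale for v in q]

def mean_abs_error(weights):
    """Average absolute round-trip error over a matrix given as a list of rows."""
    total, count = 0.0, 0
    for row in weights:
        q, scale = quantize_row_int8(row)
        restored = dequantize_row(q, scale)
        total += sum(abs(a - b) for a, b in zip(row, restored))
        count += len(row)
    return total / count

if __name__ == "__main__":
    weights = [[0.50, -1.00, 0.25], [0.02, 0.01, -0.03]]
    print(f"mean abs round-trip error: {mean_abs_error(weights):.6f}")
```

Per-row scales keep a row with small weights from being crushed by a large outlier elsewhere in the matrix, which is one reason per-channel schemes are common for weight quantization.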

Optimization Features

Token Efficiency

prefill optimization via prefix caching

Infra Optimization

ZeRO/ZeRO++ for distributed training

benchmarking across H100 and A100 for quantization tradeoffs

Model Optimization

knowledge distillation (white-box, forward-KL, on-policy)

structured pruning (OSSCAR; MLP and attention head removal)

quantization (FP8, W8A8, W4A16 with QuantEase/GPTQ)

System Optimization

FlashInfer attention kernels

SGLang radix-tree caching

Liger Triton kernels for training

Training Optimization

SFT

teacher-guided re-distillation after pruning

Inference Optimization

FP8 serving on H100

prefix KV caching (Radix caching in SGLang)

attention pruning to cut prefill latency

tensor parallelism across GPUs
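To illustrate why prefix KV caching cuts prefill work, the sketch below counts how many tokens of a new request still need prefill once earlier requests sharing a prefix have been seen. It uses a plain token-level trie rather than a compressed radix tree, stores no actual KV tensors, and is not SGLang's implementation; `PrefixCache` and `tokens_to_prefill` are hypothetical names.

```python
class PrefixCache:
    """Token-level trie that records which prefixes have already been processed."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a full token sequence as cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_length(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

def tokens_to_prefill(cache, tokens):
    """Tokens that still need prefill; the request is cached afterwards."""
    reused = cache.match_length(tokens)
    cache.insert(tokens)
    return len(tokens) - reused

if __name__ == "__main__":
    cache = PrefixCache()
    shared_prompt = [101, 7, 7, 42]            # e.g. a shared system prompt
    print(tokens_to_prefill(cache, shared_prompt + [5, 6]))  # first request
    print(tokens_to_prefill(cache, shared_prompt + [8]))     # reuses the shared prefix
```

Recommendation prompts often share long instruction or profile prefixes, so the fraction of reused tokens, and with it the saving in time to first token, can be large.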

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/linkedin/FMCHISEL

Risks & Boundaries

Limitations

Experiments focus only on recommendation and reasoning workloads; results may not generalize to other domains.

Sparse-attention and unstructured pruning techniques were not integrated due to serving-engine limits.

When Not To Use

If you cannot run teacher-student training or lack the teacher model checkpoints.

When your workload cannot afford any small accuracy drop or you require exact behavior of the original FM.

Failure Modes

Aggressive one-shot pruning can produce large AUC drops unless followed by KD or gradual pruning.

INT4 quantization via naive GPTQ can degrade task accuracy substantially without QuantEase-style tuning.
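The contrast between one-shot and gradual pruning can be sketched with a simple structured-pruning loop over MLP hidden neurons. This uses a magnitude heuristic (product of input and output weight norms) as a stand-in; it is not OSSCAR, and `prune_neurons` and `gradual_schedule` are hypothetical helpers. In a real pipeline each step would be followed by re-distillation from the teacher, shown here only as a comment.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def prune_neurons(w_in, w_out, keep):
    """Keep the `keep` hidden neurons with the largest |in| * |out| weight norms.

    w_in[i] holds the input weights of hidden neuron i and w_out[i] its
    outgoing weights, so removing neuron i drops one entry from each list.
    """
    scores = [l2_norm(a) * l2_norm(b) for a, b in zip(w_in, w_out)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [w_in[i] for i in kept], [w_out[i] for i in kept], kept

def gradual_schedule(start, target, steps):
    """Linearly spaced neuron counts from `start` down to `target`."""
    return [round(start - (start - target) * s / steps)
            for s in range(1, steps + 1)]

if __name__ == "__main__":
    w_in = [[1.0, 0.0], [0.1, 0.0], [2.0, 2.0], [0.0, 0.01],
            [0.5, 0.5], [0.3, 0.0], [0.0, 0.9], [1.5, 0.1]]
    w_out = [[1.0] for _ in w_in]
    for keep in gradual_schedule(start=8, target=4, steps=2):
        w_in, w_out, kept = prune_neurons(w_in, w_out, keep)
        # A re-distillation step from the teacher would run here.
        print(f"kept {keep} neurons, surviving positions at this step: {kept}")
```

Because whole neurons are removed, the resulting matrices are simply smaller and dense, which is why structured pruning translates into latency gains on standard serving engines.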

Core Entities

Models

internal FM (Mixtral-like MoE >100B)

Llama-3.1-8B-Instruct

Llama-3.2-3B-Instruct

Qwen-2.5 1.5B student

Qwen3 4B/8B/32B (used in distillation experiments)

Metrics

AUC (predictive tasks)

validation loss (generative/reasoning)

TTFT (time to first token)

TPOT (time per output token)

IQM (internal quality metric)

Datasets

internal RecSys data (in-domain)

C4 (calibration)

OpenThoughts (reasoning)

AIME 2024/2025 (benchmarks)

PIQA, ARC easy/challenge (quantization eval)

Benchmarks

AIME 2024/2025

PIQA

ARC easy/challenge
