Practical recipes to shrink large LLMs 5–20× and serve them with major latency wins

February 20, 2025 · 9 min

Overview

Production Readiness

0.85

Novelty Score

0.45

Cost Impact Score

0.8

Citation Count

1

Authors

Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder

Why It Matters For Business

Distilling and then pruning a foundation model (FM) yields models of near-FM quality that are cheap enough to run in production; the lower serving cost and latency let ranking and generative features scale to real traffic.

Summary TLDR

This paper shows how to turn very large recommendation LLMs into compact, fast models for production. The team uses white-box knowledge distillation, structured pruning (OSSCAR), and targeted quantization to produce small language models (SLMs) that keep nearly the original accuracy while cutting model size (5–20×) and reducing latency (prefill speedups >28%). They share training recipes (two-stage distillation with forward-KL and on-policy steps), ablations on pruning schedules, quantization trade-offs (FP8 and INT4/INT8), and real deployment numbers from large-scale A/B tests and GPU benchmarks. Code and pipelines are available.

Problem Statement

Large foundation LLMs give better accuracy but are too big and slow for latency-sensitive recommendation workloads. The paper asks how to compress them into smaller models that keep most of the quality while meeting throughput targets and tight latency constraints.

Main Contribution

End-to-end recipe to create SLMs for ranking and reasoning: distill → structured prune → re-distill → (optional) quantize.

Empirical ablations showing two-stage distillation (supervised then on-policy FKL) improves generative quality versus single-stage.

Demonstration of structured pruning (OSSCAR) plus KD to shrink models 20× with minimal AUC loss and concrete deployment gains (TTFT/throughput) on H100/A100 clusters.

Quantization guidance: FP8 gives best H100 latency; W4A16 (INT4 weights) needs careful calibration (QuantEase) to avoid accuracy loss.

Key Findings

You can reduce a 100B+ foundation model to a compressed SLM for online serving with modest quality loss.

Numbers: model size reduced >20× (Abstract)

Two-stage distillation (SFT then on-policy FKL) yields lower validation loss than single-stage approaches.

Numbers: best val loss 0.1863 (FKL 14B→1.5B, Table 1)
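
To make the forward-KL (FKL) objective concrete, here is a minimal pure-Python sketch of the per-token loss KL(teacher || student) computed from raw logits. It is an illustration, not the paper's training code: the function names and the `temperature` parameter are assumptions. In an on-policy step the same loss would be evaluated on sequences sampled from the student rather than on a fixed dataset.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position."""
    t = log_softmax([x / temperature for x in teacher_logits])
    s = log_softmax([x / temperature for x in student_logits])
    # Expectation is taken under the teacher distribution (the "forward" direction).
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))

def sequence_fkl(teacher_seq, student_seq, temperature=1.0):
    """Mean forward KL over all token positions of one sequence."""
    losses = [forward_kl(t, s, temperature) for t, s in zip(teacher_seq, student_seq)]
    return sum(losses) / len(losses)
```

The loss is zero when the student matches the teacher distribution and grows as the student puts too little mass on tokens the teacher considers likely.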

Knowledge distillation restores accuracy after structured pruning far better than supervised fine-tuning.

Numbers: 6.4B pruned: SFT AUC Δ = -0.47% vs KD AUC Δ = -0.06% (Table 2)
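
As an illustration of what "structured" pruning removes, the sketch below drops whole hidden neurons from a two-matrix MLP block, so the resulting matrices are smaller and dense. It uses a simple weight-magnitude score, which is a stand-in chosen for brevity and is not the OSSCAR criterion used in the paper; the function name and matrix layout are assumptions.

```python
import math

def prune_mlp_neurons(w_up, w_down, keep):
    """Keep the `keep` highest-scoring hidden neurons of an MLP block.

    w_up:   hidden x d_in  (one row per hidden neuron)
    w_down: d_out x hidden (one column per hidden neuron)
    Score (an assumption, not OSSCAR): ||w_up row|| * ||w_down column||.
    """
    hidden = len(w_up)
    scores = []
    for j in range(hidden):
        up_norm = math.sqrt(sum(v * v for v in w_up[j]))
        down_norm = math.sqrt(sum(row[j] ** 2 for row in w_down))
        scores.append(up_norm * down_norm)
    ranked = sorted(range(hidden), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    new_up = [w_up[j] for j in kept]
    new_down = [[row[j] for j in kept] for row in w_down]
    return new_up, new_down, kept
```

After a step like this the pruned model is retrained against the teacher, which is the stage where the KD versus SFT difference reported above shows up.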

Gradual pruning plus distillation can make large pruning near-lossless.

Numbers: 3B → 2.4B: one-shot Δ = -0.12% vs gradual Δ = +0.03% (Table 4)
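
The difference between one-shot and gradual pruning can be sketched as a loop that alternates small pruning steps with distillation. The linear schedule, the step count, and the `prune_fn` / `distill_fn` callbacks below are all hypothetical placeholders; the source does not specify the schedule used.

```python
def gradual_schedule(start_params, target_params, num_steps):
    """Intermediate model sizes, evenly spaced from start to target."""
    return [
        start_params + (target_params - start_params) * i / num_steps
        for i in range(1, num_steps + 1)
    ]

def prune_gradually(model, start_params, target_params, num_steps, prune_fn, distill_fn):
    """Alternate pruning and distillation; num_steps=1 is one-shot pruning."""
    for size in gradual_schedule(start_params, target_params, num_steps):
        model = prune_fn(model, size)   # hypothetical: shrink model to `size` params
        model = distill_fn(model)       # hypothetical: recover quality from teacher
    return model
```

For example, `gradual_schedule(3.0, 2.4, 3)` yields sizes of roughly 2.8B, 2.6B, and 2.4B parameters, with a distillation pass after each.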

Attention pruning reduces attention latency and overall prefill time significantly.

Numbers: attention latency ≈ 40% faster → >28% prefill speedup (Section 6.1, Fig. 4)
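
The link between the two numbers follows an Amdahl-style argument: speeding up only attention helps in proportion to attention's share of prefill time. The sketch below assumes "40% faster" means attention time drops by 40%, and treats the attention share as an unknown; neither reading is confirmed by the source.

```python
def prefill_time_reduction(attention_fraction, attention_time_reduction):
    """Fraction of total prefill time saved when only attention gets faster."""
    return attention_fraction * attention_time_reduction

def required_attention_fraction(total_reduction, attention_time_reduction):
    """Attention share of prefill time implied by an observed total saving."""
    return total_reduction / attention_time_reduction

# Under the stated assumption, a 28% prefill saving from a 40% attention
# saving implies attention accounts for about 70% of prefill time.
implied_share = required_attention_fraction(0.28, 0.40)
```

This is why profiling comes first: if attention were only a small share of prefill time, the same pruning would yield a much smaller end-to-end gain.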

Distillation improved an internal generative-quality metric in production A/B.

Numbers: IQM up +20.29% in 1% A/B test (Section 6.2)

Quantization choices depend on hardware and task: FP8 best on H100; INT4 needs QuantEase to avoid accuracy loss.

Numbers: FP8 p50 TTFT 122 ms vs FP16 136 ms on H100; W4A16 GPTQ drops ARC-c from 0.5299 to 0.4360 (Tables 6, 7)
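
For readers unfamiliar with W4A16 (4-bit weights, 16-bit activations), the sketch below shows the simplest possible weight quantizer: symmetric round-to-nearest with one scale per group. It is a baseline for intuition only. GPTQ and QuantEase go further by adjusting weights using calibration data, and neither is implemented here; the function names and default group size are assumptions.

```python
def quantize_group_int4(weights):
    """Symmetric round-to-nearest INT4 quantization of one weight group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Signed 4-bit integers cover the range [-8, 7].
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Map INT4 codes back to approximate floating-point weights."""
    return [v * scale for v in q]

def quantize_w4(weights, group_size=4):
    """Quantize a flat weight list group by group; returns (codes, scale) pairs."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group_int4(g) for g in groups]
```

Each weight is reconstructed to within half a quantization step of its original value; whether that error is tolerable depends on the task, which is why the finding above recommends evaluating on task-specific data.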

Context length increases greatly inflate latency; KV caching limits repeated prefill work when ranking multiple candidates.

Numbers: 32k context has much higher TTFT than 16k; KV caching reduces the extra cost when ranking k > 1 candidates (Section 6.1; Tables 8–10)
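
A rough token-count model shows why prefix KV caching matters when several candidates share one long context. The function below counts tokens that must be prefilled, which is only a proxy for latency (attention cost is not linear in length, and cached-prefix candidates still attend over the prefix). The function name and the example lengths are illustrative assumptions.

```python
def prefill_tokens(prefix_len, candidate_lens, use_prefix_cache):
    """Tokens that must be prefilled to score every candidate."""
    if use_prefix_cache:
        # Shared prefix is processed once; each candidate adds only its own tokens.
        return prefix_len + sum(candidate_lens)
    # Without caching, the prefix is re-processed for every candidate.
    return sum(prefix_len + c for c in candidate_lens)

no_cache = prefill_tokens(16000, [200] * 4, use_prefix_cache=False)
with_cache = prefill_tokens(16000, [200] * 4, use_prefix_cache=True)
```

With a single candidate the two counts are identical, so the benefit only appears when more than one candidate is ranked against the same prefix.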

Results

Model size reduction

Value: >20×

Baseline: foundation FM (>100B)

AUC drop after pruning + KD

Value: -0.06%

Baseline: 8B distilled model

Prefill latency improvement from attention pruning

Value: >28% speedup

Baseline: unpruned model prefill

Generative IQM improvement in A/B

Value: +20.29%

Baseline: previous model

TTFT (p50) H100 FP8 vs FP16 (generative)

Value: 122 ms vs 136 ms

Baseline: FP16 H100

Accuracy under W4A16 GPTQ quantization (ARC-c)

Value: 0.5299 → 0.4360

Baseline: FP16

Who Should Care

What To Try In 7 Days

Run white-box KD from your best model to a smaller student and compare AUC.

Profile inference to find attention or MLP bottlenecks and target them for structured pruning.

Benchmark FP8 on H100 and INT8/INT4 on available GPUs; measure accuracy with task-specific calibration data.
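
For the first step above, a small helper for comparing teacher and student AUC might look like the following. It computes AUC as the fraction of positive/negative pairs ranked correctly, which is quadratic in the number of examples and suited only to small evaluation samples. Reporting the change as a relative percentage is an assumption; the source does not say whether its AUC Δ values are relative or absolute.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def relative_auc_delta_pct(teacher_auc, student_auc):
    """Student AUC change relative to the teacher, in percent."""
    return 100.0 * (student_auc - teacher_auc) / teacher_auc
```

Evaluate both models on the same held-out examples so the delta reflects the models rather than the sample.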

Optimization Features

Token Efficiency

  • prefill optimization via prefix caching

Infra Optimization

  • ZeRO/ZeRO++ for distributed training
  • benchmarking across H100 and A100 for quantization tradeoffs

Model Optimization

  • knowledge distillation (white-box, forward-KL, on-policy)
  • structured pruning (OSSCAR; MLP and attention head removal)
  • quantization (FP8, W8A8, W4A16 with QuantEase/GPTQ)

System Optimization

  • FlashInfer attention kernels
  • SGLang radix-tree caching
  • Liger Triton kernels for training

Training Optimization

  • SFT
  • teacher-guided re-distillation after pruning

Inference Optimization

  • FP8 serving on H100
  • prefix KV caching (Radix caching in SGLang)
  • attention pruning to cut prefill latency
  • tensor parallelism across GPUs

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus only on recommendation and reasoning workloads; results may not generalize to other domains.
  • Sparse-attention and unstructured pruning techniques were not integrated due to serving-engine limits.
  • Some core models and calibration data are internal, limiting exact reproducibility of all results.

When Not To Use

  • If you cannot run teacher-student training or lack the teacher model checkpoints.
  • When your workload cannot tolerate even a small accuracy drop, or you require the exact behavior of the original FM.
  • If your serving stack does not support FP8 or the attention kernels used (FlashInfer/SGLang).

Failure Modes

  • Aggressive one-shot pruning can produce large AUC drops unless followed by KD or gradual pruning.
  • INT4 quantization via naive GPTQ can degrade task accuracy substantially without QuantEase-style tuning.
  • Long-context workloads (32k) can sharply inflate TTFT if KV caching is not effective.

Core Entities

Models

  • internal FM (Mixtral-like MoE >100B)
  • Llama-3.1-8B-Instruct
  • Llama-3.2-3B-Instruct
  • Qwen-2.5 1.5B student
  • Qwen3 4B/8B/32B (used in distillation experiments)

Metrics

  • AUC (predictive tasks)
  • validation loss (generative/reasoning)
  • TTFT (time to first token)
  • TPOT (time per output token)
  • IQM (internal quality metric)

Datasets

  • internal RecSys data (in-domain)
  • C4 (calibration)
  • OpenThoughts (reasoning)
  • AIME 2024/2025 (benchmarks)
  • PIQA, ARC easy/challenge (quantization eval)

Benchmarks

  • AIME 2024/2025
  • PIQA
  • ARC easy/challenge