Overview
Production Readiness
0.85
Novelty Score
0.45
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
You can serve models with near-foundation-model (FM) quality in production by distilling and then pruning them; this cuts serving cost and latency so that ranking and generative features scale to real traffic.
Summary TLDR
This paper shows how to turn very large recommendation LLMs into compact, fast models for production. The team uses white-box knowledge distillation, structured pruning (OSSCAR), and targeted quantization to produce small language models (SLMs) that retain nearly the original accuracy while cutting model size by 5–20× and reducing latency (prefill speedups >28%). They share training recipes (two-stage distillation with forward-KL and on-policy steps), ablations on pruning schedules, quantization trade-offs (FP8 and INT4/INT8), and real deployment numbers from large-scale A/B tests and GPU benchmarks. Code and pipelines are partially open-sourced; some core models and calibration data remain internal.
Problem Statement
Large foundation LLMs give better accuracy but are too big and slow for latency-sensitive recommendation workloads. The paper asks how to compress them into smaller models that keep most of the quality while meeting throughput targets and tight latency constraints.
Main Contribution
End-to-end recipe to create SLMs for ranking and reasoning: distill → structured prune → re-distill → (optional) quantize.
Empirical ablations showing two-stage distillation (supervised then on-policy FKL) improves generative quality versus single-stage.
Demonstration of structured pruning (OSSCAR) plus KD to shrink models 20× with minimal AUC loss and concrete deployment gains (TTFT/throughput) on H100/A100 clusters.
Quantization guidance: FP8 gives best H100 latency; W4A16 (INT4 weights) needs careful calibration (QuantEase) to avoid accuracy loss.
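The forward-KL (FKL) objective named in the recipe above can be sketched in a few lines. This is a minimal illustration of white-box distillation on raw logits, assuming one logit vector per token position; the function names, the temperature parameter, and the toy logits are hypothetical and not taken from the paper's code.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    # Expectation under the teacher distribution of the log-probability gap.
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

def sequence_fkl(teacher_seq, student_seq, temperature=1.0):
    """Mean forward KL over all token positions of one sequence."""
    per_token = [forward_kl(t, s, temperature)
                 for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)

# Toy logits: 2 token positions, vocabulary of 3 (illustrative values only).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 0.0, 1.0]]
loss = sequence_fkl(teacher, student)
```

In the supervised stage the sequences would come from a fixed dataset; in an on-policy stage the student would generate the sequences and the teacher would score them, with the same per-token loss applied.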
Key Findings
You can reduce a 100B+ foundation model to a compressed SLM for online serving with modest quality loss.
Two-stage distillation (SFT then on-policy FKL) yields lower validation loss than single-stage approaches.
Knowledge distillation restores accuracy after structured pruning far better than supervised fine-tuning.
Gradual pruning plus distillation can make large pruning near-lossless.
Attention pruning reduces attention latency and overall prefill time significantly.
Distillation improved an internal generative-quality metric in a production A/B test.
Quantization choices depend on hardware and task: FP8 best on H100; INT4 needs QuantEase to avoid accuracy loss.
Longer contexts greatly increase latency; KV caching avoids repeating prefill work when ranking multiple candidates against the same prompt prefix.
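The effect of prefix KV caching when ranking several candidates can be estimated with simple token counting. The sketch below is a rough cost proxy, assuming every candidate shares one prompt prefix and that prefill work scales with the number of tokens processed (attention cost actually grows faster than linearly); the lengths are made-up numbers, not measurements from the paper.

```python
def prefill_tokens(prefix_len, suffix_lens, cache_prefix):
    """Count tokens that must be prefilled to score all candidates.

    prefix_len   -- tokens in the shared prompt (e.g. user history)
    suffix_lens  -- tokens appended per candidate
    cache_prefix -- True if the shared prefix KV cache is reused
    """
    if cache_prefix:
        # The prefix is prefilled once; each candidate adds only its suffix.
        return prefix_len + sum(suffix_lens)
    # Without caching, the prefix is recomputed for every candidate.
    return sum(prefix_len + s for s in suffix_lens)

# Hypothetical workload: 2,000-token prefix, 20 candidates of 50 tokens each.
prefix_len = 2000
suffix_lens = [50] * 20

no_cache = prefill_tokens(prefix_len, suffix_lens, cache_prefix=False)
with_cache = prefill_tokens(prefix_len, suffix_lens, cache_prefix=True)
reduction = no_cache / with_cache
```

Under these assumed lengths the cached path prefills 3,000 tokens instead of 41,000, which is why cache hit rate matters so much for long shared prefixes.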
Results
Model size reduction
AUC drop after pruning + KD
Prefill latency improvement from attention pruning
Generative IQM improvement in A/B
TTFT (p50) H100 FP8 vs FP16 (generative)
Accuracy
Who Should Care
What To Try In 7 Days
Run white-box KD from your best model to a smaller student and compare AUC.
Profile inference to find attention or MLP bottlenecks and target them for structured pruning.
Benchmark FP8 on H100 and INT8/INT4 on available GPUs; measure accuracy with task-specific calibration data.
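For the first step above, comparing teacher and student AUC needs nothing more than labels and scores. The following is a minimal sketch of pairwise AUC with ties counted as half, assuming binary labels; the scores are invented for illustration, and the quadratic loop is meant for small evaluation samples rather than production-scale logs.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical evaluation sample: 3 positives, 3 negatives.
labels = [1, 0, 1, 0, 1, 0]
teacher_scores = [0.9, 0.2, 0.8, 0.3, 0.7, 0.4]
student_scores = [0.9, 0.2, 0.6, 0.65, 0.7, 0.4]

teacher_auc = auc(labels, teacher_scores)
student_auc = auc(labels, student_scores)
auc_drop = teacher_auc - student_auc
```

Tracking `auc_drop` across student sizes gives a quick view of how much quality each compression step costs on your own data.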
Optimization Features
Token Efficiency
- prefill optimization via prefix caching
Infra Optimization
- ZeRO/ZeRO++ for distributed training
- benchmarking across H100 and A100 for quantization tradeoffs
Model Optimization
- knowledge distillation (white-box, forward-KL, on-policy)
- structured pruning (OSSCAR; MLP and attention head removal)
- quantization (FP8, W8A8, W4A16 with QuantEase/GPTQ)
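To make "structured pruning" concrete, the sketch below removes whole hidden neurons from a two-layer MLP so that the remaining matrices stay dense and smaller. It uses a simple weight-norm importance score as a stand-in; this is not the OSSCAR algorithm, which selects units by solving an optimization problem, and all names and weights here are hypothetical.

```python
import math

def prune_mlp_neurons(w_up, w_down, keep):
    """Keep the `keep` most important hidden neurons of an MLP.

    w_up   -- hidden x d_in matrix (one row per hidden neuron)
    w_down -- d_out x hidden matrix (one column per hidden neuron)
    Importance (an assumption, not OSSCAR): ||row of w_up|| * ||column of w_down||.
    """
    hidden = len(w_up)

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    scores = []
    for j in range(hidden):
        column = [row[j] for row in w_down]
        scores.append(norm(w_up[j]) * norm(column))

    # Pick the top-scoring neurons, then restore their original order.
    top = sorted(range(hidden), key=lambda j: scores[j], reverse=True)[:keep]
    kept = sorted(top)

    new_up = [w_up[j] for j in kept]                         # drop rows
    new_down = [[row[j] for j in kept] for row in w_down]    # drop columns
    return new_up, new_down, kept

# Toy MLP: d_in = 2, hidden = 4, d_out = 2 (illustrative weights).
w_up = [[1.0, 0.0], [0.01, 0.02], [0.0, 2.0], [0.5, 0.5]]
w_down = [[1.0, 0.1, 0.5, 0.0],
          [0.0, 0.1, 1.0, 0.01]]

new_up, new_down, kept = prune_mlp_neurons(w_up, w_down, keep=2)
```

Because entire rows and columns disappear, the pruned layer runs on standard dense kernels, which is the property that makes structured pruning attractive for serving.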
System Optimization
- FlashInfer attention kernels
- SGLang radix-tree caching
- Liger Triton kernels for training
Training Optimization
- SFT
- teacher-guided re-distillation after pruning
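One way to picture gradual pruning with re-distillation is as a loop that alternates a small pruning step with a recovery step. The sketch below assumes a linear width schedule and uses placeholder callables for pruning and distillation; the schedule shape, function names, and widths are illustrative assumptions, not the schedule reported in the paper.

```python
def gradual_widths(start_width, target_width, num_stages):
    """Linearly spaced hidden widths ending at target_width (start excluded)."""
    if num_stages < 1:
        raise ValueError("num_stages must be >= 1")
    if not 0 < target_width <= start_width:
        raise ValueError("need 0 < target_width <= start_width")
    step = (start_width - target_width) / num_stages
    return [round(start_width - step * (i + 1)) for i in range(num_stages)]

def run_gradual_pruning(model, widths, prune_fn, distill_fn):
    """Alternate pruning and teacher-guided recovery at each stage."""
    for width in widths:
        model = prune_fn(model, width)   # shrink to the next width
        model = distill_fn(model)        # recover quality before shrinking again
    return model

# Toy stand-ins so the loop can be run end to end (hypothetical).
def toy_prune(model, width):
    return {"width": width, "kd_rounds": model["kd_rounds"]}

def toy_distill(model):
    return {"width": model["width"], "kd_rounds": model["kd_rounds"] + 1}

widths = gradual_widths(4096, 1024, num_stages=4)
final_model = run_gradual_pruning(
    {"width": 4096, "kd_rounds": 0}, widths, toy_prune, toy_distill
)
```

Setting `num_stages=1` reduces this to one-shot pruning followed by a single distillation pass, which makes the two strategies easy to compare in an ablation.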
Inference Optimization
- FP8 serving on H100
- prefix KV caching (Radix caching in SGLang)
- attention pruning to cut prefill latency
- tensor parallelism across GPUs
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments focus only on recommendation and reasoning workloads; results may not generalize to other domains.
- Sparse-attention and unstructured pruning techniques were not integrated due to serving-engine limits.
- Some core models and calibration data are internal, limiting exact reproducibility of all results.
When Not To Use
- If you cannot run teacher-student training or lack the teacher model checkpoints.
- If your workload cannot tolerate even a small accuracy drop, or you require the exact behavior of the original FM.
- If your serving stack does not support FP8 or the attention kernels used (FlashInfer/SGLang).
Failure Modes
- Aggressive one-shot pruning can produce large AUC drops unless followed by KD or gradual pruning.
- INT4 quantization via naive GPTQ can degrade task accuracy substantially without QuantEase-style tuning.
- Long-context workloads (32k tokens) can sharply increase TTFT if KV caching is not effective.
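The INT4 failure mode above is easier to see with a naive baseline. The sketch below implements plain round-to-nearest symmetric INT4 weight quantization with per-group scales (weights only, as in W4A16); it is deliberately not GPTQ or QuantEase, which adjust weights using calibration data, and the weight values and group sizes are hypothetical.

```python
def quantize_int4_groupwise(weights, group_size):
    """Round-to-nearest symmetric INT4 quantization with one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # map max_abs to level 7
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize(q, scales, group_size):
    """Reconstruct approximate float weights from INT4 codes and scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy weight row: small values next to a group containing large outliers.
weights = [0.7, -0.35, 0.1, 0.0, 7.0, -3.5, 1.0, 0.2]

q_group, s_group = quantize_int4_groupwise(weights, group_size=4)
q_tensor, s_tensor = quantize_int4_groupwise(weights, group_size=8)

err_group = mse(weights, dequantize(q_group, s_group, 4))
err_tensor = mse(weights, dequantize(q_tensor, s_tensor, 8))
```

With a single scale for the whole row, the outliers force a coarse grid and the small weights collapse toward zero; smaller groups reduce that error, and calibration-based methods go further by compensating for the rounding.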
Core Entities
Models
- internal FM (Mixtral-like MoE >100B)
- Llama-3.1-8B-Instruct
- Llama-3.2-3B-Instruct
- Qwen-2.5 1.5B student
- Qwen3 4B/8B/32B (used in distillation experiments)
Metrics
- AUC (predictive tasks)
- validation loss (generative/reasoning)
- TTFT (time to first token)
- TPOT (time per output token)
- IQM (internal quality metric)
Datasets
- internal RecSys data (in-domain)
- C4 (calibration)
- OpenThoughts (reasoning)
- AIME 2024/2025 (benchmarks)
- PIQA, ARC easy/challenge (quantization eval)
Benchmarks
- AIME 2024/2025
- PIQA
- ARC easy/challenge

