Train a small LLM for next-item recommendation that matches large LLMs while using ~13% of their parameters and running 6–8× faster.

May 28, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

3

Authors

Wujiang Xu, Qitian Wu, Zujie Liang, Jiaojiao Han, Xuying Ning, Yunxiao Shi, Wenfang Lin, Yongfeng Zhang

Why It Matters For Business

You can shrink an LLM-based recommender to roughly 13–14% of its original parameter count and make training about 6.6× and inference about 8× faster, while keeping or slightly improving ranking quality. That lowers hardware cost and raises serving throughput.

Summary TLDR

SLMRec uses layer-wise feature distillation to train a much smaller language-model-based sequential recommender. On large Amazon18 splits it matches or slightly beats larger LLM-based baselines while using ~13% of their parameters and achieving ~6.6× faster training and ~8.0× faster inference. The method is simple, compatible with pruning/quantization, and backed by a short theoretical argument that multiple transformer layers can be compressed into fewer effective steps.

Problem Statement

LLM-based recommenders improve ranking quality but are too large and slow for industrial deployment. It is unclear how many layers and how many parameters an LLM actually needs for sequential recommendation, and whether a much smaller model can retain the gains.

Main Contribution

Empirical finding that many intermediate LLM layers are redundant for sequential recommendation; shallower models can match deeper ones on industry-scale data.

SLMRec: a simple layer-wise feature distillation recipe (cosine, L2 norm, and supervised adapter losses) to train small student LLMs from larger teacher LLMs.
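The three loss terms named above can be sketched for aligned teacher/student hidden states as follows. This is a minimal illustration in plain Python, not the authors' code: the function names, the weights `w_cos`, `w_l2`, `w_sup`, and the reading of the "L2 norm" term as a squared L2 distance between the two vectors are all assumptions, and the supervised adapter loss is passed in as an already-computed scalar per block.

```python
import math

def cosine_distance(student, teacher):
    """1 - cosine similarity between two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t + 1e-12)  # epsilon avoids division by zero

def squared_l2(student, teacher):
    """Squared L2 distance (assumed form of the norm regularizer)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher))

def distillation_loss(student_blocks, teacher_blocks, adapter_losses,
                      w_cos=1.0, w_l2=1.0, w_sup=1.0):
    """Average the three terms over aligned (student, teacher) block outputs.

    adapter_losses[i] is a hypothetical precomputed supervised loss
    (e.g. next-item cross-entropy) from an adapter on student block i.
    """
    total = 0.0
    for s, t, sup in zip(student_blocks, teacher_blocks, adapter_losses):
        total += w_cos * cosine_distance(s, t) + w_l2 * squared_l2(s, t) + w_sup * sup
    return total / len(student_blocks)

# Toy example: one aligned block with orthogonal 2-d hidden states.
loss = distillation_loss([[1.0, 0.0]], [[0.0, 1.0]], [0.5])
```

In a real setup the vectors would be batched tensors from the chosen teacher and student layers, and the weights would be tuned on a validation split.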

Practical results showing SLMRec uses ~13% of LLMRec parameters and achieves up to 6.6× training and 8.0× inference speedups while maintaining or improving ranking metrics.

A concise theoretical argument that multi-layer attention propagation can be approximated by fewer effective steps, motivating layer compression and distillation.

Key Findings

Many transformer decoder layers are redundant for sequential recommendation.

SLMRec matches or slightly outperforms larger LLM-based recommenders while using far fewer parameters.

Numbers: inference params 0.944B vs 6.631B (≈14%); training params 0.003B vs 0.023B (≈13%)

SLMRec greatly reduces runtime compared to LLM baselines.

Numbers: training 6.6× faster; inference 8.0× faster vs E4SRec

Layer-wise feature distillation and three regularizers improve student performance.

Numbers: Cloth MRR 18.17 → 18.74 after adding all regularizers (+0.57 absolute)

Results

Parameter reduction (inference)

Value: 0.944B vs 6.631B (≈14%)

Baseline: E4SRec

Parameter reduction (training)

Value: 0.003B vs 0.023B (≈13%)

Baseline: E4SRec

Training speedup

Value: 6.6× faster

Baseline: E4SRec

Inference speedup

Value: 8.0× faster

Baseline: E4SRec

Recommendation quality example (Movie MRR)

Value: 20.36%

Baseline: E4SRec (Movie MRR 19.74%)

Ablation: feature regularizers effect (Cloth MRR)

Value: 18.74%

Baseline: SLMRec without all regularizers (18.17%)
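The percentages and gains in the results above follow directly from the reported raw numbers. A short check, using only the figures quoted in this summary (the helper name `ratio_percent` is just for illustration):

```python
def ratio_percent(small, large):
    """Size of `small` as a percentage of `large`."""
    return 100.0 * small / large

# Parameter counts in billions, as reported against E4SRec.
inference_pct = ratio_percent(0.944, 6.631)
training_pct = ratio_percent(0.003, 0.023)

# Absolute MRR differences in percentage points.
movie_mrr_gain = 20.36 - 19.74
cloth_mrr_gain = 18.74 - 18.17

print(f"inference params: {inference_pct:.1f}% of baseline")  # 14.2%
print(f"training params:  {training_pct:.1f}% of baseline")   # 13.0%
print(f"Movie MRR gain:   {movie_mrr_gain:.2f} points")       # 0.62
print(f"Cloth MRR gain:   {cloth_mrr_gain:.2f} points")       # 0.57
```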

What To Try In 7 Days

Run a depth-pruning probe on your LLM-based recommender: compare retaining 4–8 decoder layers vs full model.

Implement layer-wise feature distillation (cosine + L2 + small supervised adapter loss) from your current model into a smaller student.

Combine distillation with LoRA and a quantized or pruned student to measure end-to-end memory and latency savings on a dev dataset.
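The depth-pruning probe from the first step can be organized as a small sweep. The sketch below is framework-free and rests on assumptions: the model is represented as a plain list of layer callables, pruning keeps the first `keep` layers, and `evaluate` is a hypothetical callback that would return your ranking metric (for example MRR on a dev split) for a given layer stack.

```python
def truncate(layers, keep):
    """Depth pruning: keep only the first `keep` decoder layers."""
    if not 1 <= keep <= len(layers):
        raise ValueError(f"keep must be in [1, {len(layers)}], got {keep}")
    return layers[:keep]

def run_stack(layers, hidden):
    """Apply the layers in order to a hidden state."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden

def depth_probe(layers, evaluate, depths=(4, 8)):
    """Return {depth: metric} for each probed depth plus the full model."""
    results = {}
    for depth in sorted(set(depths) | {len(layers)}):
        results[depth] = evaluate(truncate(layers, depth))
    return results

# Toy stand-in: 32 "layers" that each add 1, and an evaluate() that just
# runs the stack. Replace both with your real model and dev-set metric.
toy_layers = [lambda h: h + 1 for _ in range(32)]
results = depth_probe(toy_layers, evaluate=lambda stack: run_stack(stack, 0))
```

If the metric at 4 or 8 layers is close to the full-depth value, the shallow model is a reasonable student to start distilling into.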

Optimization Features

Infra Optimization

  • Reported A100 runtime numbers for realistic comparison

Model Optimization

  • Layer-wise distillation
  • Depth pruning

System Optimization

  • LoRA

Training Optimization

  • Offline knowledge distillation
  • Online KD (explored experimentally)

Inference Optimization

  • Smaller student model for faster scoring
  • Compatible with pruning and quantization

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations are on Amazon18 splits; behavior on other domains or live traffic is untested in this paper.
  • Model cannot do few-shot adaptation; authors state full retraining is required for new datasets.
  • Generative LLM ranking (Open-P5) and long candidate lists remain impractical despite training-time feasibility.

When Not To Use

  • If you need few-shot adaptation and prompt-based transfer without retraining.
  • When your application requires full generative ranking over very large candidate pools (generation methods are slow).
  • If the teacher model encodes task knowledge that cannot be captured by representation matching alone.

Failure Modes

  • Student fails to match teacher when teacher representations encode non-transferable or highly task-specific patterns.
  • Domain shift: distillation on one dataset may not transfer to a new item/user distribution without retraining.
  • Ranking large candidate sets with generative LLMs becomes extremely slow (inference times on the order of hours are reported).

Core Entities

Models

  • SLMRec
  • E4SRec
  • Open-P5
  • LLaMA-7B
  • SASRec (baseline)
  • E4SRec 4
  • E4SRec 8
  • E4SRec 32

Metrics

  • Training time (hours)
  • Inference time (hours)
  • Parameter count (B)
  • Relative speedup

Datasets

  • Amazon18 (Cloth, Movie, Music, Sport)

Benchmarks

  • Hit Rate (HR@1, HR@5)
  • NDCG@5
  • MRR
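For next-item recommendation with a single ground-truth item per test sequence, the ranking metrics listed above reduce to simple functions of that item's rank. A minimal sketch, assuming 1-based ranks and one relevant item per user (so the ideal DCG is 1); the paper's exact evaluation protocol, such as the candidate set, may differ:

```python
import math

def hit_rate_at_k(ranks, k):
    """Fraction of users whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """NDCG@k with one relevant item: gain 1/log2(rank + 1) if rank <= k."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank of the target item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Ranks of the true next item for four hypothetical users.
ranks = [1, 3, 10, 2]
print(hit_rate_at_k(ranks, 1))  # 0.25
print(hit_rate_at_k(ranks, 5))  # 0.75
```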