Overview
The method shows practical compression and runtime gains on three public datasets using a real LLaMA-7B fine-tuning pipeline, but results are limited to Amazon subsets and a single GPU/CPU testbed.
Citations: 0
Evidence Strength: 0.75
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
OD-LLM halves LLM memory while preserving ranking quality and running faster at inference than the quantization and pruning baselines, which makes on-device personalization feasible for latency- and privacy-sensitive apps.
Summary TLDR
This paper introduces OD-LLM, a three-step compression pipeline for running LLM-based sequential recommenders on edge devices. It applies token covariance normalization (via Cholesky decomposition), low-rank SVD truncation, and a progressive layer-wise alignment update. On three Amazon datasets, a fine-tuned LLaMA-7B recommender compressed by 50% matches or improves on the ranking metrics of the uncompressed LC-Rec baseline while running substantially faster than quantization and pruning baselines.
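To make the "50% compressed" figure concrete, the sketch below works out what rank a low-rank factorization can keep at a given budget. It assumes the compression ratio means the fraction of a layer's parameters retained after replacing a `d_out x d_in` weight with two factors of shapes `d_out x k` and `k x d_in`; this summary does not state the paper's exact definition, and `rank_for_keep_ratio` is a hypothetical helper name. The layer shapes are the standard LLaMA-7B hidden size (4096) and MLP intermediate size (11008).

```python
import math


def rank_for_keep_ratio(d_out: int, d_in: int, keep_ratio: float) -> int:
    """Largest rank k whose factors fit the parameter budget.

    Assumption: a factorized layer stores k * (d_out + d_in) parameters,
    and the budget is keep_ratio * d_out * d_in.
    """
    budget = keep_ratio * d_out * d_in
    return math.floor(budget / (d_out + d_in))


if __name__ == "__main__":
    # Square attention projection in LLaMA-7B (4096 x 4096).
    print(rank_for_keep_ratio(4096, 4096, 0.5))   # 1024
    # MLP projection in LLaMA-7B (11008 x 4096).
    print(rank_for_keep_ratio(11008, 4096, 0.5))  # 1492
```

Under this assumption, halving a square 4096 x 4096 projection leaves only a quarter of its full rank, which is why the choice of which directions to keep matters so much.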
Problem Statement
Large LLMs improve sequential recommendation but are too big and slow for on-device use. Existing compression methods target general NLP and can break the fine-grained temporal and behavioral signals needed for recommendation. The paper seeks a compression approach that preserves recommendation quality while cutting memory and improving runtime for edge deployment.
Main Contribution
OD-LLM: a task-adaptive compression pipeline combining token covariance normalization, SVD-based low-rank factorization, and progressive layerwise alignment updates.
A token-level Cholesky-based normalization step that decorrelates and rescales activations to make SVD truncation more stable and information-preserving.
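The first two steps can be sketched in NumPy as follows. This is a minimal illustration of activation-whitened SVD truncation, not the authors' implementation: the exact formulation is not given in this summary, so the function name `whitened_lowrank`, the ridge term `eps`, and the choice to whiten with the Cholesky factor of the calibration Gram matrix are all assumptions. The progressive layer-wise alignment update is not shown.

```python
import numpy as np


def whitened_lowrank(W, X, rank, eps=1e-6):
    """Factor W (d_out x d_in) into A (d_out x rank) @ B (rank x d_in).

    X holds calibration activations, shape (d_in, n_tokens).
    Assumed formulation: whiten with the Cholesky factor of X @ X.T,
    truncate the SVD of the whitened weight, then undo the whitening.
    """
    d_in = W.shape[1]
    # Token covariance (Gram matrix); a small ridge keeps it positive definite.
    C = X @ X.T + eps * np.eye(d_in)
    S = np.linalg.cholesky(C)  # lower-triangular, C = S @ S.T
    # SVD of the whitened weight; its singular values reflect output error.
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]
    # B = Vt_k @ inv(S), computed with a linear solve instead of an inverse.
    B = np.linalg.solve(S.T, Vt[:rank].T).T
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 6))
    # Strongly anisotropic activations, so whitening changes which directions are kept.
    X = np.diag([5.0, 3.0, 1.0, 0.5, 0.2, 0.1]) @ rng.standard_normal((6, 64))
    A, B = whitened_lowrank(W, X, rank=3)
    print("output error:", np.linalg.norm(W @ X - A @ B @ X))
```

The point of the whitening is that truncation then targets the error in the layer's outputs on calibration tokens rather than the error in the raw weights, so directions that activations rarely excite are the first to be dropped.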
Key Findings
50% compressed OD-LLM roughly matches or exceeds uncompressed LC-Rec on ranking metrics: HR@5 is 0.0993 vs 0.0997 on Instruments and 0.1173 vs 0.1007 on Arts (Table 2).
OD-LLM runs substantially faster at inference than GPTQ quantization and SparseGPT pruning in these experiments.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HR@5 (Instruments) | 0.0993 (OD-LLM) | 0.0997 (LC-Rec) | -0.0004 | Instruments | Table 2: OD-LLM vs LC-Rec | Table 2 |
| HR@5 (Arts) | 0.1173 (OD-LLM) | 0.1007 (LC-Rec) | +0.0166 | Arts | Table 2: OD-LLM outperforms LC-Rec on Arts | Table 2 |
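The deltas in the table can be checked, and put in relative terms, with a few lines of Python. The values are copied from the two rows above; the function name `hr_deltas` is only illustrative.

```python
def hr_deltas(rows):
    """Return (dataset, absolute delta, relative delta in %) per row.

    Each row is (dataset, OD-LLM HR@5, LC-Rec HR@5).
    """
    out = []
    for dataset, od_llm, lc_rec in rows:
        delta = od_llm - lc_rec
        out.append((dataset, delta, 100.0 * delta / lc_rec))
    return out


if __name__ == "__main__":
    table = [("Instruments", 0.0993, 0.0997), ("Arts", 0.1173, 0.1007)]
    for dataset, delta, rel in hr_deltas(table):
        print(f"{dataset}: delta={delta:+.4f}, relative={rel:+.1f}%")
    # Instruments: delta=-0.0004, relative=-0.4%
    # Arts: delta=+0.0166, relative=+16.5%
```

In relative terms, the Instruments result is a 0.4% drop and the Arts result a 16.5% gain over the uncompressed baseline.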
What To Try In 7 Days
Fine-tune a small LLaMA (or an existing LC-Rec model) on your dataset and run SVD-based compression with token covariance normalization at a compression ratio (CR) of 0.5.
Calibrate using ~256 example sequences and measure HR/NDCG and end-to-end latency on your target device.
Compare OD-LLM inference speed and accuracy to a 4-bit quantized baseline (GPTQ) to pick the better trade-off for your product.
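For the measurement steps above, a minimal evaluation harness might look like the sketch below. It assumes the usual next-item setup with exactly one held-out target per sequence (so the ideal DCG is 1) and uses the standard definitions of HR@K and NDCG@K; the paper's exact evaluation protocol is not described in this summary. All function names are hypothetical, and `fn` stands in for whatever callable runs one inference on your target device.

```python
import math
import statistics
import time


def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of sequences whose target item appears in the top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


def ndcg_at_k(ranked_lists, targets, k):
    """NDCG@k with a single relevant item per sequence (ideal DCG = 1)."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:k]
        if t in top:
            # Position 0 gets gain 1 / log2(2) = 1, position 1 gets 1 / log2(3), ...
            total += 1.0 / math.log2(top.index(t) + 2)
    return total / len(targets)


def median_latency_ms(fn, n_runs=20, warmup=3):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    ranked = [["a", "b", "c"], ["d", "e", "f"]]
    targets = ["b", "z"]
    print(hit_rate_at_k(ranked, targets, k=2))  # 0.5
    print(ndcg_at_k(ranked, targets, k=2))
    print(median_latency_ms(lambda: sum(range(10_000))))
```

Running the same harness on the compressed model and on the 4-bit baseline, with identical candidate sets and hardware, keeps the accuracy and latency comparison like for like.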
Risks & Boundaries
Limitations
Evaluations use three Amazon datasets only; results may not generalize to other domains.
Experiments run on a single A40 GPU; no on-device mobile/edge hardware benchmarks provided.
When Not To Use
When you need extreme compression (<20% model size); aggressive CRs degrade accuracy.
If you cannot collect a modest calibration set (≈256 example sequences), since calibration affects SVD quality.
Failure Modes
Small calibration sets cause larger accuracy drops.
One-shot SVD without normalization can remove critical sequential signals.

