OD-LLM: SVD + token normalization to run LLM recommenders on-device at half size

January 14, 2026 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

The method shows practical compression and runtime gains on three public datasets using a real LLaMA-7B fine-tuning pipeline, but results are limited to Amazon subsets and a single GPU/CPU testbed.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xin Xia, Hongzhi Yin, Shane Culpepper

Why It Matters For Business

OD-LLM cuts the size of an LLM recommender roughly in half while keeping ranking quality and running faster than the quantization and pruning baselines tested, which makes on-device personalization more feasible for latency- and privacy-sensitive apps.

Summary TLDR

This paper introduces OD-LLM, a three-step compression pipeline to run LLM-based sequential recommenders on edge devices. It applies token covariance normalization (Cholesky), low-rank SVD truncation, and a progressive layer-wise alignment update. On three Amazon datasets, a 50% compressed LLaMA-7B fine-tuned recommender matches or improves ranking metrics vs the uncompressed LC-Rec baseline while running substantially faster than quantization/pruning baselines.

Problem Statement

LLMs improve sequential recommendation but are too large and slow for on-device use. Existing compression methods target general NLP and can break the fine-grained temporal and behavioral signals that recommendation depends on. The paper seeks a compression approach that preserves recommendation quality while cutting memory and improving runtime for edge deployment.

Main Contribution

OD-LLM: a task-adaptive compression pipeline combining token covariance normalization, SVD-based low-rank factorization, and progressive layerwise alignment updates.

A token-level Cholesky-based normalization step that decorrelates and rescales activations to make SVD truncation more stable and information-preserving.
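The normalization-then-truncation idea can be sketched for a single linear layer as below. This is a minimal NumPy illustration, not the authors' implementation: the function name `whitened_svd_compress`, the ridge term `eps`, and the choice to whiten with the Cholesky factor of the calibration Gram matrix are all assumptions made for the example.

```python
import numpy as np

def whitened_svd_compress(W, X, rank, eps=1e-6):
    """Low-rank factorization of W guided by calibration activations.

    W: (d_out, d_in) weight matrix.
    X: (d_in, n_tokens) calibration inputs to the layer (hypothetical data).
    Returns A (d_out, rank) and B (rank, d_in) with A @ B approximating W.
    """
    d_in = W.shape[1]
    # Token Gram/covariance matrix of the inputs; the small ridge keeps it positive definite.
    cov = X @ X.T + eps * np.eye(d_in)
    # Cholesky factor S with cov = S @ S.T.
    S = np.linalg.cholesky(cov)
    # SVD in the whitened coordinates: truncating here targets the error of
    # the layer outputs W @ X rather than the error of the raw weights.
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]
    # Undo the whitening on the right factor: B = Vt_k @ inv(S), via a linear solve.
    B = np.linalg.solve(S.T, Vt[:rank].T).T
    return A, B

# Usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
X = rng.standard_normal((12, 12)) @ rng.standard_normal((12, 200))
A, B = whitened_svd_compress(W, X, rank=4)
print(np.linalg.norm(W @ X - A @ B @ X))
```

The reason whitening helps is that, when `X @ X.T = S @ S.T`, the output error `||W X - W' X||` equals `||(W - W') S||` in Frobenius norm, so the best rank-k truncation of `W @ S` is also the best rank-k choice for the layer outputs on the calibration data.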

Key Findings

At 50% compression, OD-LLM roughly matches (Instruments) or exceeds (Arts) the uncompressed LC-Rec baseline on HR@5.

Numbers: Instruments HR@5: OD-LLM 0.0993 vs LC-Rec 0.0997; Arts HR@5: OD-LLM 0.1173 vs LC-Rec 0.1007

Practical Use: You can halve deployed model size with SVD+normalization and still keep or improve recommendation accuracy on similar datasets; try 0.5 compression first.

Evidence Ref: Table 2
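To make the 0.5 compression setting concrete, the sketch below converts a target parameter budget into an SVD rank for one weight matrix. It assumes that the ratio means "fraction of parameters kept" and that the same ratio is applied to every matrix; neither is confirmed by this summary. The helper name `rank_for_ratio` is hypothetical. The dimensions used are the standard LLaMA-7B hidden size (4096) and MLP size (11008).

```python
def rank_for_ratio(d_out, d_in, keep_ratio):
    """Largest rank k such that factors (d_out x k) and (k x d_in)
    hold at most keep_ratio of the dense d_out x d_in parameters."""
    dense_params = d_out * d_in
    return int(keep_ratio * dense_params // (d_out + d_in))

# Attention projection in LLaMA-7B: 4096 x 4096
print(rank_for_ratio(4096, 4096, 0.5))   # 1024
# MLP projection in LLaMA-7B: 11008 x 4096
print(rank_for_ratio(11008, 4096, 0.5))  # 1492
```

For a square matrix, keeping half the parameters means keeping only a quarter of the full rank, which is why preserving the right directions matters so much at this ratio.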

OD-LLM runs substantially faster at inference than GPTQ quantization and SparseGPT pruning in these experiments.

Numbers: GPU: OD-LLM 5s/batch vs GPTQ 17s and SparseGPT 12s; CPU: OD-LLM 200s vs GPTQ 700s

Practical Use: If latency matters on edge hardware, OD-LLM can reduce inference time versus some quantization/pruning setups; measure runtime on your target device.

Evidence Ref: Table 4
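The reported timings translate into relative speedups as follows. The numbers are the ones quoted above; the dictionary layout and the helper name `speedups` are only for illustration.

```python
# Seconds as quoted in this summary (Table 4 of the paper).
timings = {
    "GPU": {"OD-LLM": 5, "GPTQ": 17, "SparseGPT": 12},
    "CPU": {"OD-LLM": 200, "GPTQ": 700},
}

def speedups(timings, reference="OD-LLM"):
    """Ratio of each baseline's time to the reference method's time, per device."""
    result = {}
    for device, by_method in timings.items():
        ref_time = by_method[reference]
        result[device] = {
            method: round(seconds / ref_time, 2)
            for method, seconds in by_method.items()
            if method != reference
        }
    return result

print(speedups(timings))
# {'GPU': {'GPTQ': 3.4, 'SparseGPT': 2.4}, 'CPU': {'GPTQ': 3.5}}
```

These ratios compare OD-LLM with other compressed models on one testbed; they say nothing about absolute latency on a phone or embedded device.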

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HR@5 (Instruments) | 0.0993 (OD-LLM) | 0.0997 (LC-Rec) | -0.0004 | Instruments | Table 2: OD-LLM vs LC-Rec | Table 2
HR@5 (Arts) | 0.1173 (OD-LLM) | 0.1007 (LC-Rec) | +0.0166 | Arts | Table 2: OD-LLM outperforms LC-Rec on Arts | Table 2
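For readers reproducing these metrics, HR@K and NDCG@K can be computed as in the sketch below. It assumes a leave-one-out protocol with a single held-out target item per user, which is common in sequential recommendation but is not confirmed here as the paper's exact setup. The function names and toy data are hypothetical.

```python
import math

def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of users whose held-out item appears in their top-k list."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

def ndcg_at_k(ranked_lists, targets, k):
    """NDCG@k with one relevant item per user, so the ideal DCG is 1."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            position = top_k.index(target)       # 0-based rank of the hit
            total += 1.0 / math.log2(position + 2)
    return total / len(targets)

# Toy example: the first user's target is ranked second, the second user's is missing.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
held_out = ["b", "q"]
print(hit_rate_at_k(ranked, held_out, k=2))  # 0.5
print(ndcg_at_k(ranked, held_out, k=2))
```

Evaluating the compressed and uncompressed models with the same function and the same candidate set keeps deltas like the ones in the table comparable.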

What To Try In 7 Days

Fine-tune a small LLaMA (or existing LC-Rec) on your dataset and run SVD-based compression with token covariance normalization at CR=0.5.

Calibrate using ~256 example sequences and measure HR/NDCG and end-to-end latency on your target device.

Compare OD-LLM inference speed and accuracy to a 4-bit quantized baseline (GPTQ) to pick the better trade-off for your product.
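A small timing harness is enough for the latency comparison in the steps above. The sketch below is framework-agnostic: `run_batch` stands in for whatever callable executes one inference batch on the target device, and both it and the `time_batches` helper are hypothetical names.

```python
import statistics
import time

def time_batches(run_batch, batches, warmup=1):
    """Time run_batch on each batch after a few untimed warm-up calls.

    run_batch: callable taking one batch (placeholder for real model inference).
    batches: list of batches; the first `warmup` entries are run but not timed.
    """
    for batch in batches[:warmup]:
        run_batch(batch)
    samples = []
    for batch in batches[warmup:]:
        start = time.perf_counter()
        run_batch(batch)
        samples.append(time.perf_counter() - start)
    return {
        "n": len(samples),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
    }

# Usage with a stand-in workload; replace the lambda with real inference.
stats = time_batches(lambda batch: sum(batch), [[1, 2, 3]] * 5, warmup=1)
print(stats)
```

On a GPU, the callable should include whatever synchronization the framework needs so that asynchronous kernel launches are not mistaken for finished work; run the same batches through each model variant being compared.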

Optimization Features

Token Efficiency: token covariance normalization (Cholesky) to decorrelate activations
Infra Optimization: single A40 GPU compression and evaluation; CPU inference measured
Model Optimization: low-rank SVD truncation; progressive layerwise alignment updates
System Optimization: one-shot SVD with layer-wise normalization or progressive updates
Training Optimization: fine-tune LLaMA-7B with LC-Rec instruction tuning; joint training of RQ-VAE index and LLM alignment module
Inference Optimization: reduced parameter count (50% CR); faster runtime implementation vs GPTQ/SparseGPT
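This summary does not spell out the update rule behind the layerwise alignment step. One plausible reading, shown purely as an assumption, is a least-squares refit: after truncation, keep the right factor fixed and re-solve the left factor so the compressed layer reproduces the dense layer's outputs on calibration data. The helper name `refit_left_factor` is hypothetical.

```python
import numpy as np

def refit_left_factor(W, X, B):
    """Re-solve A to minimize ||W @ X - A @ B @ X||_F with B held fixed.

    W: (d_out, d_in) dense weight; X: (d_in, n) calibration inputs;
    B: (k, d_in) right low-rank factor. Returns A with shape (d_out, k).
    """
    Y = W @ X  # outputs of the dense layer
    Z = B @ X  # low-dimensional activations of the compressed layer
    # Solve Z.T @ A.T = Y.T in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(Z.T, Y.T, rcond=None)
    return A_T.T

# Usage: start from a plain rank-4 SVD truncation, then refit on calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
X = rng.standard_normal((12, 12)) @ rng.standard_normal((12, 200))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A0, B0 = U[:, :4] * s[:4], Vt[:4]
A1 = refit_left_factor(W, X, B0)
print(np.linalg.norm(W @ X - A0 @ B0 @ X), np.linalg.norm(W @ X - A1 @ B0 @ X))
```

In a progressive variant, `X` for each layer would be taken from the already-compressed earlier layers, so later layers can compensate for upstream error. Whether OD-LLM does exactly this is not stated here.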

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluations use three Amazon datasets only; results may not generalize to other domains.

Experiments run on a single A40 GPU; no on-device mobile/edge hardware benchmarks provided.

When Not To Use

When you need extreme compression (<20% model size); aggressive CRs degrade accuracy.

When you cannot collect a modest calibration set (≈256 sequences), since calibration data affects SVD quality.

Failure Modes

Small calibration sets cause larger accuracy drops.

One-shot SVD without normalization can remove critical sequential signals.

Core Entities

Models

LLaMA-7B, LC-Rec, OD-LLM, GPTQ, SparseGPT

Metrics

HR@5, HR@10, NDCG@5, NDCG@10

Datasets

Instruments (Amazon), Games (Amazon), Arts (Amazon)