OD-LLM: SVD + token normalization to run LLM recommenders on-device at half size

Overview

Decision Snapshot: Ready For Pilot

The method shows practical compression and runtime gains on three public datasets using a real LLaMA-7B fine-tuning pipeline, but results are limited to Amazon subsets and a single GPU/CPU testbed.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Xin Xia, Hongzhi Yin, Shane Culpepper

Why It Matters For Business

OD-LLM trims LLM memory in half while keeping ranking quality and cutting inference time, making on-device personalization feasible for latency- and privacy-sensitive apps.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces OD-LLM, a three-step compression pipeline to run LLM-based sequential recommenders on edge devices. It applies token covariance normalization (Cholesky), low-rank SVD truncation, and a progressive layer-wise alignment update. On three Amazon datasets, a 50% compressed LLaMA-7B fine-tuned recommender matches or improves ranking metrics vs the uncompressed LC-Rec baseline while running substantially faster than quantization/pruning baselines.

Problem Statement

Large LLMs improve sequential recommendation but are too big and slow for on-device use. Existing compression methods target general NLP and can break the fine-grained temporal and behavioral signals needed for recommendation. The paper seeks a compression approach that preserves recommendation quality while cutting memory and improving runtime for edge deployment.

Main Contribution

OD-LLM: a task-adaptive compression pipeline combining token covariance normalization, SVD-based low-rank factorization, and progressive layerwise alignment updates.

A token-level Cholesky-based normalization step that decorrelates and rescales activations to make SVD truncation more stable and information-preserving.
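The normalize-then-truncate idea can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the paper's confirmed formulation: it assumes the normalization factor is the Cholesky factor of the calibration-token covariance and that the truncated SVD is taken on the weight multiplied by that factor. The function names, the ridge term `eps`, and the toy shapes are illustrative only, and the progressive layerwise alignment step is not shown.

```python
import numpy as np

def whitened_lowrank(W, X, rank, eps=1e-6):
    """Factor W (d_out x d_in) into A @ B using calibration inputs X (d_in x n_tokens)."""
    d_in = W.shape[1]
    # Token covariance (unnormalized) plus a small ridge so Cholesky succeeds.
    C = X @ X.T + eps * np.eye(d_in)
    S = np.linalg.cholesky(C)  # lower triangular, C = S @ S.T
    # SVD in the normalized coordinates, then keep the top `rank` components.
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]            # d_out x rank
    B = np.linalg.solve(S.T, Vt[:rank].T).T   # rank x d_in, equals Vt[:rank] @ inv(S)
    return A, B

def plain_lowrank(W, rank):
    """Baseline: truncate the SVD of W directly, ignoring the activations."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * sigma[:rank], Vt[:rank]

rng = np.random.default_rng(0)
d_out, d_in, n_tokens, rank = 12, 16, 200, 4
W = rng.standard_normal((d_out, d_in))
# Correlated, unevenly scaled activations (toy stand-in for calibration data).
mix = rng.standard_normal((d_in, d_in)) * np.linspace(0.1, 3.0, d_in)
X = mix @ rng.standard_normal((d_in, n_tokens))

A, B = whitened_lowrank(W, X, rank)
A0, B0 = plain_lowrank(W, rank)
# Output-space reconstruction error on the calibration tokens.
err_whitened = np.linalg.norm(W @ X - A @ (B @ X))
err_plain = np.linalg.norm(W @ X - A0 @ (B0 @ X))
print(err_whitened, err_plain)
```

In this toy setup the normalized variant minimizes the error of the layer's outputs on the calibration tokens rather than the error of the weights themselves, which is why it should not do worse than plain truncation on that measure.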

Key Findings

50% compressed OD-LLM matches or exceeds uncompressed LC-Rec ranking metrics on evaluated datasets.

Numbers: Instruments HR@5: OD-LLM 0.0993 vs LC-Rec 0.0997; Arts HR@5: OD-LLM 0.1173 vs LC-Rec 0.1007

Practical Use: You can halve deployed model size with SVD+normalization and still keep or improve recommendation accuracy on similar datasets; try 0.5 compression first.

Evidence Ref: Table 2
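To make "0.5 compression" concrete for a single weight matrix, the arithmetic below shows which rank a low-rank factorization can afford. It is a sketch that assumes CR means the fraction of parameters removed and that each dense `d_out x d_in` matrix is replaced by two factors of shapes `d_out x r` and `r x d_in`; the helper name is hypothetical. The shapes used are the standard LLaMA-7B hidden size (4096) and MLP width (11008).

```python
def rank_for_compression(d_out, d_in, cr):
    """Largest rank r whose two factors fit in the kept parameter budget."""
    dense_params = d_out * d_in
    budget = (1.0 - cr) * dense_params
    # A rank-r factorization stores r * (d_out + d_in) parameters.
    return int(budget // (d_out + d_in))

for d_out, d_in in [(4096, 4096), (11008, 4096)]:
    r = rank_for_compression(d_out, d_in, cr=0.5)
    print(d_out, d_in, r, r * (d_out + d_in))
```

For a square 4096 x 4096 projection this gives rank 1024, and for an 11008 x 4096 MLP projection it gives rank 1492.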

OD-LLM runs substantially faster at inference than GPTQ quantization and SparseGPT pruning in these experiments.

Numbers: GPU: OD-LLM 5s/batch vs GPTQ 17s and SparseGPT 12s; CPU: OD-LLM 200s vs GPTQ 700s

Practical Use: If latency matters on edge hardware, OD-LLM can reduce inference time versus some quantization/pruning setups; measure runtime on your target device.

Evidence Ref: Table 4
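The timings above translate into speedup ratios as follows. This is plain arithmetic on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Reported inference times in seconds (from the Numbers line above).
timings = {
    "GPU": {"OD-LLM": 5, "GPTQ": 17, "SparseGPT": 12},
    "CPU": {"OD-LLM": 200, "GPTQ": 700},
}

def speedups(row, ref="OD-LLM"):
    """How many times slower each baseline is than the reference method."""
    return {name: t / row[ref] for name, t in row.items() if name != ref}

for device, row in timings.items():
    print(device, speedups(row))
```

That works out to 3.4x over GPTQ and 2.4x over SparseGPT on GPU, and 3.5x over GPTQ on CPU.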

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HR@5 (Instruments)	0.0993 (OD-LLM)	0.0997 (LC-Rec)	-0.0004	Instruments	Table 2: OD-LLM vs LC-Rec	Table 2
HR@5 (Arts)	0.1173 (OD-LLM)	0.1007 (LC-Rec)	+0.0166	Arts	Table 2: OD-LLM outperforms LC-Rec on Arts	Table 2
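The absolute deltas in the table are easier to judge as relative changes against the LC-Rec baseline. A small sketch of that arithmetic, using only the values in the table (the helper name is illustrative):

```python
# (dataset, OD-LLM HR@5, LC-Rec HR@5) taken from the table above.
rows = [("Instruments", 0.0993, 0.0997), ("Arts", 0.1173, 0.1007)]

def relative_delta_pct(value, baseline):
    """Relative change of value versus baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

for dataset, od_llm, lc_rec in rows:
    print(dataset, round(relative_delta_pct(od_llm, lc_rec), 2))
```

Instruments is roughly 0.4% below the baseline, while Arts is roughly 16.5% above it.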

What To Try In 7 Days

Fine-tune a small LLaMA (or existing LC-Rec) on your dataset and run SVD-based compression with token covariance normalization at CR=0.5.

Calibrate using ~256 example sequences and measure HR/NDCG and end-to-end latency on your target device.

Compare OD-LLM inference speed and accuracy to a 4-bit quantized baseline (GPTQ) to pick the better trade-off for your product.
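For the measurement step, HR@K and NDCG@K can be computed as sketched below. This assumes the common sequential-recommendation setup of one held-out target item per user, which the summary does not confirm as the paper's protocol; the function names and toy data are illustrative.

```python
import math

def hit_and_ndcg_at_k(ranked_items, target, k):
    """Return (hit, ndcg) for one user with a single ground-truth item."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0
    rank = top_k.index(target) + 1  # 1-indexed position of the target
    # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    return 1.0, 1.0 / math.log2(rank + 1)

def evaluate(predictions, targets, k):
    """Average HR@k and NDCG@k over all users."""
    scores = [hit_and_ndcg_at_k(p, t, k) for p, t in zip(predictions, targets)]
    n = len(scores)
    return sum(h for h, _ in scores) / n, sum(g for _, g in scores) / n

# Toy example: three users, each with a ranked list and one held-out item.
predictions = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
targets = ["b", "z", "m"]
hr, ndcg = evaluate(predictions, targets, k=2)
print(hr, ndcg)
```

Running the same function on the uncompressed and compressed models with identical candidate lists keeps the comparison like for like.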

Optimization Features

Token Efficiency

token covariance normalization (Cholesky) to decorrelate activations

Infra Optimization

single A40 GPU compression and evaluation; CPU inference measured

Model Optimization

low-rank SVD truncation

progressive layerwise alignment updates

System Optimization

one-shot SVD with layer-wise normalization or progressive updates

Training Optimization

fine-tune LLaMA-7B with LC-Rec instruction tuning

joint training of RQ-VAE index and LLM alignment module

Inference Optimization

reduced parameter count (50% CR)

faster runtime implementation vs GPTQ/SparseGPT

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

http://jmcauley.ucsd.edu/data/amazon/

Risks & Boundaries

Limitations

Evaluations use three Amazon datasets only; results may not generalize to other domains.

Experiments run on a single A40 GPU; no on-device mobile/edge hardware benchmarks provided.

When Not To Use

When you need extreme compression (<20% model size); aggressive CRs degrade accuracy.

When you cannot collect a modest calibration set (≈256 example sequences), since calibration affects SVD quality.

Failure Modes

Small calibration sets cause larger accuracy drops.

One-shot SVD without normalization can remove critical sequential signals.

Core Entities

Models

LLaMA-7B, LC-Rec, OD-LLM, GPTQ, SparseGPT

Metrics

HR@5, HR@10, NDCG@5, NDCG@10

Datasets

Instruments (Amazon), Games (Amazon), Arts (Amazon)
