Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
4
Why It Matters For Business
A single instruction‑tuned LLM trained on ECInstruct can replace many task‑specific models, improve handling of new products, and reduce engineering cost by centralizing e‑commerce functionality in one adaptable model.
Summary TLDR
The authors release ECInstruct, a large, high-quality instruction dataset for e‑commerce (≈116.5K samples across 10 real tasks) and use it to instruction‑tune general LLMs with LoRA to produce the eCeLLM family. eCeLLM models outperform general LLMs, an existing e‑commerce LLM, and many task-specific baselines in in‑domain tests (avg +10.7%) and out‑of‑domain tests on new products (avg +9.3%). The tuned models generalize to unseen instructions and benefit from more training data; code, data, and models are public at the authors' URL.
Problem Statement
E‑commerce systems need one model family that can handle many interdependent tasks and generalize to new users and new products. Existing e‑commerce models are task‑specific and fail in cold‑start and out‑of‑domain cases. Prior uses of LLMs in e‑commerce are limited by small or synthetic instruction data and by sparse evaluation on real tasks.
Main Contribution
ECInstruct: an open, high‑quality instruction dataset for e‑commerce covering 10 real tasks, diverse instructions (6 per task), IND and OOD splits, and ~116.5K samples.
eCeLLM: a set of instruction‑tuned e‑commerce LLMs (large/medium/small) produced by LoRA fine‑tuning of 6 base models (e.g., Llama‑2, Mistral, Flan‑T5, Phi‑2).
Comprehensive evaluation: IND and OOD tests, unseen‑instruction tests, base‑model comparison, data‑scaling study, and comparisons to GPT‑4, EcomGPT, and SoTA task models.
Practical findings: instruction diversity and larger training sets improve generalization; a single generalist model can match or beat many task‑specific models.
Key Findings
Instruction‑tuned eCeLLM models beat the best baselines on in‑domain tests by an average of 10.7%.
eCeLLM improves generalization to new products (OOD) by an average of 9.3% over best baselines on evaluated tasks.
eCeLLM‑L outperforms GPT‑4 Turbo by a wide margin on these e‑commerce tasks (average ~39.6% improvement across tasks).
Training data size and instruction diversity materially improve performance. For example, sequential recommendation (SR) HR@1 for eCeLLM‑L rose from 0.085 to 0.526 when training size grew from 1K to 92K.
Results
IND average improvement
+10.7%
OOD (new products) average improvement
+9.3%
Average vs GPT‑4 Turbo
~39.6% improvement
Who Should Care
What To Try In 7 Days
Download ECInstruct and run a quick LoRA fine‑tune on a 7B open checkpoint for one product task to measure lift versus current tooling.
Add 4–6 paraphrased instructions per task in your pipeline to test unseen‑instruction robustness.
Run an OOD test (leave one product category out) to estimate real cold‑start gains before full deployment.
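The last two steps can be sketched with the standard library alone. This is a minimal illustration, not the authors' pipeline: the record fields (`category`, `text`), the toy data, the instruction wordings, and both function names are assumptions made for the example.

```python
import random

def leave_one_category_out(records, held_out, category_key="category"):
    """Split records into (in_domain, out_of_domain) by product category."""
    ind = [r for r in records if r[category_key] != held_out]
    ood = [r for r in records if r[category_key] == held_out]
    return ind, ood

def attach_instructions(records, instructions, unseen_index, seed=0):
    """Pair each record with a random 'seen' instruction and reserve one
    paraphrase so it never appears in training (unseen-instruction test)."""
    rng = random.Random(seed)
    seen = [ins for i, ins in enumerate(instructions) if i != unseen_index]
    tagged = [{**r, "instruction": rng.choice(seen)} for r in records]
    return tagged, instructions[unseen_index]

# Hypothetical toy catalog; real data would come from your own product feed.
records = [
    {"category": "Tools", "text": "cordless drill, 18V"},
    {"category": "Tools", "text": "claw hammer"},
    {"category": "Toys", "text": "wooden train set"},
    {"category": "Sports", "text": "yoga mat, 6mm"},
]

# Hypothetical paraphrases of one task instruction.
instructions = [
    "Extract the attribute values from the product title.",
    "List every attribute and its value mentioned in the title.",
    "Identify attribute-value pairs in the following product title.",
    "Given a product title, return its attributes and their values.",
]

ind, ood = leave_one_category_out(records, held_out="Toys")
train, unseen_instruction = attach_instructions(ind, instructions, unseen_index=3)
```

Evaluating once on `ood` with a seen instruction and once on in-domain data with `unseen_instruction` keeps the two generalization axes (new products, new wording) separate.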
Agent Features
Frameworks
- Hugging Face Transformers
- LoRA
Architectures
- instruction-tuned LLM
- multi-task generalist model
- LoRA
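For readers unfamiliar with LoRA, the idea can be sketched in plain Python: the pretrained weight W stays frozen and only a low-rank update B·A is trained, scaled by alpha/r. The rank, alpha, and matrix sizes below are placeholders, since this summary does not report the values the authors used, and the class name is hypothetical.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        self.W = W                      # frozen weight, shape d_out x d_in
        d_out, d_in = len(W), len(W[0])
        self.r, self.alpha = r, alpha
        # A gets a small random init, B starts at zero, so the update is zero at step 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        scale = self.alpha / self.r
        return [b + scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        d_out, d_in = len(self.W), len(self.W[0])
        return self.r * (d_in + d_out)  # versus d_in * d_out for full fine-tuning

# Placeholder sizes for illustration only.
W = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(W, r=2, alpha=4)
x = [1.0, 0.5, -1.0, 2.0, 0.0, 1.5]
y = layer.forward(x)
```

Here the adapter trains 20 parameters instead of the 24 in W; at realistic layer sizes the gap is several orders of magnitude, which is what makes tuning several base models on one dataset affordable.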
Optimization Features
Training Optimization
- LoRA
- cosine LR scheduler with 5% warmup, 3 epochs
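The schedule named above (linear warmup over the first 5% of steps, then cosine decay) can be written as a small standalone function. This is one common formulation under stated assumptions: the peak learning rate and step count are placeholders not given in this summary, and the exact warmup indexing may differ from what a given training library does.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder values: 1000 optimizer steps spread over the 3 epochs.
TOTAL_STEPS = 1000
PEAK_LR = 1e-4
schedule = [lr_at_step(s, TOTAL_STEPS, PEAK_LR) for s in range(TOTAL_STEPS + 1)]
```

With these numbers the rate climbs for 50 steps, peaks at `PEAK_LR`, and decays smoothly to zero at the final step.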
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- ECInstruct currently covers 10 tasks but omits some emerging tasks (e.g., explanation generation) noted by the authors.
- User profiling and personalization are limited because public datasets lack user identifiers and metadata.
- Training sets are downsampled to ≤10K per task for compute reasons; scaling further would require extra compute and validation.
When Not To Use
- When you require heavy per‑user personalization that depends on private user IDs and full histories.
- When your product catalog is multimodal (images/video) and the model needs image understanding (this work is text‑only).
- When strict regulatory or privacy rules forbid sending product/user text to third‑party or public checkpoints.
Failure Modes
- Formatting and output‑parsing failures in structured tasks (the paper reports the number of failed cases for AVE and SR).
- Generative answers may still contain hallucinations; outputs for the AG task were manually checked on a sample.
- Domain bias from Amazon‑centric data and held‑out categories may not reflect other marketplaces.
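Parsing failures of the first kind are easiest to track when the parser is strict and failures are counted rather than silently repaired. The sketch below assumes a hypothetical output format (a JSON list of objects with `attribute` and `value` keys); this summary does not describe the format the paper actually uses, and the function names are illustrative.

```python
import json

def parse_attribute_values(raw_output):
    """Return a list of (attribute, value) pairs, or None if the output is malformed."""
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, list):
        return None
    pairs = []
    for item in data:
        if not (isinstance(item, dict) and "attribute" in item and "value" in item):
            return None
        pairs.append((str(item["attribute"]), str(item["value"])))
    return pairs

def count_failed(outputs):
    """Parse every model output and report how many could not be parsed."""
    parsed = [parse_attribute_values(o) for o in outputs]
    failed = sum(p is None for p in parsed)
    return parsed, failed

# Hypothetical model outputs: one valid, two malformed.
outputs = [
    '[{"attribute": "color", "value": "red"}]',
    "color: red",
    '{"attribute": "size"}',
]
parsed, failed = count_failed(outputs)
```

Reporting `failed` alongside the task metric makes it clear how much of a score gap comes from formatting rather than from wrong answers.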
Core Entities
Models
- eCeLLM-L
- eCeLLM-M
- eCeLLM-S
- Llama-2 13B-chat
- Mistral-7B Instruct-v0.2
- Flan-T5 XXL
- Phi-2
Metrics
- F1
- Macro F1
- HR@1
- Accuracy
- NDCG
- F_BERT (BERTScore)
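Three of these metrics are simple enough to compute by hand, which helps when checking numbers from an evaluation script. The functions below are standard textbook definitions written as a standalone sketch; the paper's exact variants (for example the NDCG gain function or cutoff) are not stated in this summary, so treat the linear-gain NDCG here as an assumption.

```python
import math

def hit_rate_at_1(ranked_lists, targets):
    """Fraction of cases where the top-ranked item is the target (HR@1)."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if ranked and ranked[0] == t)
    return hits / len(targets)

def ndcg(relevances, k=None):
    """NDCG with linear gain; `relevances` are graded labels in predicted order."""
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        relevances, ideal = relevances[:k], ideal[:k]
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg > 0 else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy inputs for illustration.
hr = hit_rate_at_1([["a", "b"], ["c", "d"]], ["a", "d"])
perfect = ndcg([3, 2, 1])
swapped = ndcg([1, 2, 3])
f1 = macro_f1(["pos", "pos", "neg"], ["pos", "neg", "neg"])
```

Macro F1 weights every class equally regardless of frequency, so it penalizes a model that ignores rare classes more than accuracy does.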
Datasets
- ECInstruct
- Amazon Review 2018
- AmazonQA
- Shopping Queries Dataset
- Amazon-Google Products
- MAVE
Benchmarks
- ECInstruct IND
- ECInstruct OOD (new products)
- Unseen instruction split
Context Entities
Models
- GPT-4 Turbo
- Gemini Pro
- Claude 2.1
- EcomGPT
- SoTA task-specific models (BERT, DeBERTaV3, gSASRec, RGCN, SUOpenTag, AVEQA)
Metrics
- BERTScore
- BLEURT
Datasets
- Amazon Review 2014
- Amazon Review 2018 (categories)
- Shopping Queries (ESCI)
- Amazon-Google product matching
Benchmarks
- MAVE
- AmazonQA evaluations

