ECInstruct dataset + eCeLLM models: instruction-tuned LLMs that beat GPT‑4 on many e‑commerce tasks

February 13, 2024 (8 min read)

Overview

Decision Snapshot: Ready For Pilot

The dataset and models are public and show consistent gains on many tasks. Training uses LoRA, which lowers tuning cost, but full deployment still requires evaluation on your own catalog and against your privacy constraints.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.87

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Bo Peng, Xinyi Ling, Ziru Chen, Huan Sun, Xia Ning

Why It Matters For Business

A single instruction‑tuned LLM trained on ECInstruct can replace many task‑specific models, improve handling of new products, and reduce engineering cost by centralizing e‑commerce functionality in one adaptable model.

Summary TLDR

The authors release ECInstruct, a large, high-quality instruction dataset for e‑commerce (≈116.5K samples across 10 real tasks) and use it to instruction‑tune general LLMs with LoRA to produce the eCeLLM family. eCeLLM models outperform general LLMs, an existing e‑commerce LLM, and many task-specific baselines in in‑domain tests (avg +10.7%) and out‑of‑domain tests on new products (avg +9.3%). The tuned models generalize to unseen instructions and benefit from more training data; code, data, and models are public at the authors' URL.

Problem Statement

E‑commerce systems need one model family that can handle many interdependent tasks and generalize to new users and new products. Existing e‑commerce models are task‑specific and fail at cold‑start/out‑of‑domain cases. Prior LLM uses in e‑commerce are limited by small or synthetic instruction data and sparse evaluation on real tasks.

Main Contribution

ECInstruct: an open, high‑quality instruction dataset for e‑commerce covering 10 real tasks, diverse instructions (6 per task), IND and OOD splits, and ~116.5K samples.

eCeLLM: a set of instruction‑tuned e‑commerce LLMs (large/medium/small) produced by LoRA fine‑tuning of 6 base models (e.g., Llama‑2, Mistral, Flan‑T5, Phi‑2).

Key Findings

Instruction‑tuned eCeLLM models beat the best baselines on in‑domain tests by an average of 10.7%.

Numbers: IND average improvement = 10.7% (Table 3)

Practical Use: If you fine‑tune a general LLM on a broad, high‑quality e‑commerce instruction dataset, expect measurable accuracy gains over off‑the‑shelf or task‑specific models on in‑domain tasks.

Evidence Ref: Table 3, Section 6.1

eCeLLM improves generalization to new products (OOD) by an average of 9.3% over best baselines on evaluated tasks.

Numbers: OOD average improvement = 9.3% (Table 4)

Practical Use: Instruction tuning on ECInstruct can reduce cold‑start failures for new products; deploy a tuned model to better handle unseen product categories.

Evidence Ref: Table 4, Section 6.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| IND average improvement | 10.7% average | best baseline per task (general LLMs, EcomGPT, SoTA task models) | +10.7% | ECInstruct IND (10 tasks) | Table 3; Section 6.1 | Table 3 |
| OOD (new products) average improvement | 9.3% average | best baseline per task | +9.3% | ECInstruct OOD (6 tasks) | Table 4; Section 6.2 | Table 4 |
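
The averages above are computed against the best baseline on each task. The sketch below shows one plausible way to aggregate such a number, assuming the improvement is relative (model minus best baseline, divided by best baseline) and then averaged over tasks. That aggregation rule, the function name, and all scores in the example are assumptions for illustration; the scores are placeholders, not results from the paper.

```python
from typing import Dict


def average_relative_improvement(
    model_scores: Dict[str, float],
    baseline_scores: Dict[str, Dict[str, float]],
) -> float:
    """Mean over tasks of (model - best_baseline) / best_baseline."""
    gains = []
    for task, score in model_scores.items():
        # Compare against the strongest baseline on this task only.
        best = max(baseline_scores[task].values())
        gains.append((score - best) / best)
    return sum(gains) / len(gains)


# Placeholder numbers for illustration only; they are NOT results from the paper.
model = {"task_a": 0.55, "task_b": 0.90}
baselines = {
    "task_a": {"general_llm": 0.50, "task_specific": 0.40},
    "task_b": {"general_llm": 0.80, "task_specific": 0.90},
}
avg_gain = average_relative_improvement(model, baselines)
```

Running the same function on your own per-task scores gives a figure that is comparable in spirit to the reported averages, provided the same metric is used for the model and every baseline on a given task.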

What To Try In 7 Days

Download ECInstruct and run a quick LoRA fine‑tune on a 7B open checkpoint for one product task to measure lift versus current tooling.

Add 4–6 paraphrased instructions per task in your pipeline to test unseen‑instruction robustness.

Run an OOD test (leave one product category out) to estimate real cold‑start gains before full deployment.
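
The leave-one-category-out test and the paraphrased-instruction check above can be set up with a few lines of data preparation. This is a minimal sketch, not the authors' pipeline: the record fields (`category`, `input`), the function names, and the example instructions are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, str]


def leave_one_category_out(
    records: List[Record], held_out: str
) -> Tuple[List[Record], List[Record]]:
    """Split records so that one product category is never seen in training."""
    train = [r for r in records if r["category"] != held_out]
    ood = [r for r in records if r["category"] == held_out]
    return train, ood


def attach_instructions(
    records: List[Record], instructions: List[str], seed: int = 0
) -> List[Record]:
    """Pair each record with one instruction sampled from the given pool."""
    rng = random.Random(seed)
    return [{**r, "instruction": rng.choice(instructions)} for r in records]


# Hypothetical toy data and instruction paraphrases.
records = [
    {"category": "Tools", "input": "cordless drill 18V"},
    {"category": "Tools", "input": "claw hammer 16oz"},
    {"category": "Sports", "input": "yoga mat 6mm"},
]
seen = ["Classify the product.", "Which category fits this product?"]
unseen = ["Assign this item to a category."]

train, ood = leave_one_category_out(records, held_out="Sports")
train = attach_instructions(train, seen)   # paraphrases used for tuning
ood = attach_instructions(ood, unseen)     # instruction held out for testing
```

Keeping the unseen instruction out of the tuning pool lets one evaluation run probe both new products and new phrasings, though the two effects are then confounded; splitting them into separate runs gives a cleaner read.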

Agent Features

Frameworks
Hugging Face Transformers, LoRA
Architectures
instruction-tuned LLM, multi-task generalist model, LoRA

Optimization Features

Training Optimization
LoRA; cosine LR scheduler with 5% warmup, 3 epochs
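
To make the schedule concrete, the sketch below computes a learning rate with linear warmup over the first 5% of steps followed by cosine decay to zero. It follows the common linear-warmup-plus-cosine formulation; the peak learning rate, the step counts, and the function name are assumptions for illustration and are not taken from the source.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-4,       # assumed value, not reported in this summary
    warmup_ratio: float = 0.05,  # 5% warmup, as stated above
) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Example: 3 epochs of a hypothetical 1,000 steps each.
total = 3 * 1000
schedule = [lr_at_step(s, total) for s in range(total + 1)]
```

With these example settings the rate peaks at step 150 (5% of 3,000) and falls to zero at the final step.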

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

ECInstruct currently covers 10 tasks but omits some emerging tasks (e.g., explanation generation) noted by the authors.

User profiling and personalization are limited because public datasets lack user identifiers and metadata.

When Not To Use

When you require heavy per‑user personalization that depends on private user IDs and full histories.

When your product catalog is multimodal (images/video) and the model needs image understanding (this work is text‑only).

Failure Modes

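Formatting and output‑parsing failures in structured tasks (the authors report the number of failed cases for attribute value extraction, AVE, and sequential recommendation, SR).

Generated answers may still contain hallucinations; outputs for the answer generation (AG) task were manually checked on a sample.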
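
One way to track the formatting failures mentioned above is to parse model output defensively and count the cases that cannot be parsed. The sketch below assumes a structured task whose expected output is a JSON list of `{"attribute": ..., "value": ...}` objects; that output format and both function names are assumptions for illustration, not the format used by the authors.

```python
import json
from typing import Dict, List, Optional


def parse_attribute_values(raw: str) -> Optional[List[Dict[str, str]]]:
    """Return the parsed attribute/value pairs, or None if the output is malformed."""
    # Tolerate chatter around the payload by locating the outermost brackets.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        items = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    cleaned = []
    for item in items:
        # Reject the whole output if any entry is missing a required key.
        if not (isinstance(item, dict) and "attribute" in item and "value" in item):
            return None
        cleaned.append({"attribute": str(item["attribute"]), "value": str(item["value"])})
    return cleaned


def failure_rate(outputs: List[str]) -> float:
    """Fraction of outputs that could not be parsed."""
    if not outputs:
        return 0.0
    failed = sum(parse_attribute_values(o) is None for o in outputs)
    return failed / len(outputs)


# Hypothetical model outputs.
outputs = [
    'Sure: [{"attribute": "color", "value": "red"}]',
    "The color is red.",
    '[{"attribute": "size"}]',
]
rate = failure_rate(outputs)
```

Reporting this rate next to the task metric makes it clear how much of an accuracy gap comes from unparseable output rather than wrong answers.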

Core Entities

Models

eCeLLM-L, eCeLLM-M, eCeLLM-S, Llama-2 13B-chat, Mistral-7B Instruct-v0.2, Flan-T5 XXL, Phi-2

Metrics

F1, Macro F1, HR@1, Accuracy, NDCG, F_BERT (BERTScore)
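
For readers less familiar with the ranking metrics in this list, the sketch below gives minimal reference implementations of HR@k (hit rate, with k = 1 giving HR@1) and NDCG@k. It assumes the linear-gain form of DCG; other NDCG variants use an exponential gain, and the source does not say which one the paper uses. Function names and example inputs are illustrative.

```python
import math
from typing import List, Sequence


def hit_rate_at_k(ranked_lists: Sequence[Sequence[str]], targets: Sequence[str], k: int = 1) -> float:
    """Fraction of cases where the target item appears in the top-k predictions."""
    hits = sum(target in ranked[:k] for ranked, target in zip(ranked_lists, targets))
    return hits / len(targets)


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """NDCG@k with linear gain; `relevances` are listed in predicted rank order."""

    def dcg(rels: List[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical predictions: the first list ranks the true next item first,
# the second list ranks it second.
hr1 = hit_rate_at_k([["a", "b"], ["c", "d"]], ["a", "d"], k=1)
perfect = ndcg_at_k([3, 2, 1], k=3)
swapped = ndcg_at_k([1, 3], k=2)
```

A perfectly ordered list scores 1.0 on NDCG, and any misordering of items with different relevance lowers the score.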

Datasets

ECInstruct, Amazon Review 2018, AmazonQA, Shopping Queries Dataset, Amazon-Google Products, MAVE

Benchmarks

ECInstruct IND, ECInstruct OOD (new products), Unseen instruction split

Context Entities

Models

GPT-4 Turbo, Gemini Pro, Claude 2.1, EcomGPT, SoTA task-specific models (BERT, DeBERTaV3, gSASRec, RGCN, SUOpenTag, AVEQA)

Metrics

BERTScore, BLEURT

Datasets

Amazon Review 2014, Amazon Review 2018 (categories), Shopping Queries (ESCI), Amazon-Google product matching

Benchmarks

MAVE, AmazonQA evaluations