ECInstruct dataset + eCeLLM models: instruction-tuned LLMs that beat GPT‑4 on many e‑commerce tasks

February 13, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

4

Authors

Bo Peng, Xinyi Ling, Ziru Chen, Huan Sun, Xia Ning

Links

Abstract / PDF

Why It Matters For Business

A single instruction‑tuned LLM trained on ECInstruct can replace many task‑specific models, improve handling of new products, and reduce engineering cost by centralizing e‑commerce functionality in one adaptable model.

Summary TLDR

The authors release ECInstruct, a large, high-quality instruction dataset for e‑commerce (≈116.5K samples across 10 real tasks) and use it to instruction‑tune general LLMs with LoRA to produce the eCeLLM family. eCeLLM models outperform general LLMs, an existing e‑commerce LLM, and many task-specific baselines in in‑domain tests (avg +10.7%) and out‑of‑domain tests on new products (avg +9.3%). The tuned models generalize to unseen instructions and benefit from more training data; code, data, and models are public at the authors' URL.

Problem Statement

E‑commerce systems need one model family that can handle many interdependent tasks and generalize to new users and new products. Existing e‑commerce models are task‑specific and fail at cold‑start/out‑of‑domain cases. Prior LLM uses in e‑commerce are limited by small or synthetic instruction data and sparse evaluation on real tasks.

Main Contribution

ECInstruct: an open, high‑quality instruction dataset for e‑commerce covering 10 real tasks, diverse instructions (6 per task), IND and OOD splits, and ~116.5K samples.

eCeLLM: a set of instruction‑tuned e‑commerce LLMs (large/medium/small) produced by LoRA fine‑tuning of 6 base models (e.g., Llama‑2, Mistral, Flan‑T5, Phi‑2).

Comprehensive evaluation: IND and OOD tests, unseen‑instruction tests, base‑model comparison, data‑scaling study, and comparisons to GPT‑4, EcomGPT, and SoTA task models.

Practical findings: instruction diversity and larger training sets improve generalization; a single generalist model can match or beat many task‑specific models.
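The instruction-diversity setup described above (several phrasings per task, plus an unseen-instruction test) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the instruction strings, the `split_instructions` and `build_examples` helpers, and the choice to hold out exactly one phrasing are all assumptions made for the example.

```python
import random

# Hypothetical instruction phrasings for one task (not taken from ECInstruct).
INSTRUCTIONS = [
    "Classify the sentiment of the review.",
    "Read the review and label its sentiment.",
    "What sentiment does this review express?",
    "Determine whether the review is positive, neutral, or negative.",
    "Assign a sentiment label to the following review.",
    "Identify the reviewer's sentiment.",
]


def split_instructions(instructions, n_unseen=1):
    """Reserve the last n_unseen phrasings for unseen-instruction testing."""
    return instructions[:-n_unseen], instructions[-n_unseen:]


def build_examples(samples, instructions, seed=0):
    """Pair each (input, output) sample with a randomly chosen instruction."""
    rng = random.Random(seed)
    return [
        {"instruction": rng.choice(instructions), "input": x, "output": y}
        for x, y in samples
    ]


seen, unseen = split_instructions(INSTRUCTIONS)
samples = [
    ("Great blender, works daily.", "positive"),
    ("Broke in a week.", "negative"),
]
train = build_examples(samples, seen)          # only seen phrasings
test_unseen = build_examples(samples, unseen)  # only the held-out phrasing
```

Keeping the held-out phrasing entirely out of training is what makes the unseen-instruction score a measure of robustness to wording rather than memorization.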

Key Findings

Instruction‑tuned eCeLLM models beat the best baselines on in‑domain tests by an average of 10.7%.

Numbers: IND average improvement = 10.7% (Table 3)

eCeLLM improves generalization to new products (OOD) by an average of 9.3% over best baselines on evaluated tasks.

Numbers: OOD average improvement = 9.3% (Table 4)

Against GPT‑4 Turbo the gap is much wider: eCeLLM‑L improves on it by about 39.6% on average across these e‑commerce tasks.

Numbers: Avg vs GPT‑4 Turbo = +39.6% (Section 6.1)

Training data size and instruction diversity materially improve performance; example: SR HR@1 rose from 0.085 to 0.526 for eCeLLM‑L when training size grew from 1K to 92K.

Numbers: SR HR@1: 0.085 → 0.526 (1K → 92K) (Figure 2 / Section 6.5)
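For readers unfamiliar with the metric, HR@1 (hit rate at 1) is the fraction of test cases in which the top-ranked item is the correct one. The sketch below uses the standard definition with made-up product IDs; `hit_rate_at_k` is an illustrative helper, not code from the paper. It also works out the size of the reported jump as a ratio.

```python
def hit_rate_at_k(ranked_lists, targets, k=1):
    """Fraction of cases whose target appears in the top-k ranked items."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Hypothetical ranked recommendations and ground-truth next items.
predictions = [
    ["B01", "B07", "B03"],
    ["B09", "B02", "B05"],
    ["B04", "B08", "B06"],
    ["B11", "B10", "B12"],
]
ground_truth = ["B01", "B02", "B04", "B12"]

hr1 = hit_rate_at_k(predictions, ground_truth, k=1)  # 2 of 4 hits -> 0.5
hr3 = hit_rate_at_k(predictions, ground_truth, k=3)  # 4 of 4 hits -> 1.0

# Reported SR HR@1 values at 1K and 92K training samples.
fold_increase = 0.526 / 0.085  # roughly a 6.2x increase
```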

Results

IND average improvement

Value: 10.7% average

Baseline: best baseline per task (general LLMs, EcomGPT, SoTA task models)

OOD (new products) average improvement

Value: 9.3% average

Baseline: best baseline per task

Average vs GPT‑4 Turbo

Value: ≈39.6% average improvement

Baseline: GPT‑4 Turbo

Who Should Care

What To Try In 7 Days

Download ECInstruct and run a quick LoRA fine‑tune on a 7B open checkpoint for one product task to measure lift versus current tooling.

Add 4–6 paraphrased instructions per task in your pipeline to test unseen‑instruction robustness.

Run an OOD test (leave one product category out) to estimate real cold‑start gains before full deployment.
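The leave-one-category-out test suggested above can be sketched in a few lines. The record fields (`product_id`, `category`, `text`) and the `leave_one_category_out` helper are assumptions for illustration; adapt them to your own catalog schema.

```python
def leave_one_category_out(records, held_out):
    """Split records into in-domain (train/IND test) and OOD by category."""
    ind = [r for r in records if r["category"] != held_out]
    ood = [r for r in records if r["category"] == held_out]
    return ind, ood


# Hypothetical catalog records.
records = [
    {"product_id": "p1", "category": "Tools", "text": "Cordless drill, 18V"},
    {"product_id": "p2", "category": "Tools", "text": "Claw hammer, 16 oz"},
    {"product_id": "p3", "category": "Sports", "text": "Yoga mat, 6 mm"},
    {"product_id": "p4", "category": "Electronics", "text": "USB-C cable, 2 m"},
]

ind, ood = leave_one_category_out(records, held_out="Tools")

# Guard against leakage: no product should appear on both sides of the split.
overlap = {r["product_id"] for r in ind} & {r["product_id"] for r in ood}
```

Train and validate only on `ind`, then report the same metric on `ood`; the difference between the two scores is a rough estimate of the cold-start penalty.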

Agent Features

Frameworks

  • Huggingface transformers
  • LoRA

Architectures

  • instruction-tuned LLM
  • multi-task generalist model
  • LoRA

Optimization Features

Training Optimization

  • LoRA
  • cosine LR scheduler with 5% warmup, 3 epochs
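The schedule named above (linear warmup over the first 5% of steps, then cosine decay) can be written out in plain Python to see its shape. This is a sketch of the common formulation, not the authors' training script: the peak learning rate and the 1,000 steps per epoch are assumed values, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


EPOCHS = 3               # from the summary above
STEPS_PER_EPOCH = 1000   # assumed for illustration
PEAK_LR = 1e-4           # assumed for illustration

total = EPOCHS * STEPS_PER_EPOCH
schedule = [lr_at_step(s, total, PEAK_LR) for s in range(total + 1)]
# schedule starts at 0, peaks at step 150 (5% of 3000), and decays to 0.
```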

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • ECInstruct currently covers 10 tasks but omits some emerging tasks (e.g., explanation generation) noted by the authors.
  • User profiling and personalization are limited because public datasets lack user identifiers and metadata.
  • Training sets are downsampled to ≤10K per task for compute reasons; scaling further would require extra compute and validation.

When Not To Use

  • When you require heavy per‑user personalization that depends on private user IDs and full histories.
  • When your product catalog is multimodal (images/video) and the model needs image understanding (this work is text‑only).
  • When strict regulatory or privacy rules forbid sending product/user text to third‑party or public checkpoints.

Failure Modes

  • Formatting and output-parsing failures in structured tasks (the paper reports the number of failed cases for AVE and SR).
  • Generative answers may still contain hallucinations; the AG task was manually checked on a sample.
  • Domain bias: the Amazon‑centric data and held‑out categories may not reflect other marketplaces.
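One way to track the output-parsing failure mode listed above is to validate every model response strictly and count the ones that do not parse. The sketch below assumes a JSON list of attribute-value objects as the expected output; that format and the `parse_attribute_values` helper are hypothetical and are not the format used in the paper.

```python
import json


def parse_attribute_values(raw):
    """Return a list of {"attribute", "value"} dicts, or None if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not (isinstance(item, dict) and {"attribute", "value"} <= item.keys()):
            return None
    return data


# Hypothetical model outputs: one valid, one free text, one wrong shape.
outputs = [
    '[{"attribute": "color", "value": "red"}]',
    "The color is red.",
    '{"attribute": "size", "value": "XL"}',
]

parsed = [parse_attribute_values(o) for o in outputs]
n_failed = sum(p is None for p in parsed)  # counts unparseable responses
```

Reporting `n_failed` alongside the task metric keeps formatting errors visible instead of silently scoring them as wrong answers.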

Core Entities

Models

  • eCeLLM-L
  • eCeLLM-M
  • eCeLLM-S
  • Llama-2 13B-chat
  • Mistral-7B Instruct-v0.2
  • Flan-T5 XXL
  • Phi-2

Metrics

  • F1
  • Macro F1
  • HR@1
  • Accuracy
  • NDCG
  • F_BERT (BERTScore)

Datasets

  • ECInstruct
  • Amazon Review 2018
  • AmazonQA
  • Shopping Queries Dataset
  • Amazon-Google Products
  • MAVE

Benchmarks

  • ECInstruct IND
  • ECInstruct OOD (new products)
  • Unseen instruction split

Context Entities

Models

  • GPT-4 Turbo
  • Gemini Pro
  • Claude 2.1
  • EcomGPT
  • SoTA task-specific models (BERT, DeBERTaV3, gSASRec, RGCN, SUOpenTag, AVEQA)

Metrics

  • BERTScore
  • BLEURT

Datasets

  • Amazon Review 2014
  • Amazon Review 2018 (categories)
  • Shopping Queries (ESCI)
  • Amazon-Google product matching

Benchmarks

  • MAVE
  • AmazonQA evaluations