Turn images and product text into millions of well‑matched SEO landing pages using VLM + LLM + CLIP

March 1, 2025 · 7 min

Overview

Production Readiness

0.9

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Faye Zhang, Jasmine Wan, Qianyu Cheng, Jinfeng Rao

Why It Matters For Business

Automate landing-page generation from content to expand topic coverage, improve collection relevance, and increase organic search indexing with less manual curation.

Summary TLDR

PinLanding is a production system that builds keyword landing pages (KLPs) by first extracting attributes from product images and text (using GPT-4V), consolidating and filtering those attributes, and then training a CLIP-style dual encoder to match products to attributes at web scale. The system auto-generates natural-language collection titles with GPT-4 and assembles feeds with attribute-overlap matching on Apache Spark. In production it created 4.2M shopping pages, increased topic coverage 4×, improved human-evaluated collection precision by 14.29% over search-log baselines, and scored 99.7% Recall@10 on Fashion200K.
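The product-to-attribute matching step can be illustrated with a minimal sketch. It assumes each product and each attribute has already been embedded into a shared vector space by the two encoders; the toy 3-dimensional vectors, the `match_attributes` helper, and the 0.5 threshold are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_attributes(product_vec, attribute_vecs, threshold=0.5):
    """Return (attribute, score) pairs at or above the threshold, best first."""
    scored = [(name, cosine(product_vec, vec)) for name, vec in attribute_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy embeddings standing in for encoder outputs (hypothetical values).
attribute_vecs = {
    "floral": [1.0, 0.0, 0.0],
    "midi dress": [0.0, 1.0, 0.0],
    "leather": [0.0, 0.0, 1.0],
}
product_vec = [0.8, 0.6, 0.0]

matches = match_attributes(product_vec, attribute_vecs)
for name, score in matches:
    print(f"{name}: {score:.2f}")
```

Because attributes are embedded once and reused, scoring a new product only requires one pass of the product encoder plus cheap similarity computations, which is what makes this style of matching practical at large scale.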

Problem Statement

Manual curation and search-log approaches either don't scale to millions of topical landing pages or miss content and produce imprecise collections. Platforms need a scalable way to create high-precision, searchable collections directly from content rather than relying on user queries.

Main Contribution

Content-first pipeline that derives landing page topics from product content rather than user search logs.

Two-phase, multi-modal system: (1) VLM (GPT-4V) for free-form attribute extraction and human+LLM curation; (2) CLIP-style dual encoder for scalable product-to-attribute matching.

Automated query (collection title) synthesis with GPT-4 and distributed attribute-based feed generation on Apache Spark, deployed to 4.2M landing pages.
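As a rough illustration of attribute-based feed generation, the sketch below assembles a feed on a single machine. It assumes that "overlap" means a product must carry every attribute of the collection, which is an assumption rather than the paper's stated rule; the production system runs this kind of matching on Apache Spark, and `build_feed` is a hypothetical helper name.

```python
def build_feed(collection_attrs, product_attrs):
    """Return ids of products whose attribute set contains every collection attribute.

    Full containment is an illustrative assumption; a production system could
    instead rank products by partial overlap.
    """
    required = set(collection_attrs)
    return sorted(
        pid for pid, attrs in product_attrs.items() if required <= set(attrs)
    )

# Hypothetical product-to-attribute assignments.
product_attrs = {
    "p1": {"floral", "midi dress", "summer"},
    "p2": {"floral", "leather"},
    "p3": {"midi dress", "floral"},
}

feed = build_feed(["floral", "midi dress"], product_attrs)
print(feed)
```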

Key Findings

Very high attribute retrieval on a public benchmark

Numbers: Recall@10 = 99.7% on Fashion200K (Table 2, Sec 4.2.1)
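For readers unfamiliar with the metric, here is a minimal sketch of Recall@k. It assumes the hit-rate definition, where a query counts as a success if any ground-truth item appears in its top k results; the paper's exact evaluation protocol may differ, and the sample data is made up.

```python
def recall_at_k(ranked_lists, relevant_sets, k=10):
    """Share of queries with at least one relevant item in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if relevant & set(ranked[:k])
    )
    return hits / len(ranked_lists)

# Two toy queries: the first finds its target in the top 2, the second never does.
ranked_lists = [["a", "b", "c"], ["d", "e", "f"]]
relevant_sets = [{"b"}, {"z"}]

score = recall_at_k(ranked_lists, relevant_sets, k=2)
print(score)
```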

Better collection precision than search-log approach

Numbers: Average precision@10 improved from 0.84 to 0.96 (+14.29% relative) (Table 3, Sec 4.2.2)
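The 14.29% figure is a relative gain over the 0.84 baseline, not an absolute difference. The sketch below shows a simple precision@k computation and that arithmetic; the helper names and the toy feed are hypothetical.

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k slots filled by items a reviewer marked relevant.

    Dividing by k (not by the number of items returned) penalizes short feeds.
    """
    top = ranked_items[:k]
    return sum(1 for item in top if item in relevant) / k

def relative_improvement(baseline, value):
    """Gain of `value` over `baseline`, expressed as a fraction of the baseline."""
    return (value - baseline) / baseline

# Toy collection: 10 products shown, reviewers rejected one of them.
feed = [f"p{i}" for i in range(10)]
relevant = set(feed) - {"p3"}
p10 = precision_at_k(feed, relevant)

# Reported averages: 0.84 (search-log approach) versus 0.96.
gain_pct = round(relative_improvement(0.84, 0.96) * 100, 2)
print(p10, gain_pct)
```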

Large production scale and SEO impact

Numbers: 4.2M landing pages produced; 4× topic coverage; +35% search engine index rate (Sec 4.2.3)

Results

Recall@10

Value: 99.7%

Baseline: 40.0–71.4% (various prior models)

Average collection precision@10

Value: 0.96

Baseline: 0.84 (search-log approach)

Production landing pages

Value: 4.2M pages

Baseline: ≈1M from search-log approach

Search engine index rate

Value: +35%

Baseline: Pinterest baseline

Processing speed improvement

Value: 92% reduction in processing time

Baseline: previous unoptimized distributed matching

Who Should Care

What To Try In 7 Days

Run GPT-4V (or other VLM) on a small catalog slice to extract candidate attributes.

Curate a compact attribute vocabulary (frequency + deduplication) and sample-check for bias.

Train a lightweight CLIP dual-encoder on the labeled slice and build a few pilot KLPs to measure precision@10 and indexability.
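The vocabulary-curation step above can be prototyped in a few lines. This is a minimal sketch assuming that deduplication means normalizing case and whitespace and that rare attributes are dropped below a count threshold; `curate_vocabulary`, `min_count`, and the sample extractions are hypothetical.

```python
from collections import Counter

def normalize(attr):
    """Lowercase and collapse whitespace so trivially different spellings merge."""
    return " ".join(attr.lower().split())

def curate_vocabulary(extracted, min_count=2):
    """Keep normalized attributes seen at least `min_count` times, most frequent first."""
    counts = Counter(normalize(a) for attrs in extracted for a in attrs)
    return [attr for attr, n in counts.most_common() if n >= min_count]

# Hypothetical per-product attribute lists as a VLM might return them.
extracted = [
    ["Floral", "midi  dress"],
    ["floral", "Leather"],
    ["FLORAL", "Midi Dress"],
]

vocab = curate_vocabulary(extracted)
print(vocab)
```

A real pilot would likely add semantic deduplication (for example merging synonyms) and a manual sample check of the surviving attributes, since string normalization alone does not catch biased or near-duplicate labels.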

Optimization Features

Infra Optimization

  • using 8 A100 GPUs for 12-hour training runs
  • memory caching of frequent mappings

Model Optimization

  • CLIP dual-encoder fine-tuning
  • frequency-based attribute reweighting to correct long-tail smoothing
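One common way to reweight by frequency is to give each attribute a weight proportional to an inverse power of its count. The sketch below shows that idea only; the formula, the `alpha` exponent, and the counts are assumptions for illustration and are not the paper's actual reweighting scheme.

```python
def attribute_weights(counts, alpha=0.5):
    """Inverse-power frequency weights, rescaled so the mean weight is 1.

    alpha=0 gives uniform weights; larger alpha boosts rare attributes more.
    This formula is an illustrative assumption, not the paper's method.
    """
    raw = {attr: n ** -alpha for attr, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {attr: w / mean for attr, w in raw.items()}

# Hypothetical training-set counts: one head attribute, one long-tail attribute.
counts = {"floral": 10000, "puff sleeve": 100}

weights = attribute_weights(counts)
for attr, w in weights.items():
    print(f"{attr}: {w:.3f}")
```

The exponent is the kind of hyperparameter that would need retuning when the attribute distribution changes, for instance when moving to a new product domain.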

System Optimization

  • distributed matching on Apache Spark
  • data partitioning and join optimization
  • minimum-product thresholds per collection to guarantee quality
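The minimum-product threshold amounts to a filter over candidate collections. A minimal sketch, assuming feeds are held as a mapping from collection title to product ids; the threshold value of 3 and the sample titles are hypothetical.

```python
def keep_collections(feeds, min_products=3):
    """Drop collections that have fewer than `min_products` matched products."""
    return {
        title: items for title, items in feeds.items() if len(items) >= min_products
    }

# Hypothetical candidate collections and their matched product ids.
feeds = {
    "floral midi dresses": ["p1", "p3", "p7", "p9"],
    "leather floral jackets": ["p2"],
}

published = keep_collections(feeds)
print(sorted(published))
```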

Training Optimization

  • FusedAdam optimizer for memory and speed
  • pretraining initialization from CLIP encoders

Inference Optimization

  • attribute score caching for O(1) lookups
  • batching and grouped product processing
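Both inference ideas can be sketched with standard-library Python: a dictionary gives average-case O(1) lookups for scores that were already computed, and a small generator groups products into batches. `AttributeScoreCache`, `batched`, and the stand-in scoring function are hypothetical and only illustrate the pattern.

```python
class AttributeScoreCache:
    """Memoizes (product, attribute) scores so repeat lookups skip the model."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying scorer was called

    def get(self, product_id, attribute):
        key = (product_id, attribute)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(product_id, attribute)
        return self._cache[key]

def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in scorer; a real system would call the dual encoder here.
def fake_score(product_id, attribute):
    return 1.0 if attribute in product_id else 0.0

cache = AttributeScoreCache(fake_score)
first = cache.get("floral-dress-01", "floral")
second = cache.get("floral-dress-01", "floral")  # served from the cache

batches = list(batched(["p1", "p2", "p3", "p4", "p5"], size=2))
print(first, second, cache.misses, batches)
```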

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Relies on a VLM (GPT-4V) and commercial LLMs; reproducibility is limited without access to these models.
  • Attribute-based method struggles to capture emergent cultural or trend-based concepts that are not decomposable into fixed attributes.
  • Human review was applied to a subset (~4,000 attributes), leaving open the risk of unchecked biases in long-tail attributes.

When Not To Use

  • If you need to surface emergent social trends or highly abstract style labels that require discourse signals.
  • When proprietary VLM/LLM access is unavailable or cost-prohibitive.
  • For small catalogs where manual curation is already low-cost and highly accurate.

Failure Modes

  • Overfitting to merchant-provided text or image styles, causing poor generalization to different merchandising formats.
  • Long-tail attribute misweighting if frequency reweighting hyperparameters are not tuned for new domains.
  • LLM/VLM hallucinated attributes that pass automated filters but introduce biased or unsafe labels.

Core Entities

Models

  • GPT-4V
  • GPT-4
  • CLIP (dual-encoder)
  • CLIP H/14 (1.3B)
  • ViT (vision encoder)
  • BERT

Metrics

  • Recall@10
  • Precision@10
  • search engine index rate
  • topic coverage increase

Datasets

  • Fashion200K
  • internal 200k fashion product dataset
  • production catalog (millions of product pins)

Benchmarks

  • Fashion200K