Youku-mPLUG: 10M filtered Chinese video-text pairs plus human benchmarks and models

June 7, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

4

Authors

Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

Why It Matters For Business

Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.

Summary TLDR

This paper releases Youku-mPLUG: a public Chinese video-language corpus of 10 million video–text pairs filtered from 400M raw videos, plus a human-annotated benchmark of roughly 0.3–0.37M clips covering retrieval, captioning, and category classification. The authors also release models (ALPRO, mPLUG-2) pre-trained on the dataset and propose mPLUG-video, a modular decoder-only model that pairs a frozen LLM with a trainable video encoder and abstractor. Pretraining on Youku-mPLUG lifts category classification top-1 from 63.51% to 78.15% (+23.1% relative). mPLUG-video (2.7B) reaches 80.57% top-1 on category classification and 68.9 CIDEr on captioning on the provided benchmarks. The dataset, code, and models are available on GitHub.

Problem Statement

The Chinese video-language community lacks a large, public, high-quality dataset and shared benchmarks. Existing large corpora are mostly English or proprietary, which slows model development and prevents fair comparisons for Chinese video-language models.

Main Contribution

A public Chinese video-language pre-training dataset Youku-mPLUG with 10 million high-quality video-text pairs filtered from 400M raw videos.

A human-annotated downstream benchmark (≈0.3–0.37M clips) covering video-text retrieval, video captioning, and video category classification.

Release of pre-trained models (ALPRO, mPLUG-2) and a new modular decoder-only model mPLUG-video that uses a frozen LLM and small trainable modules.

Demonstration that pretraining on Youku-mPLUG yields substantial gains (e.g., up to +23.1% relative on category classification top-1) and state-of-the-art results on the benchmark.

Key Findings

Pretraining on Youku-mPLUG substantially improves category classification.

Numbers: Top-1: 63.51% -> 78.15% (+23.1% relative)
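
The "+23.1% relative" figure follows from the two top-1 accuracies quoted above. A minimal sketch of the arithmetic, where the helper name `relative_gain` is ours for illustration and not from the paper:

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline`, in percent of `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Top-1 accuracies reported above, in percent.
no_pretrain = 63.51
with_pretrain = 78.15

absolute_points = with_pretrain - no_pretrain
relative_percent = relative_gain(no_pretrain, with_pretrain)

print(f"absolute gain: {absolute_points:.2f} points")
print(f"relative gain: {relative_percent:.1f}%")
```

The absolute gain is 14.64 percentage points; dividing by the 63.51% baseline gives the relative gain of about 23.1%.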

mPLUG-video (2.7B) reaches top-1 80.57% on category classification.

Numbers: Top-1: 80.57% (Youku-mPLUG test)

mPLUG-video (2.7B) achieves strong captioning measured by CIDEr.

Numbers: CIDEr: 68.9 on Youku-mPLUG caption test

Freezing the language model hurts retrieval performance.

Numbers: Retrieval R@1: mPLUG-2 38.45 vs. mPLUG-video 7.62

Dataset scale and filtering: 10M pairs filtered from 400M raw videos with safety checks.

Numbers: 10,000,000 pairs, filtered from 400,000,000 raw videos

Modular frozen-LLM approach keeps trainable parameters small.

Numbers: Only 1.7% of parameters are trainable when the LLM is scaled up to Bloomz

Results

Video Category Classification Top-1

Value: 80.57% (mPLUG-video 2.7B)

Baseline: 78.15% (ALPRO)

Video Captioning CIDEr

Value: 68.9 (mPLUG-video 2.7B)

Baseline: 67.7 (mPLUG-2)

Video Retrieval R@1 (V2T)

Value: 38.45% (mPLUG-2)

Baseline: 27.00% (ALPRO)

Effect of Youku-mPLUG pretraining on category Top-1

Value: 63.51% -> 78.15%

Baseline: 63.51% (no pretraining)

What To Try In 7 Days

Download the dataset and benchmark subset from the repo and run a quick eval on a public checkpoint.

Fine-tune the released mPLUG-video checkpoint on your domain-specific labels using the frozen-LLM setup to save compute.

Evaluate retrieval vs generation trade-offs: test adding a contrastive head if retrieval matters for your app.

Agent Features

Frameworks

  • TimeSformer
  • CLIP (Chinese)
  • Bloomz

Architectures

  • decoder-only
  • modularized (frozen LLM + trainable encoder/abstractor)

Optimization Features

Token Efficiency

  • Reduce video sequence length with M learnable tokens

Model Optimization

  • Keep large LLM frozen and only train small modules (1.7% params reported)
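
To make the "share of trainable parameters" concrete, here is a minimal sketch of how such a percentage is computed. The module names and parameter counts are hypothetical, chosen only to show the arithmetic. They are not the sizes reported in the paper, so the result is not expected to match 1.7%.

```python
def trainable_fraction(param_counts, trainable_names):
    """Return the share of all parameters that receive gradient updates."""
    total = sum(param_counts.values())
    trainable = sum(param_counts[name] for name in trainable_names)
    return trainable / total


# Hypothetical parameter counts, for illustration only (NOT from the paper).
param_counts = {
    "frozen_llm": 1_000_000_000,
    "video_encoder": 15_000_000,
    "visual_abstractor": 5_000_000,
}
# Only the small modules are trained; the LLM stays frozen.
trainable_names = ["video_encoder", "visual_abstractor"]

fraction = trainable_fraction(param_counts, trainable_names)
print(f"trainable share: {fraction:.2%}")
```

Because the frozen LLM dominates the total, the trainable share stays small even as the LLM grows, which is why the approach keeps fine-tuning cost low.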

System Optimization

  • Use frozen LLM to lower fine-tuning compute

Training Optimization

  • Sparse frame sampling (8 frames per clip)
  • Batch size 512, 10 pretraining epochs
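
The summary states only that 8 frames are sampled sparsely per clip. One common way to do this, assumed here and not confirmed by the source, is to split the clip into equal segments and take the middle frame of each. The function name `sample_frame_indices` is ours.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over a clip.

    The clip is split into equal-length segments and the middle frame of
    each segment is chosen. Clips shorter than `num_samples` repeat frames.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [
        min(num_frames - 1, int(segment * i + segment / 2))
        for i in range(num_samples)
    ]


# A 240-frame clip is reduced to 8 evenly spaced frames.
print(sample_frame_indices(240))
```

Training pipelines often pick a random frame within each segment instead of the middle one; the deterministic variant above is easier to reproduce at evaluation time.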

Inference Optimization

  • Visual abstractor reduces video token length via learnable queries
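
The idea of shortening the video sequence with learnable queries can be sketched as a single cross-attention step: each of M query vectors attends over all video tokens, so the output always has M tokens regardless of the input length. This is a simplified illustration in pure Python, not the paper's implementation. It uses one head and no projection matrices, the queries are random stand-ins for trained parameters, and the sizes (8 frames x 16 patch tokens, M = 6) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def abstract_tokens(video_tokens, queries):
    """Single-head cross-attention from queries to video tokens.

    Returns one output vector per query, so the output length equals
    len(queries) no matter how many video tokens come in.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        scores = [
            scale * sum(q * t for q, t in zip(query, token))
            for token in video_tokens
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * token[d] for w, token in zip(weights, video_tokens))
            for d in range(dim)
        ])
    return outputs


random.seed(0)
dim = 4
num_video_tokens = 8 * 16  # hypothetical: 8 frames x 16 patch tokens each
num_queries = 6            # hypothetical value of M

video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_video_tokens)]
# In a real model these queries are trained; random values stand in here.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

compressed = abstract_tokens(video_tokens, queries)
print(len(video_tokens), "->", len(compressed))
```

The frozen LLM then receives M tokens instead of the full video sequence, which is where the inference saving comes from.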

Reproducibility

Code Available

  • yes

Data Available

  • yes

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Data reflects language and concepts available at collection time and may miss future terms or new visuals.
  • Content skews to Chinese Internet culture and may not generalize cross-culturally.
  • Dataset excludes very long videos and long-form text, limiting use for full-length movies or long transcripts.

When Not To Use

  • When you need up-to-date cultural or temporal facts not present at collection time.
  • For very long-video understanding tasks (full movies, long transcripts).
  • If you require unfiltered raw web data (the corpus is safety-filtered).

Failure Modes

  • Freezing the language model reduces cross-modal alignment and retrieval performance.
  • Auto-generated category labels used in the initial selection are imperfect (~94% historical accuracy) and require careful manual verification.
  • Cultural and time biases in training data may cause model blind spots on novel or non-Chinese content.

Core Entities

Models

  • mPLUG-video
  • mPLUG-2
  • ALPRO
  • Bloomz
  • TimeSformer
  • CLIP (Chinese)

Metrics

  • Accuracy
  • CIDEr
  • BLEU-4
  • METEOR
  • ROUGE
  • Recall@k (R@1, R@5, R@10)
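
Recall@k is the share of queries whose correct match appears among the top k ranked candidates. A minimal sketch follows, assuming one ground-truth candidate per query placed on the diagonal of the similarity matrix. That layout is a common convention and an assumption here; the toy scores are made up.

```python
def recall_at_k(similarity, k):
    """Recall@k in percent.

    similarity[i][j] is the score between query i and candidate j.
    The correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct candidate = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores).
similarity = [
    [0.9, 0.1, 0.3],  # correct candidate ranked first
    [0.4, 0.2, 0.8],  # correct candidate ranked third
    [0.5, 0.1, 0.3],  # correct candidate ranked second
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(similarity, k):.2f}")
```

For video-to-text retrieval the rows are videos and the columns are texts; transposing the matrix gives the text-to-video direction.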

Datasets

  • Youku-mPLUG
  • WebVid10M
  • HowTo100M
  • ALIVOL-10M
  • Kwai-SVC-11M
  • CREATE-10M
  • CNVid-3.5M

Benchmarks

  • Youku-mPLUG benchmark (category, retrieval, caption)
  • MSRVTT
  • VATEX
  • CREATE-210K