Youku-mPLUG: 10M filtered Chinese video-text pairs plus human benchmarks and models

June 7, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Dataset release plus model checkpoints and clear benchmarks make this work actionable for teams building Chinese video-language features; freezing large LLMs reduces tuning cost but can harm retrieval without extra heads.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.

Who Should Care

Summary TLDR

This paper releases Youku-mPLUG: a public Chinese video-language corpus of 10 million video-text pairs filtered from 400M raw videos, plus a human-annotated benchmark of roughly 0.3–0.37M clips covering retrieval, captioning, and category classification. The authors also release models (ALPRO, mPLUG-2) pre-trained on the dataset and propose mPLUG-video, a modular decoder-only model that pairs a frozen LLM with a trainable video encoder and abstractor. Pretraining on Youku-mPLUG yields large gains: up to a 23.1% relative improvement in top-1 accuracy on category classification. mPLUG-video (2.7B) reaches 80.57% top-1 accuracy and 68.9 CIDEr on the provided benchmarks. The dataset, code, and models are available on GitHub.

Problem Statement

The Chinese video-language community lacks a large, public, high-quality dataset and shared benchmarks. Existing large corpora are mostly English or proprietary, which slows model development and prevents fair comparisons for Chinese video-language models.

Main Contribution

A public Chinese video-language pre-training dataset Youku-mPLUG with 10 million high-quality video-text pairs filtered from 400M raw videos.

A human-annotated downstream benchmark (≈0.3–0.37M clips) covering video-text retrieval, video captioning, and video category classification.

Key Findings

Pretraining on Youku-mPLUG substantially improves category classification.

Numbers: Top-1: 63.51% -> 78.15% (+23.1% relative)

Practical Use: If you pretrain on this 10M Chinese dataset, expect large accuracy gains for video category tasks compared to no pretraining.

Evidence Ref: Table 6; Abstract
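The headline gain is a relative figure, which is easy to confuse with the absolute change in percentage points. As a quick check, the short sketch below recomputes both from the two top-1 scores quoted in this finding; the function names are illustrative and not taken from the paper or its code.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference between the two scores, in percentage points."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return (improved - baseline) / baseline * 100.0


# Top-1 accuracy values quoted in the finding above.
baseline_top1 = 63.51    # without Youku-mPLUG pretraining
pretrained_top1 = 78.15  # with Youku-mPLUG pretraining

abs_pp = absolute_gain(baseline_top1, pretrained_top1)
rel_pct = relative_gain(baseline_top1, pretrained_top1)
print(f"absolute: +{abs_pp:.2f} points, relative: +{rel_pct:.1f}%")
# prints "absolute: +14.64 points, relative: +23.1%"
```

So the +23.1% figure corresponds to a 14.64-point absolute improvement over the non-pretrained baseline.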

mPLUG-video (2.7B) reaches top-1 80.57% on category classification.

Numbers: Top-1: 80.57% (Youku-mPLUG test)

Practical Use: Use mPLUG-video as a strong baseline for Chinese video category classification on this benchmark.

Evidence Ref: Table 4

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Video Category Classification Top-1 | 80.57% (mPLUG-video 2.7B) | 78.15% (ALPRO) | +2.42% abs | Youku-mPLUG test | Table 4 reports mPLUG-video (2.7B) Top-1 80.57% and ALPRO 78.15% | Table 4
Video Captioning CIDEr | 68.9 (mPLUG-video 2.7B) | 67.7 (mPLUG-2) | +1.2 CIDEr | Youku-mPLUG caption test | Table 4 shows CIDEr 68.9 vs 67.7 | Table 4

What To Try In 7 Days

Download the dataset and benchmark subset from the repo and run a quick eval on a public checkpoint.

Fine-tune the released mPLUG-video checkpoint on your domain-specific labels using the frozen-LLM setup to save compute.

Evaluate retrieval vs generation trade-offs: test adding a contrastive head if retrieval matters for your app.
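For the retrieval experiment, a contrastive head is usually trained with a symmetric InfoNCE-style loss over matched video and text embeddings. The sketch below is a minimal, dependency-free illustration of that loss, not the authors' implementation; the function names and the temperature of 0.07 are assumptions chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: video_embs[i] is the positive pair of text_embs[i].

    temperature=0.07 is an assumed default, not a value from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarity matrix, scaled by the temperature.
    sims = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
             for j in range(n)] for i in range(n)]
    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(sims[i])[i] for i in range(n)) / n
    # Text-to-video: same idea over columns.
    t2v = -sum(log_softmax([sims[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return (v2t + t2v) / 2


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(videos, aligned_texts)
loss_shuffled = contrastive_loss(videos, shuffled_texts)
print(loss_aligned < loss_shuffled)  # prints "True"
```

In a real pilot the embeddings would come from small trainable projection layers on top of the video and text features, and the loss would be computed by the training framework rather than in pure Python.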

Agent Features

Frameworks
TimeSformer, CLIP (Chinese), Bloomz
Architectures
decoder-only, modularized (frozen LLM + trainable encoder/abstractor)

Optimization Features

Token Efficiency
Reduce video sequence length with M learnable tokens
Model Optimization
Keep large LLM frozen and only train small modules (1.7% params reported)
System Optimization
Use frozen LLM to lower fine-tuning compute
Training Optimization
Sparse frame sampling (8 frames per clip); batch size 512; 10 pretraining epochs
Inference Optimization
Visual abstractor reduces video token length via learnable queries
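The token-reduction idea listed above can be illustrated with a single cross-attention step in which M learnable queries attend over all video tokens, so the LLM receives M vectors regardless of how many tokens the video encoder produced. The sketch below is a simplified, assumption-laden illustration: it uses one attention head with no learned projections, random values in place of trained queries, and made-up sizes (16 patches per frame, dimension 4, M = 6). It is not the released mPLUG-video abstractor.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def abstract_tokens(queries, video_tokens):
    """Cross-attend M queries over N video tokens and return M vectors."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every video token.
        scores = [scale * sum(a * b for a, b in zip(q, tok))
                  for tok in video_tokens]
        weights = softmax(scores)
        # Weighted average of the video tokens.
        outputs.append([sum(w * tok[c] for w, tok in zip(weights, video_tokens))
                        for c in range(dim)])
    return outputs


random.seed(0)
# 8 frames per clip matches the summary; the other sizes are illustrative.
N_FRAMES, PATCHES_PER_FRAME, DIM, M = 8, 16, 4, 6
video_tokens = [[random.gauss(0, 1) for _ in range(DIM)]
                for _ in range(N_FRAMES * PATCHES_PER_FRAME)]
# In a trained model these queries would be learned parameters.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]

compressed = abstract_tokens(queries, video_tokens)
print(len(video_tokens), "->", len(compressed))  # prints "128 -> 6"
```

The output length depends only on M, which is why this kind of module keeps the LLM's input sequence short even as frame count or resolution grows.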

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Data reflects language and concepts available at collection time and may miss future terms or new visuals.

Content skews to Chinese Internet culture and may not generalize cross-culturally.

When Not To Use

When you need up-to-date cultural or temporal facts not present at collection time.

For very long-video understanding tasks (full movies, long transcripts).
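One way to see the long-video limit is to look at what 8-frame sparse sampling covers at different durations. The sketch below assumes uniform segment-center sampling and a 25 fps frame rate; both are illustrative assumptions, since the summary only states that 8 frames are sampled per clip.

```python
def sample_frame_indices(total_frames, num_samples=8):
    """Return the center frame index of each of num_samples equal segments."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    segment = total_frames / num_samples
    return [min(total_frames - 1, int(segment * i + segment / 2))
            for i in range(num_samples)]


FPS = 25  # assumed frame rate, not taken from the paper

short_clip = sample_frame_indices(30 * FPS)       # 30-second clip
movie = sample_frame_indices(2 * 60 * 60 * FPS)   # 2-hour movie

gap_short_s = (short_clip[1] - short_clip[0]) / FPS
gap_movie_min = (movie[1] - movie[0]) / FPS / 60
print(f"short clip: one frame every {gap_short_s:.1f} s")
print(f"movie: one frame every {gap_movie_min:.0f} min")
```

Under these assumptions a 30-second clip is sampled about every 3.8 seconds, while a two-hour movie is sampled once every 15 minutes, which leaves most of the content unseen.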

Failure Modes

Freezing the language model can weaken cross-modal alignment and hurt retrieval performance unless an extra retrieval head is added.

Auto-generated category labels used in initial selection are imperfect (~94% historic accuracy) and require careful manual verification.

Core Entities

Models

mPLUG-video, mPLUG-2, ALPRO, Bloomz, TimeSformer, CLIP (Chinese)

Metrics

Accuracy, CIDEr, BLEU-4, METEOR, ROUGE, Recall@k (R@1, R@5, R@10)
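Of these metrics, Recall@k is the one used for retrieval: it is the fraction of queries whose correct match appears among the top k ranked candidates. The sketch below is a minimal illustration that assumes a square similarity matrix in which the correct candidate for query i sits at index i; the function name and the toy scores are made up for the example.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-video scores for three queries and three candidates.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate ranked second
    [0.5, 0.6, 0.7],  # query 2: correct candidate ranked first
]

r_at_1 = recall_at_k(sim, 1)
r_at_2 = recall_at_k(sim, 2)
print(f"R@1={r_at_1:.3f}  R@2={r_at_2:.3f}")  # prints "R@1=0.667  R@2=1.000"
```

For benchmark runs, the same function can be applied in both directions (text-to-video and video-to-text) by transposing the similarity matrix.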

Datasets

Youku-mPLUG, WebVid10M, HowTo100M, ALIVOL-10M, Kwai-SVC-11M, CREATE-10M, CNVid-3.5M

Benchmarks

Youku-mPLUG benchmark (category, retrieval, caption), MSRVTT, VATEX, CREATE-210K