Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.
Summary TLDR
This paper releases Youku-mPLUG: a public Chinese video-language corpus of 10 million video-text pairs filtered from 400M raw videos, plus a human-annotated benchmark of roughly 0.3–0.37M clips covering retrieval, captioning, and category classification. The authors also release models (ALPRO, mPLUG-2) pre-trained on the dataset and propose mPLUG-video, a modular decoder-only model that pairs a frozen LLM with a trainable video encoder and visual abstractor. Pretraining on Youku-mPLUG gives large gains (up to 23.1% relative improvement in top-1 accuracy on category classification). mPLUG-video (2.7B) reaches 80.57% top-1 accuracy and 68.9 CIDEr on the provided benchmarks. The dataset, code, and models are available on GitHub.
Problem Statement
The Chinese video-language community lacks a large, public, high-quality dataset and shared benchmarks. Existing large corpora are mostly English or proprietary, which slows model development and prevents fair comparisons for Chinese video-language models.
Main Contribution
A public Chinese video-language pre-training dataset Youku-mPLUG with 10 million high-quality video-text pairs filtered from 400M raw videos.
A human-annotated downstream benchmark (≈0.3–0.37M clips) covering video-text retrieval, video captioning, and video category classification.
Release of pre-trained models (ALPRO, mPLUG-2) and a new modular decoder-only model mPLUG-video that uses a frozen LLM and small trainable modules.
Demonstration that pretraining on Youku-mPLUG yields substantial gains (e.g., up to 23.1% relative in category top-1) and state-of-the-art benchmark results.
Key Findings
Pretraining on Youku-mPLUG substantially improves category classification (up to 23.1% relative gain in top-1 accuracy).
mPLUG-video (2.7B) reaches top-1 80.57% on category classification.
mPLUG-video (2.7B) reaches 68.9 CIDEr on video captioning.
Freezing the language model hurts retrieval performance.
Dataset scale and filtering: 10M pairs filtered from 400M raw videos with safety checks.
The modular frozen-LLM approach keeps the trainable parameter count small (1.7% of parameters reported).
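The headline gain is quoted as a relative improvement, which is easy to confuse with an absolute gain in percentage points. The sketch below shows the difference. The function names and the example scores (60.0 and 66.0) are illustrative assumptions, not figures from the paper.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


def absolute_gain_points(baseline: float, improved: float) -> float:
    """Improvement expressed in raw score points."""
    return improved - baseline


# Made-up scores for illustration only (NOT results from the paper).
baseline_top1 = 60.0
pretrained_top1 = 66.0
rel = relative_gain_percent(baseline_top1, pretrained_top1)
pts = absolute_gain_points(baseline_top1, pretrained_top1)
print(f"relative gain: {rel:.1f}%, absolute gain: {pts:.1f} points")
```

When comparing against other papers, check which of the two conventions each one reports before lining the numbers up.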
Results
Video Category Classification Top-1
Video Captioning CIDEr
Video Retrieval R@1 (V2T)
Effect of Youku-mPLUG pretraining on category Top-1
Who Should Care
What To Try In 7 Days
Download the dataset and benchmark subset from the repo and run a quick eval on a public checkpoint.
Fine-tune the released mPLUG-video checkpoint on your domain-specific labels using the frozen-LLM setup to save compute.
Evaluate retrieval vs generation trade-offs: test adding a contrastive head if retrieval matters for your app.
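One common way to realize a contrastive head is a symmetric InfoNCE loss over paired video and text embeddings. The sketch below is a minimal standard-library version of that general technique, not the authors' implementation. The function names and the temperature of 0.07 are assumptions.

```python
import math


def _mean_nll_of_matches(logits):
    """Mean negative log-likelihood of the diagonal (matched) entry per row."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        # Numerically stable log-sum-exp over the row.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: pair i of video_emb is assumed to match pair i of text_emb."""
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity scaled by the temperature (an assumed value).
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_mean_nll_of_matches(logits) + _mean_nll_of_matches(logits_t))


# Toy embeddings: aligned pairs should give a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

In a real experiment the embeddings would come from pooled outputs of the video encoder and a text encoder, and the loss would be computed with a tensor library rather than Python lists.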
Agent Features
Frameworks
- TimeSformer
- CLIP (Chinese)
- Bloomz
Architectures
- decoder-only
- modularized (frozen LLM + trainable encoder/abstractor)
Optimization Features
Token Efficiency
- Reduce video sequence length with M learnable tokens
Model Optimization
- Keep the large LLM frozen and train only the small modules (1.7% of parameters reported as trainable)
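To make the trainable-fraction idea concrete, the sketch below computes the share of parameters that receive gradients when some modules are frozen. The `Module` class, the module names, and all parameter counts are hypothetical placeholders, not the paper's actual sizes.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(modules):
    """Share of all parameters that belong to trainable modules."""
    total = sum(m.num_params for m in modules)
    if total == 0:
        raise ValueError("model has no parameters")
    trainable = sum(m.num_params for m in modules if m.trainable)
    return trainable / total


# Placeholder sizes chosen for readability (NOT taken from the paper).
model = [
    Module("frozen_llm", 1_000_000_000, trainable=False),
    Module("video_encoder", 10_000_000, trainable=True),
    Module("visual_abstractor", 10_000_000, trainable=True),
]
fraction = trainable_fraction(model)
print(f"trainable share: {fraction:.2%}")
```

Running the same calculation on a real checkpoint is a quick sanity check that the LLM weights are actually frozen before a fine-tuning job starts.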
System Optimization
- Use frozen LLM to lower fine-tuning compute
Training Optimization
- Sparse frame sampling (8 frames per clip)
- Batch size 512, 10 pretraining epochs
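Sparse frame sampling can be sketched as picking a fixed number of frames spread across the clip. The version below takes the middle frame of each of 8 equal segments. The source only states that 8 frames are used per clip, so the uniform-segment strategy and the function name are assumptions.

```python
def sample_frame_indices(num_frames_in_video, num_samples=8):
    """Return `num_samples` frame indices spread evenly over the clip.

    The clip is split into equal segments and the middle frame of each
    segment is selected. Short clips yield repeated indices.
    """
    if num_frames_in_video <= 0 or num_samples <= 0:
        raise ValueError("frame counts must be positive")
    segment = num_frames_in_video / num_samples
    indices = [int(segment * i + segment / 2) for i in range(num_samples)]
    # Clamp so that every index stays inside the clip.
    return [min(idx, num_frames_in_video - 1) for idx in indices]


# A 240-frame clip (for example, 8 seconds at 30 fps) reduced to 8 frames.
print(sample_frame_indices(240))
```

Training pipelines often add random jitter inside each segment as augmentation; this deterministic variant is the kind typically used at evaluation time.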
Inference Optimization
- Visual abstractor reduces video token length via learnable queries
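The core of a learnable-query abstractor is cross-attention: M query vectors attend over N video tokens and produce M outputs, so the LLM sees M tokens instead of N. The sketch below shows single-head cross-attention only. The sizes (N=64, M=4, dimension 8) are assumptions, random vectors stand in for learned queries, and the projection matrices, multiple heads, and feed-forward layers of a real abstractor are omitted.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def abstract_tokens(video_tokens, queries):
    """Cross-attend each query over all video tokens; returns one vector per query."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every video token.
        scores = [scale * sum(a * b for a, b in zip(q, tok)) for tok in video_tokens]
        weights = softmax(scores)
        # Each output is an attention-weighted average of the video tokens.
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, video_tokens)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
N, M, DIM = 64, 4, 8  # hypothetical sizes, not from the paper
video_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]  # learned in practice
compressed = abstract_tokens(video_tokens, queries)
print(len(video_tokens), "->", len(compressed), "tokens")
```

The output length depends only on M, so the cost of the LLM's attention over visual tokens stays fixed regardless of how many frames or patches the encoder produces.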
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Data reflects language and concepts available at collection time and may miss future terms or new visuals.
- Content skews to Chinese Internet culture and may not generalize cross-culturally.
- Dataset excludes very long videos and long-form text, limiting use for full-length movies or long transcripts.
When Not To Use
- When you need up-to-date cultural or temporal facts not present at collection time.
- For very long-video understanding tasks (full movies, long transcripts).
- If you require unfiltered raw web data (the corpus is safety-filtered).
Failure Modes
- Freezing the language model reduces cross-modal alignment and retrieval performance.
- Auto-generated category labels used in initial selection are imperfect (~94% historical accuracy) and require careful manual verification.
- Cultural and time biases in training data may cause model blind spots on novel or non-Chinese content.
Core Entities
Models
- mPLUG-video
- mPLUG-2
- ALPRO
- Bloomz
- TimeSformer
- CLIP (Chinese)
Metrics
- Accuracy
- CIDEr
- BLEU-4
- METEOR
- ROUGE
- Recall@k (R@1,R@5,R@10)
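For reference, Recall@k for retrieval can be computed from a query-by-candidate similarity matrix as sketched below. The function name and the convention that query i matches candidate i are assumptions made for illustration. The benchmark's own evaluation script may differ in details such as tie handling.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match ranks within the top k candidates.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: number of candidates scoring strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 matrix: only query 0 ranks its match first; the others rank it second.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.6],
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_2 = recall_at_k(toy_similarity, 2)
print(r_at_1, r_at_2)
```

For video-to-text (V2T) retrieval the rows are videos and the columns are captions; transposing the matrix gives the text-to-video direction.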
Datasets
- Youku-mPLUG
- WebVid10M
- HowTo100M
- ALIVOL-10M
- Kwai-SVC-11M
- CREATE-10M
- CNVid-3.5M
Benchmarks
- Youku-mPLUG benchmark (category, retrieval, caption)
- MSRVTT
- VATEX
- CREATE-210K

