Overview
The dataset release, model checkpoints, and clear benchmarks make this work actionable for teams building Chinese video-language features. Freezing the large LLM reduces tuning cost but can hurt retrieval unless an extra retrieval head is added.
Citations: 4
Evidence Strength: 0.80
Confidence: 0.90
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.
Summary TLDR
This paper releases Youku-mPLUG: a public Chinese video-language corpus of 10 million filtered video–text pairs (selected from 400M raw videos), plus a human-annotated benchmark of roughly 0.3–0.37M clips covering retrieval, captioning, and category classification. The authors also release models (ALPRO, mPLUG-2) pre-trained on the dataset and propose mPLUG-video, a modular decoder-only model that pairs a frozen LLM with a trainable video encoder and abstractor. Pretraining on Youku-mPLUG gives large gains (up to +23.1% relative top-1 on category classification). On the provided benchmarks, mPLUG-video (2.7B) reaches 80.57% top-1 accuracy on category classification and 68.9 CIDEr on captioning. The dataset, code, and models are available on GitHub.
Problem Statement
The Chinese video-language community lacks a large, public, high-quality dataset and shared benchmarks. Existing large corpora are mostly English or proprietary, which slows model development and prevents fair comparisons for Chinese video-language models.
Main Contribution
A public Chinese video-language pre-training dataset Youku-mPLUG with 10 million high-quality video-text pairs filtered from 400M raw videos.
A human-annotated downstream benchmark (≈0.3–0.37M clips) covering video-text retrieval, video captioning, and video category classification.
Key Findings
Pretraining on Youku-mPLUG substantially improves category classification (up to +23.1% relative top-1).
mPLUG-video (2.7B) reaches top-1 80.57% on category classification.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Video Category Classification Top-1 | 80.57% (mPLUG-video 2.7B) | 78.15% (ALPRO) | +2.42 pts abs | Youku-mPLUG test | Table 4 reports mPLUG-video (2.7B) Top-1 80.57% and ALPRO 78.15% | Table 4 |
| Video Captioning CIDEr | 68.9 (mPLUG-video 2.7B) | 67.7 (mPLUG-2) | +1.2 CIDEr | Youku-mPLUG caption test | Table 4 shows CIDEr 68.9 vs 67.7 | Table 4 |
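
The deltas in the table above are absolute differences in the metric's own units, which is easy to confuse with the relative gains quoted elsewhere in this summary. A minimal sketch of both calculations, using only the values from the table (the helper names `abs_delta` and `rel_delta_pct` are illustrative, not from the paper or its repo):

```python
def abs_delta(value, baseline):
    """Absolute difference, in the metric's own units (points)."""
    return value - baseline


def rel_delta_pct(value, baseline):
    """Relative change over the baseline, expressed in percent."""
    return 100.0 * (value - baseline) / baseline


# (metric, model value, baseline value), copied from the Results table.
rows = [
    ("Category classification top-1", 80.57, 78.15),
    ("Captioning CIDEr", 68.9, 67.7),
]

for name, value, baseline in rows:
    print(
        f"{name}: {abs_delta(value, baseline):+.2f} abs, "
        f"{rel_delta_pct(value, baseline):+.2f}% rel"
    )
```

Reporting both forms side by side avoids reading an absolute gain of a few points as a relative one, or the reverse.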
What To Try In 7 Days
Download the dataset and benchmark subset from the repo and run a quick eval on a public checkpoint.
Fine-tune the released mPLUG-video checkpoint on your domain-specific labels using the frozen-LLM setup to save compute.
Evaluate retrieval vs generation trade-offs: test adding a contrastive head if retrieval matters for your app.
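For the last item, one common way to add a contrastive head is a symmetric InfoNCE-style loss over paired video and text embeddings. The sketch below is a generic, dependency-free illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Mean of video-to-text and text-to-video cross-entropy.

    Row i of video_embs and row i of text_embs are assumed to be a matched
    pair; every other row in the batch acts as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Temperature-scaled cosine similarity between every video and every text.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy check: matched pairs should score a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_info_nce(videos, matched_texts))
print(symmetric_info_nce(videos, swapped_texts))
```

In practice the embeddings would come from small trainable projection layers on top of the video and text features, and retrieval metrics such as recall@K would be compared with and without the extra loss term.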
Risks & Boundaries
Limitations
Data reflects language and concepts available at collection time and may miss future terms or new visuals.
Content skews to Chinese Internet culture and may not generalize cross-culturally.
When Not To Use
When you need up-to-date cultural or temporal facts not present at collection time.
For very long-video understanding tasks (full movies, long transcripts).
Failure Modes
Freezing the language model reduces cross-modal alignment and retrieval performance.
Auto-generated category labels used in initial selection are imperfect (~94% historical accuracy) and require careful manual verification.

