Youku-mPLUG: 10M filtered Chinese video-text pairs plus human benchmarks and models

Overview

Decision Snapshot: Ready For Pilot

The dataset release, model checkpoints, and clear benchmarks make this work actionable for teams building Chinese video-language features. Freezing the large LLM reduces tuning cost but can hurt retrieval unless extra heads are added.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang

Why It Matters For Business

Youku-mPLUG provides a large, safety-filtered Chinese video-text corpus and benchmarks so teams can train or fine-tune Chinese multimodal models faster and compare results fairly.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper releases Youku-mPLUG: a public Chinese video-language corpus of 10 million video–text pairs filtered from 400M raw videos, plus a human-annotated benchmark of roughly 0.3–0.37M clips covering retrieval, captioning, and category classification. The authors also release models (ALPRO, mPLUG-2) pre-trained on the dataset and propose mPLUG-video, a modular decoder-only model that pairs a frozen LLM with a trainable video encoder and visual abstractor. Pretraining on Youku-mPLUG gives large gains: top-1 accuracy on category classification rises from 63.51% to 78.15%, a 23.1% relative improvement. mPLUG-video (2.7B) reaches 80.57% top-1 and 68.9 CIDEr on the provided benchmarks. The dataset, code, and models are available on GitHub.

Problem Statement

The Chinese video-language community lacks a large, public, high-quality dataset and shared benchmarks. Existing large corpora are mostly English or proprietary, which slows model development and prevents fair comparisons for Chinese video-language models.

Main Contribution

A public Chinese video-language pre-training dataset Youku-mPLUG with 10 million high-quality video-text pairs filtered from 400M raw videos.

A human-annotated downstream benchmark (≈0.3–0.37M clips) covering video-text retrieval, video captioning, and video category classification.

Key Findings

Pretraining on Youku-mPLUG substantially improves category classification.

Numbers: Top-1: 63.51% -> 78.15% (+23.1% relative)

Practical Use: If you pretrain on this 10M Chinese dataset, expect large accuracy gains for video category tasks compared to no pretraining.

Evidence Ref: Table 6; Abstract

mPLUG-video (2.7B) reaches top-1 80.57% on category classification.

Numbers: Top-1: 80.57% (Youku-mPLUG test)

Practical Use: Use mPLUG-video as a strong baseline for Chinese video category classification on this benchmark.

Evidence Ref: Table 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Video Category Classification Top-1	80.57% (mPLUG-video 2.7B)	78.15% (ALPRO)	+2.42% abs	Youku-mPLUG test	Table 4 reports mPLUG-video (2.7B) Top-1 80.57% and ALPRO 78.15%	Table 4
Video Captioning CIDEr	68.9 (mPLUG-video 2.7B)	67.7 (mPLUG-2)	+1.2 CIDEr	Youku-mPLUG caption test	Table 4 shows CIDEr 68.9 vs 67.7	Table 4
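
The Delta column above uses absolute differences, while Key Findings quotes a relative gain. A minimal sketch of how the two relate, using only the numbers quoted in this summary; the helper names are illustrative and do not come from the paper or its repo:

```python
def abs_delta(new: float, base: float) -> float:
    """Absolute difference in the metric's own units (points for accuracy)."""
    return new - base


def rel_delta(new: float, base: float) -> float:
    """Relative change, expressed as a percentage of the baseline."""
    return 100.0 * (new - base) / base


# Effect of pretraining on category classification (63.51% -> 78.15% top-1).
pretrain_gain_abs = abs_delta(78.15, 63.51)  # about 14.64 points
pretrain_gain_rel = rel_delta(78.15, 63.51)  # about 23.05, quoted as +23.1% relative

# mPLUG-video (2.7B) vs the ALPRO baseline in the Results table.
model_gain_abs = abs_delta(80.57, 78.15)     # about 2.42 points

# Captioning: mPLUG-video (2.7B) vs mPLUG-2.
cider_gain = abs_delta(68.9, 67.7)           # about 1.2 CIDEr
```

Reading "+2.42% abs" as 2.42 percentage points rather than a 2.42% relative change keeps it comparable with the other rows.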

What To Try In 7 Days

Download the dataset and benchmark subset from the repo and run a quick eval on a public checkpoint.

Fine-tune the released mPLUG-video checkpoint on your domain-specific labels using the frozen-LLM setup to save compute.

Evaluate retrieval vs generation trade-offs: test adding a contrastive head if retrieval matters for your app.
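
One way to picture the contrastive head mentioned above is a symmetric InfoNCE loss over paired video and text embeddings. The sketch below is an assumption about what such a head could optimize, not the paper's implementation; the function names and the temperature of 0.07 are illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: video->text and text->video, averaged.

    video_emb[i] and text_emb[i] are assumed to be a matching pair;
    every other item in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # sim[i][j] = cosine(video_i, text_j) / temperature
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text->video direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy check: aligned pairs should score a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In a real setup the embeddings would come from small trainable projection layers on top of the video and text features, so the LLM itself could stay frozen.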

Agent Features

Frameworks

TimeSformer, CLIP (Chinese), Bloomz

Architectures

decoder-only, modularized (frozen LLM + trainable encoder/abstractor)

Optimization Features

Token Efficiency

Reduce video sequence length with M learnable tokens

Model Optimization

Keep large LLM frozen and only train small modules (1.7% params reported)

System Optimization

Use frozen LLM to lower fine-tuning compute

Training Optimization

Sparse frame sampling (8 frames per clip)

Batch size 512, 10 pretraining epochs
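
Sparse frame sampling can be sketched as picking a fixed number of evenly spread frame indices per clip. This summary does not state the exact sampling rule, so the segment-midpoint strategy and the function name below are assumptions for illustration.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int = 8) -> List[int]:
    """Return one frame index from the middle of each equal-length segment.

    The clip is split into `num_samples` segments; short clips simply
    repeat indices instead of failing.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    seg = num_frames / num_samples
    return [min(num_frames - 1, int(seg * (i + 0.5)))
            for i in range(num_samples)]


indices = sample_frame_indices(80)       # 8 indices spread over an 80-frame clip
short_clip = sample_frame_indices(4)     # fewer frames than samples: repeats
```

Random sampling within each segment is a common training-time variant; the deterministic midpoint version is easier to reproduce at evaluation time.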

Inference Optimization

Visual abstractor reduces video token length via learnable queries
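
The idea of compressing video tokens with learnable queries can be illustrated with a single cross-attention step: M query vectors attend over N video tokens and produce M outputs, so the LLM sees M tokens instead of N. The sketch below is a simplified assumption, not the released model: it omits learned projections, multiple heads, and feed-forward layers, and all sizes are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def abstract_tokens(queries, video_tokens):
    """Single-head cross-attention without projections.

    queries: M x D vectors (these would be trained parameters in a real model).
    video_tokens: N x D features from the video encoder.
    Returns an M x D list, one output per query.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in video_tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[j] for w, tok in zip(weights, video_tokens))
                    for j in range(d)])
    return out


random.seed(0)
D, N, M = 16, 8 * 49, 32  # hypothetical: 8 frames x 49 patches, 32 queries
video_tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
compressed = abstract_tokens(queries, video_tokens)  # 392 tokens -> 32 tokens
```

Because the output length depends only on M, the cost on the LLM side stays fixed even if more frames or patches are fed to the encoder.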

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/X-PLUG/Youku-mPLUG

Data URLs

https://github.com/X-PLUG/Youku-mPLUG

Risks & Boundaries

Limitations

Data reflects language and concepts available at collection time and may miss future terms or new visuals.

Content skews to Chinese Internet culture and may not generalize cross-culturally.

When Not To Use

When you need up-to-date cultural or temporal facts not present at collection time.

For very long-video understanding tasks (full movies, long transcripts).

Failure Modes

Freezing the language model reduces cross-modal alignment and retrieval performance.

Auto-generated category labels used in initial selection are imperfect (~94% historical accuracy) and require careful manual verification.

Core Entities

Models

mPLUG-video, mPLUG-2, ALPRO, Bloomz, TimeSformer, CLIP (Chinese)

Metrics

Accuracy, CIDEr, BLEU-4, METEOR, ROUGE, Recall@k (R@1, R@5, R@10)
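
For readers less familiar with Recall@k, here is a minimal sketch of how it is typically computed for retrieval. It assumes one ground-truth candidate per query, paired by index, and counts ties in the query's favor; the benchmark's official evaluation script may differ.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of candidates scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy text-to-video similarity matrix with 3 queries and 3 candidates.
sim = [
    [0.9, 0.1, 0.2],  # correct candidate ranked first
    [0.8, 0.3, 0.1],  # correct candidate ranked second
    [0.2, 0.4, 0.3],  # correct candidate ranked second
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

Video-to-text recall can be computed the same way on the transposed matrix.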

Datasets

Youku-mPLUG, WebVid10M, HowTo100M, ALIVOL-10M, Kwai-SVC-11M, CREATE-10M, CNVid-3.5M

Benchmarks

Youku-mPLUG benchmark (category, retrieval, caption), MSRVTT, VATEX, CREATE-210K
