Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
SynPO cuts training time by about 20% by dropping the reference model, and it improves caption quality and preference metrics, so teams fine-tuning multimodal LMs can get higher-quality outputs faster without collecting extra human preference labels.
Summary TLDR
This paper presents SynPO, a preference-optimization method and an automatic pipeline for building preference pairs for video detailed captioning. The data pipeline uses a single vision-language model (VLM) with contrastive decoding and self-retrospective sampling, then scores candidates with an LLM on factuality, fluency, and self-consistency. SynPO modifies DPO to (1) prevent negative preferences from dominating training, (2) add a token-level language-preservation term, and (3) drop the reference model for ~20% faster training. SynPO improves video-caption quality (CIDEr +4.1 on VATEX for AuroraCap) and yields consistent gains on NLP preference benchmarks.
Problem Statement
Fine-grained video captioning needs temporally coherent, detailed descriptions, but existing datasets lack preference pairs, and existing preference training (DPO) can degrade language quality when the push to suppress rejected responses dominates the update. This paper aims to build scalable preference data and redesign DPO to keep generative quality while improving preference alignment.
Main Contribution
Automated pipeline that builds high-quality preference pairs from one VLM using contrastive decoding, self-retrospective sampling, and LLM scoring over three criteria.
SynPO: a new preference-optimization objective that (a) rebalances positive/negative signals to avoid negative-dominant updates, (b) adds an explicit token-level language reward, and (c) removes the reference model to speed training.
Extensive experiments showing SynPO outperforms DPO variants on video-captioning and NLP preference benchmarks, and reduces training time by about 20%.
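To make point (c) concrete, here is a minimal sketch contrasting the standard DPO loss with a generic reference-free margin loss. SynPO's exact objective is not given in this summary, so `reference_free_loss` is an illustrative stand-in (a length-normalized margin), not the paper's formula; the function names and `beta` defaults are assumptions. The point it shows is structural: DPO needs two extra log-probabilities from a frozen reference model for every pair, while a reference-free loss needs none.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are summed sequence log-probabilities. The two ref_* values
    come from a frozen reference model, i.e. an extra forward pass per batch.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def reference_free_loss(logp_w, logp_l, len_w, len_l, beta=2.0):
    """Illustrative reference-free loss (assumption, NOT the SynPO objective).

    Uses the length-normalized log-probability gap between chosen and
    rejected responses, so no reference model is evaluated.
    """
    margin = logp_w / len_w - logp_l / len_l
    return -log_sigmoid(beta * margin)
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2; the reference-free variant depends only on the policy's own log-probabilities.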
Key Findings
SynPO improves video-caption metrics across models and datasets compared to DPO and SFT baselines.
Training with SynPO is faster because it removes the reference model.
Using contrastive decoding + self-retrospective sampling raises caption quality vs. baseline inference.
SynPO yields consistent gains on general NLP preference benchmarks and leaderboards.
Results
VATEX CIDEr (AuroraCap): +4.1
MSR-VTT CIDEr (AuroraCap)
VDD LLM-based score (AuroraCap)
Training efficiency: about 20% faster training (no reference model)
AlpacaEval2 length-controlled (Mistral-7B-Base)
Who Should Care
What To Try In 7 Days
Generate candidate captions with contrastive decoding + one self-refine pass to build cheap preference pairs.
Fine-tune an SFT model with SynPO (LoRA hooks) instead of DPO to reduce training time and maintain language fluency.
Use an off-the-shelf LLM (e.g., Qwen-Plus) to score factuality/fluency/self-consistency and validate preference labels on a small set.
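The pair-building step in the list above could be sketched as follows. This assumes each candidate caption already has judge scores for the three criteria; the 1-10 scale, the unweighted mean, the `min_margin` threshold, and all names here are hypothetical choices for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

CRITERIA = ("factuality", "fluency", "self_consistency")


@dataclass
class Candidate:
    text: str
    scores: dict  # criterion -> judge score (assumed 1-10 scale)


def overall(c: Candidate) -> float:
    """Unweighted mean over the three criteria (an assumed aggregation)."""
    return mean(c.scores[k] for k in CRITERIA)


def build_pair(candidates, min_margin=1.0):
    """Return (chosen, rejected) texts, or None if the score gap is too small."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=overall, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if overall(best) - overall(worst) < min_margin:
        return None  # ambiguous pair: skip rather than train on noise
    return best.text, worst.text


# Toy example with made-up captions and scores.
cands = [
    Candidate("A man slices an onion, then drops it into a pan.",
              {"factuality": 9, "fluency": 8, "self_consistency": 9}),
    Candidate("A man cooks while a dog jumps on the counter.",
              {"factuality": 4, "fluency": 8, "self_consistency": 5}),
    Candidate("Someone is preparing food in a kitchen.",
              {"factuality": 7, "fluency": 7, "self_consistency": 7}),
]
pair = build_pair(cands)
```

Dropping pairs whose scores are close is a design choice: a small gap is more likely to reflect judge noise than a real quality difference.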
Agent Features
Tool Use
- LLM scoring for preference labels
- LoRA
Optimization Features
Model Optimization
- preference-aligned fine-tuning (SynPO objective)
System Optimization
- removing the reference model yields ~20% faster training
Training Optimization
- reference-model-free optimization (saves runtime)
- LoRA
Inference Optimization
- contrastive decoding to reduce hallucination
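As a rough illustration of contrastive decoding, the sketch below implements one step of the generic text-only formulation: keep only tokens the stronger ("expert") distribution finds plausible, then pick the token whose log-probability rises most relative to a weaker ("amateur") distribution. How the paper builds its contrast distribution from a single VLM is not described in this summary, so the two toy distributions and the `alpha` cutoff are assumptions.

```python
import math


def contrastive_step(expert_logp, amateur_logp, alpha=0.1):
    """Choose the next token by contrastive decoding with a plausibility cutoff.

    expert_logp / amateur_logp: dict mapping token -> log-probability.
    Tokens below alpha * (max expert probability) are never considered.
    """
    cutoff = math.log(alpha) + max(expert_logp.values())
    plausible = [t for t, lp in expert_logp.items() if lp >= cutoff]
    return max(plausible, key=lambda t: expert_logp[t] - amateur_logp[t])


# Toy distributions (made-up numbers).
expert = {t: math.log(p) for t, p in
          {"dog": 0.5, "cat": 0.3, "the": 0.19, "unicorn": 0.01}.items()}
amateur = {t: math.log(p) for t, p in
           {"dog": 0.5, "cat": 0.1, "the": 0.3999, "unicorn": 0.0001}.items()}

choice = contrastive_step(expert, amateur)           # "cat"
no_guard = contrastive_step(expert, amateur, 0.001)  # "unicorn"
```

Greedy decoding on the expert alone would pick "dog"; the contrast favors "cat", which the expert supports far more than the amateur does. With the cutoff nearly disabled, the very unlikely "unicorn" wins on ratio alone, which is why the plausibility constraint matters.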
Reproducibility
Data Urls
- ShareGPT4Video, VATEX, MSR-VTT (cited public datasets)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Preference labels depend on an LLM judge (Qwen-Plus); judge bias or mistakes can propagate into training.
- Contrastive decoding and self-retrospective sampling increase inference cost (self-retrospective ≈2×; contrastive +50–75%).
- Hyperparameters (α, β) and learning rate remain sensitive; improper settings can still degrade language capability.
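For budgeting, the inference overheads quoted above can be turned into a back-of-the-envelope estimate. The summary does not say how the two overheads combine when both techniques are used, so multiplying them here is an assumption, as are the function name and the unit (one unit equals one baseline caption generation).

```python
def pair_construction_cost(n_videos, use_contrastive=True, use_self_retro=True,
                           contrastive_overhead=0.5, self_retro_factor=2.0):
    """Estimated generation cost in units of baseline caption generations.

    contrastive_overhead: extra fraction for contrastive decoding (0.5-0.75).
    self_retro_factor: multiplier for self-retrospective sampling (about 2).
    ASSUMPTION: the two overheads compose multiplicatively.
    """
    cost = float(n_videos)
    if use_contrastive:
        cost *= 1.0 + contrastive_overhead
    if use_self_retro:
        cost *= self_retro_factor
    return cost


low = pair_construction_cost(1000, contrastive_overhead=0.5)
high = pair_construction_cost(1000, contrastive_overhead=0.75)
```

Under these assumptions, 1,000 videos cost roughly 3,000 to 3,500 baseline generations, before counting the LLM judge calls.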
When Not To Use
- If you lack an LLM scorer or cannot afford the extra inference cost to build preference pairs.
- When you need extremely low latency or very small hosted models and cannot afford the LoRA or inference overhead.
Failure Modes
- If LLM scoring is biased, SynPO will optimize toward that judge's preferences, possibly reducing real-world fidelity.
- Using high learning rates or wrong α/β may still let negative preferences dominate or collapse language quality.
- Excessive sampling/self-retrospective steps without cost control can make data construction impractical.
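One inexpensive guard against the judge-bias failure mode is to compare the LLM judge's choices with a small human-labeled set before training. The sketch below computes a plain agreement rate; the function name, the label encoding, and any acceptance threshold you apply to the result are assumptions, not part of the paper.

```python
def judge_agreement(judge_prefs, human_prefs):
    """Fraction of pairs where the judge and the human pick the same caption.

    Both arguments are equal-length sequences of labels, e.g. "a" or "b",
    naming the preferred caption in each pair.
    """
    if len(judge_prefs) != len(human_prefs) or not judge_prefs:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)


# Toy labels (made up): the judge disagrees with the human on one of four pairs.
rate = judge_agreement(["a", "b", "a", "a"], ["a", "b", "b", "a"])  # 0.75
```

A low rate on even a few dozen pairs is a signal to revise the judge prompt or criteria before scaling up data construction.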
Core Entities
Models
- SynPO
- DPO
- DPOP
- SimPO
- IPO
- KTO
- CPO
- AuroraCap
- LLaVA1.6-7B-video
- InternVL2-8B
- Llama3-8B
- Mistral-7B
Metrics
- CIDEr
- METEOR
- VDD LLM-based score
- AlpacaEval2 LC / win-rate
- MT-Bench (GPT-4 score)
Datasets
- ShareGPT4Video
- Panda-70M
- Charades
- MSR-VTT
- VATEX
- VDC
- VDD
- UltraFeedback
- UltraChat-200k
Benchmarks
- VDC
- VDD
- VATEX
- MSR-VTT
- AlpacaEval2
- MT-Bench
- HuggingFace Open LLM Leaderboard

