SynPO: Balance preference learning and language quality for fine-grained video captions

June 1, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu

Links

Abstract / PDF

Why It Matters For Business

SynPO cuts training time by about 20% relative to standard DPO and improves both caption metrics and preference-benchmark scores, so teams fine-tuning multimodal LMs can get higher-quality outputs faster without extra label collection.

Summary TLDR

This paper presents SynPO, a preference-optimization method, together with an automatic pipeline for building preference pairs for detailed video captioning. The data pipeline uses a single vision-language model (VLM) with contrastive decoding and self-retrospective sampling, then scores candidates with an LLM on factuality, fluency, and self-consistency. SynPO modifies DPO to (1) prevent negative preferences from dominating training, (2) add a token-level language-preservation term, and (3) drop the reference model for ~20% faster training. SynPO improves video-caption quality (CIDEr +4.1 on VATEX for AuroraCap) and yields consistent gains on NLP preference benchmarks.

Problem Statement

Fine-grained video captioning needs temporally coherent, detailed descriptions, but datasets lack preference pairs and existing preference training (DPO) can drift away from language quality by overly suppressing negative preferences. This paper aims to build scalable preference data and redesign DPO to keep generative quality while improving preference alignment.

Main Contribution

Automated pipeline that builds high-quality preference pairs from one VLM using contrastive decoding, self-retrospective sampling, and LLM scoring over three criteria.

SynPO: a new preference-optimization objective that (a) rebalances positive/negative signals to avoid negative-dominant updates, (b) adds an explicit token-level language reward, and (c) removes the reference model to speed training.

Extensive experiments showing SynPO outperforms DPO variants on video-captioning and NLP preference benchmarks, and reduces training time by about 20%.
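For context, the sketch below puts the standard DPO loss for one preference pair next to a generic reference-free pairwise loss. It is illustrative only: the second function is not the SynPO objective (this summary does not give its formula), and the function names and the default `beta` are assumptions. The mechanical point is that DPO needs log-probabilities from a frozen reference model for every pair, while a reference-free loss skips that extra forward pass, which is one plausible source of a training-time saving.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    pol_* come from the policy being trained; ref_* come from a frozen
    reference model, which costs an extra forward pass per sequence.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def reference_free_loss(pol_chosen, pol_rejected, beta=0.1):
    """Generic reference-free pairwise loss (hypothetical, NOT SynPO's objective).

    Only the policy's own log-probabilities are needed.
    """
    return -log_sigmoid(beta * (pol_chosen - pol_rejected))

# Example: the policy already prefers the chosen caption more than the reference does.
loss_with_ref = dpo_loss(-10.0, -20.0, -12.0, -15.0)
loss_without_ref = reference_free_loss(-10.0, -20.0)
```

Both losses shrink as the policy assigns more probability to the chosen caption relative to the rejected one; they differ in whether that margin is measured against a reference model.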

Key Findings

SynPO improves video-caption metrics across models and datasets compared to DPO and SFT baselines.

Numbers: VATEX CIDEr 38.4 -> 42.5 (AuroraCap)

Training with SynPO is faster because it removes the reference model.

Numbers: ≈20% training speedup vs. standard DPO

Using contrastive decoding + self-retrospective sampling raises caption quality vs. baseline inference.

Numbers: Accuracy +6.7%; Richness +5.5% (combined gains reported)

SynPO yields consistent gains on general NLP preference benchmarks and leaderboards.

Numbers: AlpacaEval2 LC, DPO 15.1% -> SynPO 22.9% (Mistral-7B-Base)

Results

VATEX CIDEr (AuroraCap)

Value: 42.5

Baseline: 38.4 (AuroraCap base)

MSR-VTT CIDEr (AuroraCap)

Value: 35.4

Baseline: 33.2 (AuroraCap base)

VDD LLM-based score (AuroraCap)

Value: 2.43

Baseline: 2.00 (AuroraCap base)

Training efficiency

Value: ≈20% faster

Baseline: standard DPO

AlpacaEval2 length-controlled (Mistral-7B-Base)

Value: 22.9%

Baseline: 15.1% (DPO)
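To compare these results on a common footing, the short sketch below computes absolute and relative gains from the reported value/baseline pairs. The numbers are the ones listed above; the `gain` helper and the dictionary layout are illustrative assumptions. For AlpacaEval2 the absolute gain is in percentage points.

```python
# (baseline, value) pairs copied from the Results entries above.
results = {
    "VATEX CIDEr (AuroraCap)": (38.4, 42.5),
    "MSR-VTT CIDEr (AuroraCap)": (33.2, 35.4),
    "VDD LLM-based score (AuroraCap)": (2.00, 2.43),
    "AlpacaEval2 LC % (Mistral-7B-Base)": (15.1, 22.9),
}

def gain(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent of the baseline)."""
    absolute = value - baseline
    return round(absolute, 2), round(100 * absolute / baseline, 1)

for name, (baseline, value) in results.items():
    absolute, relative = gain(baseline, value)
    print(f"{name}: +{absolute} absolute, +{relative}% relative")
```

The VATEX entry reproduces the +4.1 CIDEr figure quoted in the summary; relative gains differ widely across metrics because the baselines sit on different scales.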

Who Should Care

What To Try In 7 Days

Generate candidate captions with contrastive decoding + one self-refine pass to build cheap preference pairs.

Fine-tune an SFT model with SynPO (LoRA hooks) instead of DPO to reduce training time and maintain language fluency.

Use an off-the-shelf LLM (e.g., Qwen-Plus) to score factuality/fluency/self-consistency and validate preference labels on a small set.
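The pair-building step can be prototyped in a few lines once each candidate caption has judge scores. The sketch below is a minimal illustration under stated assumptions: the `Candidate` class, the unweighted mean in `total`, and the `min_gap` filter are hypothetical choices, since this summary does not say how the three criteria are combined or thresholded. The scores themselves would come from your LLM judge.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    caption: str
    factuality: float
    fluency: float
    self_consistency: float

    def total(self) -> float:
        # Unweighted mean of the three criteria (an assumption, not from the paper).
        return (self.factuality + self.fluency + self.self_consistency) / 3

def build_pair(candidates, min_gap=0.5):
    """Return (chosen, rejected) captions, or None when the pair is not informative.

    The highest-scoring candidate becomes "chosen" and the lowest "rejected".
    Pairs whose score gap is below min_gap are dropped as too ambiguous.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.total(), reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.total() - worst.total() < min_gap:
        return None
    return best.caption, worst.caption

# Example with made-up judge scores on a 1-5 scale.
candidates = [
    Candidate("A man slices an onion, then adds it to a pan.", 5, 4, 5),
    Candidate("A man cooks while a dog runs on the beach.", 2, 4, 3),
]
pair = build_pair(candidates)
```

Filtering out near-ties is a cheap way to spot-check label quality on a small set before scaling up data construction.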

Agent Features

Tool Use

  • LLM scoring for preference labels
  • LoRA

Optimization Features

Model Optimization

  • preference-aligned fine-tuning (SynPO objective)

System Optimization

  • removing the reference model yields ~20% faster training

Training Optimization

  • reference-model-free optimization (saves runtime)
  • LoRA

Inference Optimization

  • contrastive decoding to reduce hallucination
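As a rough illustration of contrastive decoding, the sketch below picks the next token by the gap between an "expert" and an "amateur" log-probability, restricted to tokens the expert itself finds plausible. This is the generic text-generation form of the technique; how the paper derives its two distributions from a single VLM is not described in this summary, so the function names, the `alpha` cutoff, and the toy logits are all assumptions. One common setup, offered only as an example, contrasts the same model on clean versus degraded input.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_next_token(expert_logits, amateur_logits, alpha=0.1):
    """Return the index of the token chosen by one contrastive decoding step.

    A token is plausible if its expert probability is at least alpha times the
    expert's top probability. Among plausible tokens, the one with the largest
    expert-minus-amateur log-probability wins.
    """
    expert_lp = log_softmax(expert_logits)
    amateur_lp = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(expert_lp)
    best_token, best_score = -1, -math.inf
    for token, (e, a) in enumerate(zip(expert_lp, amateur_lp)):
        if e < cutoff:
            continue  # implausible under the expert; never selected
        score = e - a
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Toy 3-token vocabulary: the expert slightly prefers token 0, but the amateur
# also likes token 0, so the contrast favors token 1.
choice = contrastive_next_token([2.0, 1.8, -5.0], [2.0, 0.0, 3.0])
```

The plausibility cutoff matters: without it, a token that both distributions consider very unlikely can still win on the difference alone.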

Reproducibility

Data Urls

  • ShareGPT4Video, VATEX, MSR-VTT (cited public datasets)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Preference labels depend on an LLM judge (Qwen-Plus); judge bias or mistakes can propagate into training.
  • Contrastive decoding and self-retrospective sampling increase inference cost (self-retrospective ≈2×; contrastive +50–75%).
  • Hyperparameters (α, β) and learning rate remain sensitive; improper settings can still degrade language capability.

When Not To Use

  • If you lack an LLM scorer or cannot afford the extra inference cost to build preference pairs.
  • When extremely low latency or very small hosted models are required and the LoRA or inference overhead is infeasible.

Failure Modes

  • If LLM scoring is biased, SynPO will optimize toward that judge's preferences, possibly reducing real-world fidelity.
  • Using high learning rates or wrong α/β may still let negative preferences dominate or collapse language quality.
  • Excessive sampling/self-retrospective steps without cost control can make data construction impractical.

Core Entities

Models

  • SynPO
  • DPO
  • DPOP
  • SimPO
  • IPO
  • KTO
  • CPO
  • AuroraCap
  • LLaVA1.6-7B-video
  • InternVL2-8B
  • Llama3-8B
  • Mistral-7B

Metrics

  • CIDEr
  • METEOR
  • VDD LLM-based score
  • AlpacaEval2 LC / win-rate
  • MT-Bench (GPT-4 score)

Datasets

  • ShareGPT4Video
  • Panda-70M
  • Charades
  • MSR-VTT
  • VATEX
  • VDC
  • VDD
  • UltraFeedback
  • UltraChat-200k

Benchmarks

  • VDC
  • VDD
  • VATEX
  • MSR-VTT
  • AlpacaEval2
  • MT-Bench
  • HuggingFace Open LLM Leaderboard