SynPO: Balance preference learning and language quality for fine-grained video captions

June 1, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The method shows consistent empirical gains across several models and datasets and includes ablations. Code is released, but reliance on an LLM scorer and the extra inference compute needed to build preference pairs are practical caveats.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Jisheng Dang, Yizhou Zhang, Hao Ye, Teng Wang, Siming Chen, Huicheng Zheng, Yulan Guo, Jianhuang Lai, Bin Hu

Why It Matters For Business

SynPO trains about 20% faster than standard DPO because it drops the reference model, and it yields measurably better captions and preference metrics. Teams fine-tuning multimodal LMs can therefore get higher-quality outputs faster without extra label collection.

Who Should Care

Summary TLDR

This paper presents SynPO, a preference-optimization method and an automatic pipeline for building preference pairs for video detailed captioning. The data pipeline uses a single vision-language model (VLM) with contrastive decoding and self-retrospective sampling, then scores candidates with an LLM on factuality, fluency, and self-consistency. SynPO modifies DPO to (1) prevent negative preferences from dominating training, (2) add a token-level language-preservation term, and (3) drop the reference model for ~20% faster training. SynPO improves video-caption quality (CIDEr +4.1 on VATEX for AuroraCap) and yields consistent gains on NLP preference benchmarks.

Problem Statement

Fine-grained video captioning needs temporally coherent, detailed descriptions, but datasets lack preference pairs and existing preference training (DPO) can drift away from language quality by overly suppressing negative preferences. This paper aims to build scalable preference data and redesign DPO to keep generative quality while improving preference alignment.
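To make the drift concrete: the standard DPO loss depends only on the gap between the chosen and rejected log-ratios, so it can fall even while the chosen caption itself becomes less likely. The sketch below uses made-up sequence log-probabilities; every number is an illustrative assumption, not a value from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Illustrative numbers only: the reference model gives both captions log-prob -50.
ref_c, ref_r = -50.0, -50.0

before = dpo_loss(-50.0, -50.0, ref_c, ref_r)
# After some updates the chosen caption is LESS likely (-50 -> -55),
# but the rejected caption dropped much more (-50 -> -80).
after = dpo_loss(-55.0, -80.0, ref_c, ref_r)

print(f"loss before: {before:.4f}, after: {after:.4f}")
```

The loss goes down although the preferred caption lost probability mass, which is the negative-dominant behavior the paper sets out to prevent.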

Main Contribution

Automated pipeline that builds high-quality preference pairs from one VLM using contrastive decoding, self-retrospective sampling, and LLM scoring over three criteria.

SynPO: a new preference-optimization objective that (a) rebalances positive/negative signals to avoid negative-dominant updates, (b) adds an explicit token-level language reward, and (c) removes the reference model to speed training.
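The exact SynPO formula is not reproduced in this summary, so the sketch below should not be read as the paper's objective. It swaps in two well-known ingredients that mirror the three listed changes: a SimPO-style length-normalized margin (which needs no reference model) plus a token-level negative log-likelihood term on the chosen caption. The function name and the `beta`, `gamma`, and `alpha` parameters are assumptions for illustration and are not necessarily the paper's α/β.

```python
import math

def reference_free_pref_loss(chosen_token_logps, rejected_token_logps,
                             beta=2.0, gamma=0.5, alpha=1.0):
    """Illustrative loss, NOT the SynPO objective from the paper.

    Combines a SimPO-style length-normalized margin (no reference model)
    with a token-level NLL term that keeps the chosen caption likely.
    """
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)

    # Preference term: -log sigmoid(beta * (avg_chosen - avg_rejected) - gamma)
    z = beta * (avg_chosen - avg_rejected) - gamma
    pref_term = math.log1p(math.exp(-z))

    # Language-preservation term: mean negative log-likelihood of chosen tokens
    nll_term = -avg_chosen

    return pref_term + alpha * nll_term

# Hypothetical per-token log-probabilities for two captions
chosen = [-0.2, -0.5, -0.3, -0.4]
rejected = [-1.0, -1.5, -0.8]
loss = reference_free_pref_loss(chosen, rejected)
print(f"loss = {loss:.4f}")
```

Because the NLL term grows whenever the chosen caption's tokens become less likely, this kind of loss cannot be driven down purely by suppressing the rejected caption.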

Key Findings

SynPO improves video-caption metrics across models and datasets compared to DPO and SFT baselines.

Numbers: VATEX CIDEr: 38.4 -> 42.5 (AuroraCap)

Practical Use: Use SynPO to get visibly better caption quality on standard video benchmarks when fine-tuning VLM+LLM systems.

Evidence Ref: Table 3 (AuroraCap VATEX CIDEr)

Training with SynPO is faster because it removes the reference model.

Numbers: ≈20% training speedup vs. standard DPO

Practical Use: Expect roughly 20% faster training, and correspondingly lower compute cost, when replacing DPO with SynPO in similar setups.

Evidence Ref: Abstract; Sec 4.3; Implementation details

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
VATEX CIDEr (AuroraCap) | 42.5 | 38.4 (AuroraCap base) | +4.1 | VATEX | Table 3: AuroraCap base 38.4 -> SynPO 42.5 | Table 3
MSR-VTT CIDEr (AuroraCap) | 35.4 | 33.2 (AuroraCap base) | +2.2 | MSR-VTT | Table 3: AuroraCap base 33.2 -> SynPO 35.4 | Table 3

What To Try In 7 Days

Generate candidate captions with contrastive decoding + one self-refine pass to build cheap preference pairs.

Fine-tune an SFT model with SynPO (LoRA hooks) instead of DPO to reduce training time and maintain language fluency.

Use an off-the-shelf LLM (e.g., Qwen-Plus) to score factuality/fluency/self-consistency and validate preference labels on a small set.
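The scoring and pairing step from the list above could look roughly like this. `score_caption` is a hypothetical stand-in for an LLM judge call (no real API is used), the equal weighting of the three criteria and the `min_gap` filter are assumptions, and the dummy scores exist only so the example runs.

```python
from dataclasses import dataclass

CRITERIA = ("factuality", "fluency", "self_consistency")

@dataclass
class PreferencePair:
    chosen: str
    rejected: str
    gap: float

def build_pair(candidates, score_caption, min_gap=1.0):
    """Rank candidate captions by summed judge scores; return best vs. worst.

    Returns None when the score gap is too small to trust the label.
    """
    totals = {c: sum(score_caption(c)[k] for k in CRITERIA) for c in candidates}
    ranked = sorted(candidates, key=totals.get, reverse=True)
    best, worst = ranked[0], ranked[-1]
    gap = totals[best] - totals[worst]
    if gap < min_gap:
        return None
    return PreferencePair(chosen=best, rejected=worst, gap=gap)

# Dummy judge for demonstration; a real setup would query an LLM here.
fake_scores = {
    "A man slices an onion, then adds it to a hot pan.":
        {"factuality": 9, "fluency": 8, "self_consistency": 9},
    "A man cooks. A dog is cooking the onion.":
        {"factuality": 3, "fluency": 6, "self_consistency": 2},
}
pair = build_pair(list(fake_scores), fake_scores.__getitem__)
print(pair)
```

Spot-checking a small sample of the resulting pairs by hand is a cheap way to validate the judge before training on its labels.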

Agent Features

Tool Use
LLM scoring for preference labels; LoRA

Optimization Features

Model Optimization
preference-aligned fine-tuning (SynPO objective)
System Optimization
removing ref model yields ~20% faster training
Training Optimization
reference-model-free optimization (saves runtime); LoRA
Inference Optimization
contrastive decoding to reduce hallucination
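This summary does not say which contrastive decoding variant the paper uses. As a rough illustration, the sketch below shows one common formulation from visual contrastive decoding: next-token logits computed from a degraded visual input are subtracted from logits computed from the full input, so tokens that rely only on language priors are down-weighted. The vocabulary, logits, and `alpha` value are invented for the example.

```python
import math

def contrastive_logits(logits_full, logits_degraded, alpha=1.0):
    """Amplify what the full visual input supports beyond the degraded input."""
    return [(1 + alpha) * f - alpha * d for f, d in zip(logits_full, logits_degraded)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["dog", "cat", "onion"]
# Hypothetical next-token logits. "dog" is a language-prior guess that stays
# strong even when the video is degraded; "onion" depends on actually seeing it.
full = [2.0, 0.5, 2.2]
degraded = [2.1, 0.6, 0.3]

plain = softmax(full)
contrast = softmax(contrastive_logits(full, degraded))
print("plain   :", dict(zip(vocab, (round(p, 3) for p in plain))))
print("contrast:", dict(zip(vocab, (round(p, 3) for p in contrast))))
```

Each decoding step needs a second forward pass on the degraded input, which is where the extra inference cost of this technique comes from.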

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

Sharegpt4video (cited public dataset); VATEX; MSR-VTT (cited public datasets)

Risks & Boundaries

Limitations

Preference labels depend on an LLM judge (Qwen-Plus); judge bias or mistakes can propagate into training.

Contrastive decoding and self-retrospective sampling increase inference cost (self-retrospective ≈2×; contrastive +50–75%).
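For budgeting, the quoted overheads translate into absolute cost as follows. The 100 GPU-hour baseline is a hypothetical figure, and the two techniques are estimated separately because this summary does not state how their overheads combine when both are used.

```python
BASE_GPU_HOURS = 100.0  # hypothetical cost of plain decoding over the caption pool

# (low, high) cost multipliers relative to plain decoding
overheads = {
    "self-retrospective sampling": (2.0, 2.0),   # about 2x
    "contrastive decoding": (1.50, 1.75),        # +50% to +75%
}

estimates = {
    name: (BASE_GPU_HOURS * lo, BASE_GPU_HOURS * hi)
    for name, (lo, hi) in overheads.items()
}
for name, (lo, hi) in estimates.items():
    print(f"{name}: {lo:.0f} to {hi:.0f} GPU-hours")
```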

When Not To Use

If you lack an LLM scorer or cannot afford the extra inference cost to build preference pairs.

When you need extremely low latency or very small hosted models and cannot afford the LoRA or inference overhead.

Failure Modes

If LLM scoring is biased, SynPO will optimize toward that judge's preferences, possibly reducing real-world fidelity.

Using high learning rates or wrong α/β may still let negative preferences dominate or collapse language quality.

Core Entities

Models

SynPO, DPO, DPOP, SimPO, IPO, KTO, CPO, AuroraCap, LLaVA1.6-7B-video, InternVL2-8B, Llama3-8B, Mistral-7B

Metrics

CIDEr, METEOR, VDD LLM-based score, AlpacaEval2 LC / win-rate, MT-Bench (GPT-4 score)

Datasets

Sharegpt4video, Pandas-70M, Charades, MSR-VTT, VATEX, VDC, VDD, UltraFeedback, UltraChat-200k

Benchmarks

VDC, VDD, VATEX, MSR-VTT, AlpacaEval2, MT-Bench, HuggingFace Open LLM Leaderboard