BLPO: jointly optimize judge and caption prompts to better align multimodal LLM judges with human image judgments

Overview

Decision Snapshot: Needs Validation

The method shows consistent gains across four public datasets and three judge models, with clear ablations and stable optimization behavior, but code and larger-scale studies are not provided.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 65%

Authors

Bo Pan, Xuan Kan, Kaitai Zhang, Yan Yan, Shunwen Tan, Zihao He, Zixin Ding, Junjie Wu, Liang Zhao

Why It Matters For Business

BLPO reduces the need to fine-tune judges for each image task by improving automated evaluation with prompt updates; this lowers annotation and retraining cost while giving more human-aligned metrics for product testing.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

This paper introduces BLPO, a bi-level prompt optimization method for multimodal LLMs used as automated judges of images. BLPO jointly optimizes (1) the judge prompt that tells the model how to score and (2) an image-to-text (I2T) prompt that tells an MLLM how to verbalize images. Converting images to tailored text saves context budget and preserves evaluation-relevant visual cues. Experiments on four datasets and three judge backbones show BLPO improves alignment with human labels, converges within ~5 optimization rounds, and works best with 10–15 error examples per batch.

Problem Statement

Current automated judges struggle to match human image evaluations because multimodal models have limited visual-context capacity. Trial-and-error prompt search needs many error examples, but MLLMs cannot process many images at once. Naive captioning loses task-specific visual details. We need a method that preserves evaluation-relevant image cues while staying within context limits.

Main Contribution

Identify limited visual-context capacity as a bottleneck for prompt optimization of multimodal judges.

Propose BLPO: a bi-level framework that jointly optimizes the judge prompt and a learnable image-to-text (I2T) prompt.
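The joint update of the two prompts can be pictured with a small loop. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the callables `verbalize`, `judge`, and `propose` are hypothetical stand-ins for the image-to-text MLLM, the frozen judge, and the optimizer LLM, and the greedy accept-if-not-worse rule is an illustrative choice. The toy stubs at the bottom only exist to make the sketch runnable.

```python
import random

def accuracy(judge_prompt, i2t_prompt, data, verbalize, judge):
    """Fraction of examples where the judge's label matches the human label."""
    hits = sum(judge(judge_prompt, verbalize(i2t_prompt, img)) == y for img, y in data)
    return hits / len(data)

def bilevel_optimize(judge_prompt, i2t_prompt, data, verbalize, judge, propose,
                     rounds=5, batch_size=10, seed=0):
    """Alternate updates of the I2T prompt and the judge prompt (illustrative only)."""
    rng = random.Random(seed)
    best = accuracy(judge_prompt, i2t_prompt, data, verbalize, judge)
    for _ in range(rounds):
        # Collect misjudged examples as text, so many fit in the optimizer's context.
        errors = []
        for img, y in data:
            caption = verbalize(i2t_prompt, img)
            pred = judge(judge_prompt, caption)
            if pred != y:
                errors.append((caption, pred, y))
        if not errors:
            break
        batch = rng.sample(errors, min(batch_size, len(errors)))

        # Step 1: propose a new image-to-text prompt with the judge prompt fixed.
        # Ties are accepted so a caption change that only pays off after the
        # judge prompt changes is not discarded (an assumption of this sketch).
        cand = propose("i2t", i2t_prompt, judge_prompt, batch)
        score = accuracy(judge_prompt, cand, data, verbalize, judge)
        if score >= best:
            i2t_prompt, best = cand, score

        # Step 2: propose a new judge prompt with the I2T prompt fixed.
        cand = propose("judge", judge_prompt, i2t_prompt, batch)
        score = accuracy(cand, i2t_prompt, data, verbalize, judge)
        if score >= best:
            judge_prompt, best = cand, score
    return judge_prompt, i2t_prompt, best

# --- Toy stubs (hypothetical; real use would call actual models) ---
def toy_verbalize(i2t_prompt, img):
    return img if "objects" in i2t_prompt else "an image"

def toy_judge(judge_prompt, caption):
    return 1 if "weapon" in judge_prompt and caption in {"knife", "gun"} else 0

def toy_propose(target, current, other, batch):
    return current + (" List objects." if target == "i2t" else " Flag any weapon.")

toy_data = [("knife", 1), ("flower", 0), ("gun", 1), ("cat", 0)]
final_judge, final_i2t, final_acc = bilevel_optimize(
    "Is it unsafe?", "Describe.", toy_data, toy_verbalize, toy_judge, toy_propose)
```

In the toy setup neither prompt change helps on its own: the judge can only flag weapons once the captions name the objects, which is the kind of coupling that motivates optimizing both prompts together.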

Key Findings

BLPO improves UnsafeBench F1 over the second-best method by ~8% (relative)

Numbers: UnsafeBench: BLPO F1=0.89 vs next best ≈0.82 (Table 1)

Practical Use: Expect noticeably better safety-classification alignment when optimizing prompts with BLPO rather than existing APO baselines.

Evidence Ref: Table 1
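The "~8%" figure is a relative gain, not an absolute one. A quick check with the two F1 values quoted above (the baseline is itself approximate, so the result is too):

```python
def deltas(new, baseline):
    """Return (absolute, relative) improvement of `new` over `baseline`."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative

abs_d, rel_d = deltas(0.89, 0.82)
print(f"absolute: {abs_d:+.2f} F1, relative: {rel_d:+.1%}")
# prints: absolute: +0.07 F1, relative: +8.5%
```

In other words, the gain is about 7 F1 points, which corresponds to roughly 8 to 9 percent relative to the baseline.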

BLPO converges within about 5 outer optimization rounds

Numbers: Optimization stabilizes by 5 rounds (Implementation + Fig.4)

Practical Use: You can run only ~5 iterations and get most gains, saving compute and latency.

Evidence Ref: Section 4.1.3, Fig.4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
UnsafeBench F1 (example)	BLPO 0.89 ±0.02	best baseline ≈0.82	≈ +8% (relative)	UnsafeBench	Table 1 main results	Table 1
Convergence rounds	Converges within ~5 outer iterations	N/A	—	All evaluated datasets	Implementation details and Fig.4 show stabilization at five iterations	Section 4.1.3, Fig.4

What To Try In 7 Days

Run BLPO on an existing MLLM judge: use GPT-o3 as the optimizer and your current judge as the frozen model.

Use ~10 error examples per optimization step and run 3–5 outer iterations to see quick gains.

Make the image-to-text prompt learnable instead of fixed captions to capture task-specific visual cues. Test against a fixed-caption baseline for comparison.
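For the fixed-caption comparison, one option is to score both variants with macro F1 (one of the metrics listed for this paper) on the same labeled set. The sketch below is a plain-Python macro F1 with made-up toy labels; the variable names and the label lists are illustrative assumptions, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy labels for illustration only (1 = unsafe, 0 = safe).
human         = [1, 1, 0, 0, 1, 0]
fixed_caption = [1, 0, 0, 0, 0, 0]   # judge fed generic captions
learned_i2t   = [1, 1, 0, 0, 1, 1]   # judge fed task-tailored verbalizations

print("fixed caption:", round(macro_f1(human, fixed_caption), 3))
print("learned I2T:  ", round(macro_f1(human, learned_i2t), 3))
```

Scoring both variants on examples that were not used as error batches during optimization gives a fairer read, since the error set used for updates is small.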

Agent Features

Tool Use

LLM-as-optimizer (GPT-o3)

Optimization Features

Token Efficiency

Reduces visual tokens by verbalizing images

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Relies on a strong optimizer LLM (GPT-o3) as a black box, which adds cost and an external dependency.

Experiments use small sampled splits (many datasets downsampled), limiting claims about large-scale generalization.

When Not To Use

When you can afford to fine-tune a dedicated multimodal critic on large human-labeled data.

When required judgments depend on pixel-level differences that text cannot capture.

Failure Modes

Overfitting to the small error set used for updates, reducing generalization.

Optimizer LLM may introduce subtle instruction shifts that bias judgments.

Core Entities

Models

Llama-4-Scout-17B-16E-instruct, Llama-4-Maverick-17B-128E-instruct, Qwen2.5-VL-32B-instruct, GPT-o3 (optimizer LLM)

Metrics

F1, Accuracy, Macro F1

Datasets

AGIN, SeeTRUE, ImageReward, UnsafeBench

Benchmarks

ImageReward, SeeTRUE, AGIN, UnsafeBench

