BLPO: jointly optimize judge and caption prompts to better align multimodal LLM judges with human image judgments

February 11, 2026 | 7 min read

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Bo Pan, Xuan Kan, Kaitai Zhang, Yan Yan, Shunwen Tan, Zihao He, Zixin Ding, Junjie Wu, Liang Zhao

Why It Matters For Business

BLPO reduces the need to fine-tune judges for each image task by improving automated evaluation with prompt updates; this lowers annotation and retraining cost while giving more human-aligned metrics for product testing.

Summary TLDR

This paper introduces BLPO, a bi-level prompt optimization method for multimodal LLMs used as automated judges of images. BLPO jointly optimizes (1) the judge prompt that tells the model how to score and (2) an image-to-text (I2T) prompt that tells an MLLM how to verbalize images. Converting images to tailored text saves context budget and preserves evaluation-relevant visual cues. Experiments on four datasets and three judge backbones show BLPO improves alignment with human labels, converges within ~5 optimization rounds, and works best with 10–15 error examples per batch.

Problem Statement

Current automated judges struggle to match human image evaluations because multimodal models have limited visual-context capacity. Trial-and-error prompt search needs many error examples, but MLLMs cannot process many images at once. Naive captioning loses task-specific visual details. We need a method that preserves evaluation-relevant image cues while staying within context limits.

Main Contribution

Identify limited visual-context capacity as a bottleneck for prompt optimization of multimodal judges.

Propose BLPO: a bi-level framework that jointly optimizes the judge prompt and a learnable image-to-text (I2T) prompt.

Show BLPO improves alignment with human labels across four datasets and three MLLM judge backbones and converges in a few rounds.
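
A minimal sketch of how such a bi-level loop could be organized. The summary does not spell out the update rules, so everything here is an assumption: `caption_fn`, `judge_fn`, and `propose_fn` are hypothetical callables standing in for the captioning MLLM, the frozen judge, and the optimizer LLM. Alternating the two prompt updates and keeping a candidate only when it reduces the error count is an illustrative design choice, not the paper's confirmed procedure.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Example = Tuple[str, int]      # (image reference, human label)
Error = Tuple[str, int, int]   # (caption, human label, judge prediction)


def collect_errors(examples: List[Example], i2t_prompt: str, judge_prompt: str,
                   caption_fn: Callable[[str, str], str],
                   judge_fn: Callable[[str, str], int]) -> List[Error]:
    """Verbalize each image, judge the caption, and return the disagreements."""
    errors = []
    for image, label in examples:
        caption = caption_fn(i2t_prompt, image)   # image -> task-specific text
        pred = judge_fn(judge_prompt, caption)    # frozen judge scores the text
        if pred != label:
            errors.append((caption, label, pred))
    return errors


def bilevel_optimize(examples: List[Example], i2t_prompt: str, judge_prompt: str,
                     caption_fn: Callable[[str, str], str],
                     judge_fn: Callable[[str, str], int],
                     propose_fn: Callable[[str, str, List[Error]], str],
                     rounds: int = 5, batch_size: int = 10):
    """Alternate I2T-prompt and judge-prompt updates for a few rounds."""
    errors = collect_errors(examples, i2t_prompt, judge_prompt, caption_fn, judge_fn)
    for _ in range(rounds):
        if not errors:
            break
        # Step 1 (assumed): revise the I2T prompt with the judge prompt fixed.
        cand = propose_fn("i2t", i2t_prompt, errors[:batch_size])
        cand_errors = collect_errors(examples, cand, judge_prompt, caption_fn, judge_fn)
        if len(cand_errors) < len(errors):
            i2t_prompt, errors = cand, cand_errors
        if not errors:
            break
        # Step 2 (assumed): revise the judge prompt with the I2T prompt fixed.
        cand = propose_fn("judge", judge_prompt, errors[:batch_size])
        cand_errors = collect_errors(examples, i2t_prompt, cand, caption_fn, judge_fn)
        if len(cand_errors) < len(errors):
            judge_prompt, errors = cand, cand_errors
    return i2t_prompt, judge_prompt, len(errors)
```

The error batches passed to `propose_fn` contain captions rather than images, which is what lets many error examples fit into one optimizer call.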

Key Findings

BLPO improves UnsafeBench F1 over the second-best method by about 0.07 absolute (roughly 8% relative)

Numbers: UnsafeBench BLPO F1 = 0.89 vs. next best ≈ 0.82 (Table 1)

BLPO converges within about 5 outer optimization rounds

Numbers: optimization stabilizes by 5 rounds (Implementation + Fig. 4)

Best batch size is moderate: 10–15 error examples

Numbers: performance peaks near batch size 10–15, then declines (Fig. 4a, d)

Adaptive I2T prompt helps: BLPO beats fixed and judge-based I2T variants

Numbers: ablation on Llama-4-Scout, UnsafeBench F1: fixed 0.73 → judge-based 0.78 → BLPO 0.81 (Table 2)

Results

UnsafeBench F1 (example)

Value: BLPO 0.89 ± 0.02 vs. next best ≈ 0.82

Baseline: best baseline ≈ 0.82

Convergence rounds

Value: converges within ~5 outer iterations

Baseline: N/A

Best batch size

Value: optimal near 10–15 error examples

Baseline: larger or smaller batches degrade performance

Ablation: fixed vs adaptive I2T

Value: fixed I2T F1 = 0.73 → judge-based 0.78 → BLPO 0.81 (UnsafeBench)

Baseline: fixed I2T
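
The reported gaps can be read either as absolute F1 differences or as relative gains, and the two are easy to confuse. The short check below uses only the F1 values quoted above; the helper name `gains` is made up for this illustration.

```python
from typing import Tuple


def gains(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of `new` over `old`."""
    absolute = new - old
    return absolute, absolute / old


# F1 values as quoted in this summary (UnsafeBench).
comparisons = [
    ("BLPO vs. next best", 0.89, 0.82),
    ("judge-based I2T vs. fixed I2T", 0.78, 0.73),
    ("BLPO vs. judge-based I2T", 0.81, 0.78),
]

for label, new, old in comparisons:
    absolute, relative = gains(new, old)
    print(f"{label}: +{absolute:.2f} absolute, +{relative:.1%} relative")
```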

Who Should Care

What To Try In 7 Days

Run BLPO on an existing MLLM judge: use GPT-o3 as the optimizer and keep your current judge frozen.

Use ~10 error examples per optimization step and run 3–5 outer iterations to see quick gains.

Make the image-to-text prompt learnable instead of fixed captions to capture task-specific visual cues. Test against a fixed-caption baseline for comparison.
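
The second step depends on having a pool of judge errors to draw from. Below is a minimal sketch of building one optimization batch, assuming each evaluation record is a dict with hypothetical `human_label` and `judge_label` keys (the record layout is not from the paper).

```python
import random
from typing import Dict, List


def sample_error_batch(records: List[Dict], batch_size: int = 10,
                       seed: int = 0) -> List[Dict]:
    """Sample up to `batch_size` records where the judge disagrees with the human label."""
    errors = [r for r in records if r["judge_label"] != r["human_label"]]
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(errors, min(batch_size, len(errors)))


if __name__ == "__main__":
    # Toy records: the judge is wrong on ids 10..24.
    records = [
        {"id": i, "human_label": 1, "judge_label": 1 if i < 10 else 0}
        for i in range(25)
    ]
    batch = sample_error_batch(records, batch_size=10)
    print([r["id"] for r in batch])
```

Running the same sampler for the fixed-caption baseline and the learnable I2T prompt keeps the comparison between the two on equal footing.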

Agent Features

Tool Use

  • LLM-as-optimizer (GPT-o3)

Optimization Features

Token Efficiency

  • reduces visual tokens by verbalizing images
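
A rough back-of-the-envelope sketch of why verbalizing images helps the optimizer see more error examples at once. All token counts below are made-up placeholders for illustration, not measurements from the paper or from any specific model.

```python
def examples_that_fit(context_budget: int, reserved_tokens: int,
                      tokens_per_example: int) -> int:
    """Number of error examples that fit after reserving room for instructions."""
    if tokens_per_example <= 0:
        raise ValueError("tokens_per_example must be positive")
    usable = max(context_budget - reserved_tokens, 0)
    return usable // tokens_per_example


# Placeholder numbers (assumptions, chosen only to show the arithmetic).
CONTEXT_BUDGET = 32_000     # total tokens available to the optimizer call
RESERVED = 2_000            # prompt instructions and expected response
TOKENS_PER_IMAGE = 1_500    # hypothetical cost of one raw image
TOKENS_PER_CAPTION = 150    # hypothetical cost of one verbalized image

print("raw images:", examples_that_fit(CONTEXT_BUDGET, RESERVED, TOKENS_PER_IMAGE))
print("captions:  ", examples_that_fit(CONTEXT_BUDGET, RESERVED, TOKENS_PER_CAPTION))
```

Substitute the real per-image and per-caption token costs of your own model to see whether a 10–15 example batch is feasible with raw images at all.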

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on a strong optimizer LLM (GPT-o3) used as a black box, which adds cost and an external dependency.
  • Experiments use small sampled splits (many datasets downsampled), limiting claims about large-scale generalization.
  • Verbalized captions may still miss pixel-level cues needed for some image tasks.
  • Method assumes frozen judge models; it does not replace full fine-tuning when that is feasible.

When Not To Use

  • When you can afford to fine-tune a dedicated multimodal critic on large human-labeled data.
  • When required judgments depend on pixel-level differences that text cannot capture.
  • If you cannot access or afford a capable optimizer LLM for prompt updates.

Failure Modes

  • Overfitting to the small error set used for updates, reducing generalization.
  • Optimizer LLM may introduce subtle instruction shifts that bias judgments.
  • I2T prompts might omit critical visual details, degrading judge accuracy.
  • Performance can drop when the batch size falls outside the 10–15 range, in either direction.
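
One common guard against the overfitting and instruction-drift failure modes listed above is to score every proposed prompt on held-out examples before accepting it. This is a generic safeguard sketched under assumptions, not something the summary attributes to BLPO; `score_fn` is a hypothetical callable that returns a metric such as F1 for a prompt on a dataset.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_holdout(items: Sequence, holdout_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle and split items into (optimization set, held-out set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accept_update(score_fn: Callable[[str, List], float], current_prompt: str,
                  candidate_prompt: str, holdout: List,
                  min_gain: float = 0.0) -> str:
    """Keep the candidate only if it beats the current prompt on held-out data."""
    current = score_fn(current_prompt, holdout)
    candidate = score_fn(candidate_prompt, holdout)
    return candidate_prompt if candidate > current + min_gain else current_prompt
```

Setting `min_gain` above zero makes the check stricter, which can help when the held-out set is small and scores are noisy.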

Core Entities

Models

  • Llama-4-Scout-17B-16E-instruct
  • Llama-4-Maverick-17B-128E-instruct
  • Qwen2.5-VL-32B-instruct
  • GPT-o3 (optimizer LLM)

Metrics

  • F1
  • Accuracy
  • Macro F1
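
For reference, the three metrics can be computed from label lists as follows. This is a plain-Python sketch using the standard definitions; how the paper assigns the positive class or averages across datasets is not stated in this summary.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                 positive: Hashable) -> float:
    """F1 = 2TP / (2TP + FP + FN), treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = set(y_true) | set(y_pred)
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    human = [1, 1, 0, 0]
    judge = [1, 0, 0, 0]
    print(accuracy(human, judge), f1_for_class(human, judge, 1), macro_f1(human, judge))
```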

Datasets

  • AGIN
  • SeeTRUE
  • ImageReward
  • UnsafeBench

Benchmarks

  • ImageReward
  • SeeTRUE
  • AGIN
  • UnsafeBench