Train on 1K text rationales to build a judge that scores images, audio, video and molecules zero-shot

May 24, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Why It Matters For Business

You can build practical, low-cost multimodal evaluators by fine-tuning a capable multimodal LLM on a small set (~1K) of high-quality text rationales instead of collecting large modality-specific annotation sets.

Summary TLDR

FLEX-Judge fine-tunes a multimodal LLM on a small (≈1K) curated set of text-only reasoning annotations. The model learns to produce structured explanations (<think> chains) before its verdict and transfers those decision rules to images, video, audio, and molecules without modality-specific training. On several benchmarks it matches or beats larger or modality-trained judges (e.g., 49.29 vs 49.2 for GPT-4o on GenAI-Bench with majority voting), and it drives practical tasks such as best-of-N selection and DPO-based fine-tuning in the molecular domain. The method is low-cost (about 1.5 hours of fine-tuning on 2 A6000 GPUs for a 7B model) but depends on a strong LLM backbone and careful data curation.
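To make the reasoning-first output concrete, here is a minimal Python sketch of how a caller might separate the reasoning chain from the verdict. Only the <think> tag comes from the summary above; the assumption that the verdict is whatever text follows the closing tag, and the names `JudgeOutput` and `parse_judge_output`, are illustrative and not taken from the paper.

```python
import re
from typing import NamedTuple, Optional


class JudgeOutput(NamedTuple):
    reasoning: str  # text inside the <think> block
    verdict: str    # text after the closing </think> tag (assumed layout)


# Assumed layout: "<think> ...reasoning... </think> final verdict text".
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_judge_output(text: str) -> Optional[JudgeOutput]:
    """Split a judge response into its reasoning chain and the text after it."""
    match = _THINK_RE.search(text)
    if match is None:
        return None  # no reasoning block found; the caller decides what to do
    reasoning = match.group(1).strip()
    verdict = text[match.end():].strip()
    return JudgeOutput(reasoning, verdict)


example = "<think>Response A matches the image; B invents a dog.</think> A"
parsed = parse_judge_output(example)
```

Returning `None` for a response without a reasoning block keeps malformed generations visible to the caller instead of silently treating them as verdicts.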

Problem Statement

High-quality human feedback is costly and multimodal preference datasets are scarce. Existing multimodal judge models need large modality-specific annotation sets. The paper asks whether a small corpus of high-quality textual reasoning explanations is enough to train a multimodal judge that generalizes across modalities and evaluation formats.

Main Contribution

Show that training a multimodal judge on ≈1K high-quality text reasoning annotations yields strong zero-shot multimodal evaluation.

Introduce FLEX-Judge: fine-tune MLLMs (Qwen2.5-VL/Omni) on reasoning-first outputs and support single-score, pairwise and batch ranking formats.

Demonstrate competitive or superior performance vs commercial APIs and large open-source multimodal judges across vision, audio, video and a molecular case study.

Show practical uses: best-of-N selection and producing DPO training triplets for molecular LLMs, improving downstream accuracy.
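As a rough illustration of the DPO use case, the sketch below turns judge scores into a (prompt, chosen, rejected) triplet. The selection rule (highest-scored candidate as chosen, lowest as rejected, skip when all scores tie) is an assumption made for this example, and `score_fn` is a hypothetical stand-in for a call to the judge; the paper's actual pairing procedure may differ.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class PreferenceTriplet(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def build_triplet(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],  # hypothetical judge: (prompt, response) -> score
) -> Optional[PreferenceTriplet]:
    """Pair the best- and worst-scored candidates; return None if no preference exists."""
    if len(candidates) < 2:
        return None
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] == worst[0]:
        return None  # the judge expressed no preference, so there is nothing to learn
    return PreferenceTriplet(prompt, best[1], worst[1])


# Toy judge for demonstration only: scores come from a fixed lookup table.
toy_scores = {"answer one": 3.0, "answer two": 8.0, "answer three": 5.0}
triplet = build_triplet("question", list(toy_scores), lambda p, r: toy_scores[r])
```

Skipping tied sets avoids feeding the preference optimizer pairs in which the judge saw no difference.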

Key Findings

Reasoning-first fine-tuning on ~1K text examples yields strong multimodal judges.

Numbers: 1K training samples vs 113K/150K used by LLaVA-Critic/Prometheus-Vision

FLEX-VL-7B with majority voting matches or slightly exceeds GPT-4o on GenAI-Bench overall.

Numbers: FLEX-VL-7B + majority voting overall 49.29 vs GPT-4o 49.2

FLEX-Omni-7B improves speech quality correlation versus training-free baselines.

Numbers: NISQA utterance LCC 0.545 (FLEX-Omni-7B) vs 0.408 (Gemini-2.0-Flash)

Using FLEX-Mol-LLaMA as a judge for reward-guided training yields strong molecular accuracy.

Numbers: Best-of-N accuracy up to 77.49% (N=16); DPO fine-tuning reaches 80.10% downstream accuracy
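Best-of-N selection, as referenced in the last finding, can be sketched in a few lines of Python: sample N candidates, score each with the judge, and keep the top one. Here `score_fn` is a hypothetical placeholder for the judge call and the toy scorer is invented for the demo; nothing below reproduces the paper's actual pipeline.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],  # hypothetical judge: (prompt, response) -> score
) -> str:
    """Return the candidate the judge scores highest (first one wins on ties)."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=lambda response: score_fn(prompt, response))


# Toy judge for demonstration only: rewards responses that mention "benzene".
def toy_judge(prompt: str, response: str) -> float:
    return 1.0 if "benzene" in response else 0.0


samples = ["it is an alkane", "it contains a benzene ring", "unknown"]
selected = best_of_n("Describe the molecule.", samples, toy_judge)
```

Because the judge is called once per candidate, inference cost grows linearly with N, which is worth weighing against the accuracy gain.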

Results

GenAI-Bench overall (majority voting)

Value: 49.29 (FLEX-VL-7B + majority voting)

Baseline: 49.2 (GPT-4o)

MLLM-as-a-Judge (pair w. tie) average

Value: 0.538 (FLEX-VL-7B)

Baseline: 0.717 (GPT-4V)

VL-RewardBench overall (macro/overall)

Value: 48.02 (FLEX-Omni-7B)

Baseline: 65.8 (GPT-4o reported)

Audio NISQA utterance-level LCC

Value: 0.545 (FLEX-Omni-7B)

Baseline: 0.408 (Gemini-2.0-Flash)

Molecular downstream accuracy (DPO fine-tuning)

Value: 80.10%

Baseline: previous state of the art (lower; varies by model)
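The audio result above is reported as an LCC, which is the Pearson correlation between judge scores and human ratings. The dependency-free Python sketch below computes it together with the Spearman rank correlation (SRCC), which is Pearson applied to average ranks. The function names are illustrative; a library such as SciPy would normally be used in practice.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation coefficient (LCC) between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

LCC rewards scores that track human ratings linearly, while SRCC only asks that the ordering agree, so a judge with a compressed score range can still do well on SRCC.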

Who Should Care

What To Try In 7 Days

Fine-tune an existing vision/audio-capable LLM on ~1K curated text reasoning examples and test zero-shot on one image or audio benchmark.

Add inference-time majority voting to the tuned judge and compare scores vs a baseline API on a small validation set.

Use the judge to rank N sampled outputs (best-of-N) for a domain task and measure downstream task improvement.
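For the majority-voting step suggested above, a minimal Python sketch follows. It assumes the judge is sampled several times and each sample is reduced to a discrete label such as "A", "B", or "tie"; the label set and the choice to return `None` when no single label leads are assumptions of this example, not details from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(verdicts: Sequence[str]) -> Optional[str]:
    """Return the most frequent verdict, or None when the top count is shared."""
    if not verdicts:
        raise ValueError("majority_vote needs at least one verdict")
    counts = Counter(verdicts)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    # Strictly speaking this is a plurality vote: the leader need not exceed 50%.
    return winners[0] if len(winners) == 1 else None


# Five hypothetical samples from the same judge on the same pair.
sampled = ["A", "B", "A", "tie", "A"]
final = majority_vote(sampled)
```

Using an odd number of samples reduces, but with three labels does not eliminate, the chance of an unresolved vote.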

Optimization Features

Infra Optimization

  • Short fine-tune (≈1.5 hours on 2 A6000 GPUs for 7B model)

System Optimization

  • Fine-tune LLM backbone only; reuse modality adapters
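One way to picture "fine-tune the backbone only" is as a partition of parameter names into trainable and frozen groups. The Python sketch below does that by name prefix. The prefixes `vision_encoder.` and `audio_encoder.` and the sample parameter names are hypothetical and will differ per model; in a framework such as PyTorch, the frozen group would typically have `requires_grad` set to `False`.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical module prefixes for the modality-specific parts of a model.
FROZEN_PREFIXES = ("vision_encoder.", "audio_encoder.")


def split_parameters(
    names: Iterable[str],
    frozen_prefixes: Sequence[str] = FROZEN_PREFIXES,
) -> Tuple[List[str], List[str]]:
    """Partition parameter names into (trainable, frozen) lists by prefix."""
    prefixes = tuple(frozen_prefixes)
    trainable: List[str] = []
    frozen: List[str] = []
    for name in names:
        if name.startswith(prefixes):
            frozen.append(name)  # modality encoder/adapter: reused as-is
        else:
            trainable.append(name)  # language backbone: updated during fine-tuning
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
param_names = [
    "vision_encoder.blocks.0.weight",
    "language_model.layers.0.weight",
    "audio_encoder.conv.bias",
    "lm_head.weight",
]
trainable, frozen = split_parameters(param_names)
```

Printing both lists before a run is a cheap way to confirm that no modality module was left trainable by accident.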

Training Optimization

  • Small-data fine-tuning (1K examples)
  • On-policy, low-temperature sample selection

Inference Optimization

  • Inference-time scaling: majority voting
  • Budget forcing / self-refinement

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Depends on a strong LLM backbone able to produce and consume structured reasoning; weak backbones fail (see 3D-LLM attempt).
  • Position bias: models prefer one response position and underuse the 'Tie' option unless mitigated by randomization.
  • Catastrophic forgetting risk when fine-tuning on too much text-only data; the paper limits training to ~1K examples to preserve multimodal abilities.
  • Not proven on every modality (e.g., 3D point clouds failed due to backbone limits).
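The randomization mentioned for position bias can be sketched as follows: shuffle which response is shown first, then map the judge's positional verdict back to the original labels. `pairwise_judge` is a hypothetical callable returning "first", "second", or "tie"; this is an illustrative Python sketch, not the paper's evaluation code.

```python
import random
from typing import Callable


def judge_with_random_order(
    prompt: str,
    response_a: str,
    response_b: str,
    pairwise_judge: Callable[[str, str, str], str],  # hypothetical: returns "first"/"second"/"tie"
    rng: random.Random,
) -> str:
    """Present the pair in random order and report the winner as "A", "B", or "tie"."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    verdict = pairwise_judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Position 1 holds A when not swapped and B when swapped.
    return "A" if picked_first != swapped else "B"


# Toy judge for demonstration only: prefers the longer response.
def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) > len(second) else "second"


rng = random.Random(0)
outcome = judge_with_random_order("q", "a long answer", "short", prefers_longer, rng)
```

Randomization does not remove the bias from any single judgment; it spreads it evenly across both responses so that aggregate win rates are not skewed toward one slot.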

When Not To Use

  • If your base MLLM lacks strong textual reasoning pretraining or has a small context window.
  • When modality-specific, high-quality labeled preference data already exist and are affordable.
  • For safety-critical audits where human raters are legally required.

Failure Modes

  • Judge follows length or position biases without mitigation, skewing rankings.
  • Overfitting to on-policy reasoning samples causes a drop in modality perception (catastrophic forgetting).
  • Reasoning-first training may still mis-evaluate highly domain-specific signals if the LLM lacks domain knowledge.
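A quick way to check for the length and position biases listed above is to log pairwise decisions and compute how often the first slot and the longer response win. The Python sketch below assumes a simple record type (`PairwiseRecord`) and uses character count as the length measure; both are illustrative choices, not part of the paper.

```python
from typing import Dict, NamedTuple, Sequence


class PairwiseRecord(NamedTuple):
    response_a: str  # response shown in the first position
    response_b: str  # response shown in the second position
    verdict: str     # "A", "B", or "tie"


def bias_report(records: Sequence[PairwiseRecord]) -> Dict[str, float]:
    """Summarize tie rate, first-position win rate, and longer-response win rate."""
    decided = [r for r in records if r.verdict in ("A", "B")]
    if not decided:
        raise ValueError("no decided (non-tie) records to analyze")
    first_wins = sum(r.verdict == "A" for r in decided)
    unequal = [r for r in decided if len(r.response_a) != len(r.response_b)]
    longer_wins = sum(
        (r.verdict == "A") == (len(r.response_a) > len(r.response_b)) for r in unequal
    )
    return {
        "tie_rate": 1 - len(decided) / len(records),
        "first_position_win_rate": first_wins / len(decided),
        "longer_response_win_rate": longer_wins / len(unequal) if unequal else float("nan"),
    }


# Hypothetical log of four judgments.
log = [
    PairwiseRecord("aaaa", "b", "A"),
    PairwiseRecord("a", "bbbb", "A"),
    PairwiseRecord("aa", "bbb", "B"),
    PairwiseRecord("x", "y", "tie"),
]
report = bias_report(log)
```

On a balanced evaluation set, rates far from 0.5 suggest the judge is keying on position or length rather than content, though a longer response can also be genuinely better.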

Core Entities

Models

  • FLEX-Omni-7B
  • FLEX-VL-7B
  • FLEX-Mol-LLaMA
  • Qwen2.5-VL-7B
  • Qwen2.5-Omni-7B
  • JudgeLRM-7B
  • Mol-LLaMA
  • GPT-4o
  • Gemini-1.5-Pro
  • LLaVA-Critic-7B
  • Prometheus-Vision-13B
  • Qwen2.5-VL-3B

Metrics

  • Pearson correlation
  • Accuracy
  • Normalized Levenshtein distance
  • Linear correlation coefficient (LCC)
  • Spearman rank correlation (SRCC)
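Of the metrics listed, normalized Levenshtein distance is the least self-explanatory, so here is a small Python sketch. It computes edit distance with the standard dynamic program and divides by the longer sequence length; that normalization is one common convention and is an assumption here, since this summary does not say which one the paper uses.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (item_a != item_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a: Sequence, b: Sequence) -> float:
    """Edit distance divided by the longer length (assumed convention); 0.0 means identical."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


# Works on rankings as well as strings, e.g. a predicted vs. a reference ordering.
distance = normalized_levenshtein(["A", "B", "C", "D"], ["A", "C", "B", "D"])
```

Because the functions accept any sequence, the same code compares ranked lists of items as easily as character strings.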

Datasets

  • JudgeLM-100K
  • MLLM-as-a-Judge
  • VL-RewardBench
  • MJ-Bench
  • GenAI-Bench
  • NISQA
  • BVCC
  • SOMOS
  • VoxSim
  • RLHF-V
  • JudgeAnything
  • MMRB

Benchmarks

  • MLLM-as-a-Judge
  • VL-RewardBench
  • MJ-Bench
  • GenAI-Bench
  • Audio MOS/SS (NISQA, BVCC, SOMOS, VoxSim)
  • MMRB
  • JudgeAnything