Train on 1K text rationales to build a judge that scores images, audio, video and molecules zero-shot

May 24, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The approach is practical: low-cost fine-tuning on 1K reasoning examples gave consistent gains across benchmarks and a real molecular use case, but success depends on a capable LLM backbone and careful data curation.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can build practical, low-cost multimodal evaluators by fine-tuning a capable multimodal LLM on a small set (~1K) of high-quality text rationales instead of collecting large modality-specific annotation sets.

Summary TLDR

FLEX-Judge fine-tunes a multimodal LLM on a small (≈1K) curated set of text-only reasoning annotations. The model learns to give structured explanations (<think> chains) and transfers those decision rules to evaluate images, video, audio, and molecules without modality-specific training. On several benchmarks it matches or beats larger or modality-trained judges (e.g., it matches GPT-4o on GenAI-Bench when majority voting is used) and drives practical tasks such as best-of-N selection and DPO-based fine-tuning in the molecular domain. The method is low-cost (short fine-tuning runs on 2 A6000 GPUs) but depends on a strong LLM backbone and careful data quality.

Problem Statement

High-quality human feedback is costly and multimodal preference datasets are scarce. Existing multimodal judge models need large modality-specific annotation sets. The paper asks whether a small corpus of high-quality textual reasoning explanations is enough to train a multimodal judge that generalizes across modalities and evaluation formats.

Main Contribution

Show that training a multimodal judge on ≈1K high-quality text reasoning annotations yields strong zero-shot multimodal evaluation.

Introduce FLEX-Judge: fine-tune MLLMs (Qwen2.5-VL/Omni) on reasoning-first outputs and support single-score, pairwise and batch ranking formats.
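A reasoning-first output has to be split into its explanation and its final verdict before it can be used as a score or a preference. The sketch below shows one way to do that. It assumes the judge emits a `<think>...</think>` block followed by the verdict text; the exact output template, the `Judgment` type, and the `parse_judgment` helper are illustrative assumptions, not the paper's code.

```python
import re
from typing import NamedTuple, Optional


class Judgment(NamedTuple):
    reasoning: str  # text found inside the <think> block
    verdict: str    # whatever follows the block (a score, "A"/"B", a ranking, ...)


# Assumed output shape: "<think> ... </think>" followed by the final verdict.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_judgment(raw: str) -> Optional[Judgment]:
    """Split a raw judge response into reasoning and verdict.

    Returns None when the response has no <think> block or no verdict after it,
    so callers can discard or resample malformed outputs.
    """
    match = _THINK_RE.search(raw)
    if match is None:
        return None
    reasoning = match.group(1).strip()
    verdict = raw[match.end():].strip()
    if not verdict:
        return None
    return Judgment(reasoning, verdict)
```

The same parser can serve single-score, pairwise, and batch-ranking formats, because only the interpretation of `verdict` changes between them.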

Key Findings

Reasoning-first fine-tuning on ~1K text examples yields strong multimodal judges.

Numbers: 1K training samples vs 113K/150K used by LLaVA-Critic/Prometheus-Vision

Practical Use: Try a small, high-quality reasoning seed dataset before collecting large modality-specific labels; it can cut the number of annotated samples by roughly 100x (1K vs 113K-150K) while keeping strong judge performance.

Evidence Ref: Sec. 2.2; Table 1 text

FLEX-VL-7B with majority voting matches or slightly exceeds GPT-4o on GenAI-Bench overall.

Numbers: FLEX-VL-7B + majority voting overall 49.29 vs GPT-4o 49.2

Practical Use: Use inference-time scaling (majority voting) with a reasoning-tuned judge to reach GPT-4o-level scores on generation benchmarks such as GenAI-Bench.

Evidence Ref: Table 3
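The majority-voting step in this finding can be sketched as follows. This is a minimal illustration, not the authors' implementation: `judge` is a hypothetical callable that samples one verdict per call (with non-zero temperature) and returns `None` when the output cannot be parsed, and `n_samples` is an arbitrary default.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(
    judge: Callable[[str], Optional[str]],
    prompt: str,
    n_samples: int = 5,
) -> Optional[str]:
    """Sample the judge n_samples times and return the most frequent verdict.

    Unparseable samples (None) are ignored. If every sample is unparseable,
    None is returned. Ties go to the verdict that was seen first.
    """
    votes = [v for v in (judge(prompt) for _ in range(n_samples)) if v is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```

Each extra sample costs one more judge call, so `n_samples` trades inference cost against stability of the verdict.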

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
--- | --- | --- | --- | --- | --- | ---
GenAI-Bench overall (majority voting) | 49.29 (FLEX-VL-7B + majority voting) | 49.2 (GPT-4o) | +0.09 | GenAI-Bench | Table 3 GenAI-Bench results | Table 3
MLLM-as-a-Judge (pair w. tie) average | 0.538 (FLEX-VL-7B) | 0.717 (GPT-4V) | -0.179 | MLLM-as-a-Judge (pair, w. tie) | Table 1 average per-model scores | Table 1

What To Try In 7 Days

Fine-tune an existing vision/audio-capable LLM on ~1K curated text reasoning examples and test zero-shot on one image or audio benchmark.

Add inference-time majority voting to the tuned judge and compare scores vs a baseline API on a small validation set.

Use the judge to rank N sampled outputs (best-of-N) for a domain task and measure downstream task improvement.
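The best-of-N step in the list above reduces to scoring each sampled output and keeping the top one. A minimal sketch, assuming a hypothetical `score` callable that wraps the tuned judge in single-score mode and returns a number where higher is better:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest judge score.

    `score` stands in for one call to the judge per candidate.
    On equal scores the earliest candidate wins.
    """
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=score)
```

To measure downstream improvement, compare the task metric of the selected output against that of a randomly chosen candidate from the same N samples.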

Optimization Features

Infra Optimization
Short fine-tune (≈1.5 hours on 2 A6000 GPUs for 7B model)
System Optimization
Fine-tune LLM backbone only; reuse modality adapters
Training Optimization
Small-data fine-tuning (1K examples)
On-policy, low-temperature sample selection
Inference Optimization
Inference-time scaling: majority voting
Budget forcing / self-refinement

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Depends on a strong LLM backbone able to produce and consume structured reasoning; weak backbones fail (see 3D-LLM attempt).

Position bias: models prefer one response position and underuse the 'Tie' option unless mitigated by randomization.
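One common way to apply the randomization mentioned above is to shuffle which response is shown first and then map the verdict back to the original labels. The sketch below is an illustration under assumptions: `judge` is a hypothetical callable that returns `"first"`, `"second"`, or `"tie"`, and the label names are not taken from the paper.

```python
import random
from typing import Callable


def debiased_pairwise(
    judge: Callable[[str, str], str],
    resp_a: str,
    resp_b: str,
    rng: random.Random,
) -> str:
    """Run one pairwise comparison with a randomized presentation order.

    Returns "A", "B", or "tie" in terms of the original (unshuffled) responses.
    """
    swapped = rng.random() < 0.5
    first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)

    verdict = judge(first, second)
    if verdict not in ("first", "second", "tie"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "tie":
        return "tie"

    picked_first = verdict == "first"
    # If the order was swapped, the first slot held B; otherwise it held A.
    return "B" if picked_first == swapped else "A"
```

Randomizing the order does not remove the bias from any single call, but across many comparisons it stops a position preference from systematically favoring one system.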

When Not To Use

If your base MLLM lacks strong textual reasoning pretraining or has a small context window.

When modality-specific, high-quality labeled preference data already exist and are affordable.

Failure Modes

Judge follows length or position biases without mitigation, skewing rankings.

Overfitting to on-policy reasoning samples causes a drop in modality perception (catastrophic forgetting).

Core Entities

Models

FLEX-Omni-7B, FLEX-VL-7B, FLEX-Mol-LLaMA, Qwen2.5-VL-7B, Qwen2.5-Omni-7B, JudgeLRM-7B, Mol-LLaMA, GPT-4o, Gemini-1.5-Pro, LLaVA-Critic-7B, Prometheus-Vision-13B, Qwen2.5-VL-3B

Metrics

Pearson correlation, Accuracy, Normalized Levenshtein distance, Linear correlation coefficient (LCC), Spearman rank correlation (SRCC)

Datasets

JudgeLM-100K, MLLM-as-a-Judge, VL-RewardBench, MJ-Bench, GenAI-Bench, NISQA, BVCC, SOMOS, VoxSim, RLHF-V, JudgeAnything, MMRB

Benchmarks

MLLM-as-a-Judge, VL-RewardBench, MJ-Bench, GenAI-Bench, Audio MOS/SS (NISQA, BVCC, SOMOS, VoxSim), MMRB, JudgeAnything