LVLM-eHub: a practical benchmark and human arena to measure large vision-language models across six multimodal capabilities

Overview

Decision Snapshot: Ready For Pilot

This is a pragmatic, broadly scoped benchmark that pairs quantitative tests with a human Arena. It exposes clear failure modes (overfitting, hallucination) and offers a mitigation (multi-turn reasoning), but the Arena needs ongoing human labeling and CIDEr-style metrics are brittle.

Citations: 20

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 50%

Novelty: 55%

Authors

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Benchmark scores can be misleading: high in-domain numbers often mean overfitting and worse open-world behavior; evaluate models with human-in-the-loop tests and targeted hallucination probes before deploying.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper builds LVLM-eHub: a public benchmark and an online arena to evaluate eight large vision-language models (LVLMs) across six capabilities (visual perception, knowledge, reasoning, commonsense, object hallucination, embodied intelligence). Quantitative tests use 47+ text-based visual datasets and standard metrics. A crowd-sourced 1v1 Arena collects human judgments on open-world queries. Key findings: heavy in-domain instruction tuning (InstructBLIP) can overfit; moderate instruction-tuning often causes object hallucination and breaks metrics like CIDEr; multi-turn reasoning (asking sub-questions + re-evaluation) reduces hallucination. The platform and pipelines are intended for wider

Problem Statement

There is no single, broad, and practical evaluation that measures how large vision-language models behave across diverse real-world tasks and human-facing use. Existing studies focus on parts (OCR, hallucination, commonsense) and miss combined quantitative and human-in-the-loop assessments.

Main Contribution

LVLM-eHub: a unified evaluation hub combining quantitative tests (47+ visual text datasets) and an online human Arena.

A zero-shot capability suite covering six multimodal categories: perception, knowledge, reasoning, commonsense, object hallucination, embodied intelligence.

Key Findings

Instruction-tuned models trained on massive in-domain data (InstructBLIP) score highest on many standard benchmarks but generalize poorly in open-world human evaluations.

Numbers: InstructBLIP avg. scores: Visual Knowledge 0.967 (Table 3); Perception avg. 0.928 (Table 2); Arena rank lower in open-world

Practical Use: Do not assume top benchmark numbers imply good real-world performance; validate with open-ended human tests (Arena) before deployment.

Evidence Ref: Table 3, Table 2, Fig. 1 and Sec. 3.7

Moderate instruction-following tuning often increases object hallucination and can make common metrics (CIDEr) unreliable for captions.

Numbers: Object-hallucination metrics: InstructBLIP accuracy ~88.8% vs others lower; CIDEr failures shown in Fig. 4 (examples with

Practical Use: Run hallucination probes (e.g., POPE) and human checks on captioning and VQA outputs; don't rely on CIDEr alone.

Evidence Ref: Sec. 3.5, Table 6, Fig. 4
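A hallucination probe of the kind recommended above can be scored with ordinary binary-classification metrics. The sketch below is a minimal illustration, not the paper's pipeline: it assumes each probe is a yes/no question about whether an object is present, that ground-truth labels are available, and that an answer counts as "yes" only when its first word is "yes". The names `ProbeResult`, `parse_yes_no`, and `score_probes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    accuracy: float
    precision: float
    recall: float
    f1: float
    yes_ratio: float  # share of "yes" answers; a high value hints at a yes-bias


def parse_yes_no(answer: str) -> bool:
    """Map a free-form answer to True ("yes") or False (anything else).

    Assumption: only the first word is inspected.
    """
    words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
    return bool(words) and words[0] == "yes"


def score_probes(answers: list[str], labels: list[bool]) -> ProbeResult:
    """Score yes/no probe answers against ground-truth object presence."""
    if not answers or len(answers) != len(labels):
        raise ValueError("answers and labels must be non-empty and equal length")

    preds = [parse_yes_no(a) for a in answers]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    tn = sum(not p and not y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return ProbeResult(
        accuracy=(tp + tn) / len(labels),
        precision=precision,
        recall=recall,
        f1=f1,
        yes_ratio=sum(preds) / len(preds),
    )
```

Reporting the yes-ratio next to accuracy is a deliberate choice: a model that answers "yes" to almost everything can look acceptable on a probe set dominated by present objects while still hallucinating absent ones.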

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Visual Knowledge Acquisition (avg.)	InstructBLIP 0.967 (highest among models)	—	—	See Table 3 (OCR, KIE, ImgCap)	Table 3 average score 0.967 for InstructBLIP	Table 3
Visual Perception (avg.)	InstructBLIP 0.928 (top among LVLMs)	Supervised SOTA >> LVLMs	SOTA >> model (e.g., ImageNet top-1 91.1 vs LVLM ~24)	Table 2 (ImageNet1K & other perception tasks)	Table 2: InstructBLIP avg. score 0.928; ImageNet top-1 accuracy for LVLMs ~24% vs SOTA 91%	Table 2

What To Try In 7 Days

Run LVLM-eHub zero-shot suite (or equivalent) on your model to get capability-level baselines.

Probe object hallucination using POPE-like yes/no probes on a representative image set.

Add multi-turn reasoning checks for safety-critical visual queries (ask sub-questions then re-evaluate).
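The multi-turn check in the last item (ask sub-questions, then re-evaluate) can be sketched as a small wrapper around any model call. This is an illustrative outline under stated assumptions, not the authors' implementation: `ask` is a hypothetical callable that takes an image reference and a prompt and returns text, the sub-questions are supplied by the caller, and the wording of the review prompt is invented for the example.

```python
from typing import Callable

# Hypothetical interface: (image_reference, prompt) -> model answer text.
AskFn = Callable[[str, str], str]


def multi_turn_answer(
    ask: AskFn,
    image: str,
    question: str,
    sub_questions: list[str],
) -> str:
    """Answer a visual question, then revise it against sub-question answers."""
    # Turn 1: get a draft answer to the original question.
    draft = ask(image, question)

    # Middle turns: ask each narrower sub-question on its own.
    evidence = [(q, ask(image, q)) for q in sub_questions]

    # Final turn: ask the model to re-evaluate the draft given the evidence.
    notes = "\n".join(f"- {q} -> {a}" for q, a in evidence)
    review_prompt = (
        f"Question: {question}\n"
        f"Draft answer: {draft}\n"
        f"Observations:\n{notes}\n"
        "Revise the draft so it is consistent with the observations."
    )
    return ask(image, review_prompt)
```

Each sub-question is asked independently of the draft, so an object invented in the first answer has to be confirmed again before it survives the final turn.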

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

LVLM Arena page (to be released by authors)

Data URLs

Paper lists datasets and splits; specific datasets are public (COCO, ImageNetVC, NoCaps, Flickr30K, etc.)

Risks & Boundaries

Limitations

CIDEr and automatic metrics often fail to reflect true caption quality for instruction-tuned models (Fig.4).

Models are highly sensitive to prompts; zero-shot numbers vary by prompt (Appendix C.1).
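Because zero-shot numbers vary by prompt, it can help to report a spread across several prompt templates instead of a single score. The helper below is a minimal sketch under that assumption; `prompt_sensitivity` is a hypothetical name, and the per-prompt scores are whatever metric you already compute for each template.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarize how much one metric moves across prompt templates."""
    values = list(scores_by_prompt.values())
    if not values:
        raise ValueError("need at least one prompt score")
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population std over the prompts tried
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }
```

A large range relative to the gap between two models suggests that a ranking between them may not hold under a different prompt.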

When Not To Use

Do not rely on LVLM-eHub CIDEr scores alone to certify caption quality for instruction-tuned models.

Avoid using only in-domain benchmark results to predict open-world user satisfaction.

Failure Modes

Object hallucination: models invent objects not present in images.

Overfitting to in-domain VQA/instruction data (high benchmark but low open-world performance).

Core Entities

Models

BLIP2, LLaVA, LLaMA-Adapter V2, MiniGPT-4, mPLUG-Owl, Otter, InstructBLIP, VPGTrans

Metrics

Accuracy, CIDEr, Mean Reciprocal Rank (MRR), entity-level F1, precision, recall, F1-Score, Elo rating
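Elo rating is the metric that turns 1v1 Arena judgments into a ranking. The sketch below shows the standard Elo update for pairwise battles. The K-factor of 32, the initial rating of 1000, and the 400-point scale are conventional defaults assumed here, not values confirmed by the paper, and the function names are hypothetical.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(
    r_a: float, r_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def rate_battles(
    battles: list[tuple[str, str, float]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> dict[str, float]:
    """Replay (model_a, model_b, score_a) battles in order; return ratings."""
    ratings: dict[str, float] = {}
    for a, b, score_a in battles:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a, k)
    return ratings
```

Sequential Elo depends on battle order, so with a small number of human votes it is worth replaying the battles in several shuffled orders and checking that the ranking is stable.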

Datasets

ImageNet1K, CIFAR10, Pets37, Flowers102, MSCOCO, NoCaps, Flickr30K, ImageNetVC, VCR, SNLI-VE, DocVQA, TextVQA, OKVQA, Visdial, COCO-Text, IIIT5K, SROIE, FUNSD, Minecraft, VirtualHome, Meta-World, Franka Kitchen

Benchmarks

LVLM-eHub, LVLM Arena, POPE object-hallucination pipeline
