LVLM-eHub: a practical benchmark and human arena to measure large vision-language models across six multimodal capabilities

June 15, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

This is a pragmatic, broadly scoped benchmark that combines quantitative tests with a human Arena. It exposes clear failure modes (overfitting, hallucination) and offers a mitigation (multi-turn reasoning), but the Arena needs ongoing human labeling and CIDEr-style metrics are brittle.

Citations: 20

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 50%

Novelty: 55%

Authors

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo

Why It Matters For Business

Benchmark scores can be misleading: high in-domain numbers often mean overfitting and worse open-world behavior; evaluate models with human-in-the-loop tests and targeted hallucination probes before deploying.

Who Should Care

Summary TLDR

This paper builds LVLM-eHub: a public benchmark and an online arena to evaluate eight large vision-language models (LVLMs) across six capabilities (visual perception, knowledge, reasoning, commonsense, object hallucination, embodied intelligence). Quantitative tests use 47+ text-based visual datasets and standard metrics. A crowd-sourced 1v1 Arena collects human judgments on open-world queries. Key findings: heavy in-domain instruction tuning (InstructBLIP) can overfit; moderate instruction-tuning often causes object hallucination and breaks metrics like CIDEr; multi-turn reasoning (asking sub-questions + re-evaluation) reduces hallucination. The platform and pipelines are intended for wider

Problem Statement

There is no single, broad, and practical evaluation that measures how large vision-language models behave across diverse real-world tasks and human-facing use. Existing studies focus on parts (OCR, hallucination, commonsense) and miss combined quantitative and human-in-the-loop assessments.

Main Contribution

LVLM-eHub: a unified evaluation hub combining quantitative tests (47+ text-based visual datasets) and an online human Arena.

A zero-shot capability suite covering six multimodal categories: perception, knowledge, reasoning, commonsense, object hallucination, embodied intelligence.
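
The Arena turns pairwise human votes into a ranking, and the Metrics list below names Elo rating. A minimal sketch of a standard Elo update over 1v1 votes follows. The K-factor of 32, the starting rating of 1000, the vote format, and the model names are illustrative assumptions, not values confirmed by the paper.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update ratings in place.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k is an assumed K-factor, not taken from the paper.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta


def rank_from_votes(votes, start=1000.0, k=32.0):
    """Replay votes in order and return the final rating per model."""
    ratings = defaultdict(lambda: start)
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)


# Hypothetical votes: (model shown as A, model shown as B, outcome for A)
votes = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]
ratings = rank_from_votes(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Because each update adds to one model exactly what it subtracts from the other, the total rating mass stays constant. Elo ratings depend on vote order, so a production ranking would likely need many votes per pair before the ordering is stable.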

Key Findings

Instruction-tuned models trained on massive in-domain data (InstructBLIP) score highest on many standard benchmarks but generalize poorly in open-world human evaluations.

Numbers: InstructBLIP avg. scores: Visual Knowledge 0.967 (Table 3); Perception avg. 0.928 (Table 2); Arena rank lower in open-world

Practical Use: Do not assume top benchmark numbers imply good real-world performance; validate with open-ended human tests (Arena) before deployment.

Evidence Ref: Table 3, Table 2, Fig. 1 and Sec. 3.7

Moderate instruction-following tuning often increases object hallucination and can make common metrics (CIDEr) unreliable for captions.

Numbers: Object-hallucination metrics: InstructBLIP accuracy ~88.8% vs others lower; CIDEr failures shown in Fig. 4 (examples with

Practical Use: Run hallucination probes (e.g., POPE) and human checks on captioning and VQA outputs; don't rely on CIDEr alone.

Evidence Ref: Sec. 3.5, Table 6, Fig. 4
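
A POPE-style probe asks yes/no questions of the form "Is there a <object> in the image?" and scores the answers as binary classification. The sketch below computes the accuracy, precision, recall, and F1 named in the Metrics list, plus a yes-ratio as a simple bias indicator. The function name, the input format, and the sample answers are assumptions for illustration; this is not the authors' evaluation code.

```python
def pope_metrics(predictions, labels):
    """Score yes/no probe answers.

    predictions, labels: equal-length lists of "yes" / "no" strings,
    where "yes" means the object is (claimed to be) present.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")

    tp = fp = tn = fn = 0
    for pred, gold in zip(predictions, labels):
        said_yes = pred.strip().lower() == "yes"
        is_present = gold.strip().lower() == "yes"
        if said_yes and is_present:
            tp += 1
        elif said_yes and not is_present:
            fp += 1  # hallucinated object
        elif not said_yes and is_present:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total,  # share of "yes" answers
    }


# Hypothetical probe results for six questions
preds = ["yes", "yes", "no", "yes", "no", "no"]
golds = ["yes", "no", "no", "yes", "yes", "no"]
metrics = pope_metrics(preds, golds)
print(metrics)
```

A yes-ratio far above the share of positive questions suggests the model tends to agree regardless of image content, which is one way hallucination shows up in this kind of probe.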

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Visual Knowledge Acquisition (avg.) | InstructBLIP 0.967 (highest among models) | - | - | See Table 3 (OCR, KIE, ImgCap) | Table 3 average score 0.967 for InstructBLIP | Table 3
Visual Perception (avg.) | InstructBLIP 0.928 (top among LVLMs) | Supervised SOTA >> LVLMs | SOTA >> model (e.g., ImageNet top-1 91.1 vs LVLM ~24) | Table 2 (ImageNet1K & other perception tasks) | Table 2: InstructBLIP avg. score 0.928; ImageNet top-1 accuracy for LVLMs ~24% vs SOTA 91% | Table 2

What To Try In 7 Days

Run LVLM-eHub zero-shot suite (or equivalent) on your model to get capability-level baselines.

Probe object hallucination using POPE-like yes/no probes on a representative image set.

Add multi-turn reasoning checks for safety-critical visual queries (ask sub-questions then re-evaluate).
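
The multi-turn check in the last item can be sketched as a small wrapper: answer the question once, ask grounding sub-questions, then re-ask with the sub-answers as context and flag any change. The `ask` callable, the prompt wording, and the toy model below are hypothetical stand-ins; the paper's own pipeline may differ in how sub-questions are generated and combined.

```python
from typing import Callable, Dict, List

# Hypothetical interface: (image_path, prompt) -> model answer
AskFn = Callable[[str, str], str]


def multi_turn_answer(ask: AskFn, image: str, question: str,
                      sub_questions: List[str]) -> Dict[str, object]:
    """Answer, gather evidence via sub-questions, then re-evaluate."""
    initial = ask(image, question)

    # Collect grounding evidence from the same model.
    evidence = [(q, ask(image, q)) for q in sub_questions]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in evidence)

    # Re-ask the original question with the evidence in the prompt.
    final_prompt = f"{context}\nGiven the answers above, {question}"
    final = ask(image, final_prompt)

    return {
        "initial": initial,
        "evidence": evidence,
        "final": final,
        "changed": initial.strip().lower() != final.strip().lower(),
    }


def toy_model(image: str, prompt: str) -> str:
    """Stand-in model that hallucinates on the bare question only."""
    if prompt.startswith("What objects"):
        return "a cat on a sofa"
    if "Given the answers above" in prompt:
        return "no"
    return "yes"


result = multi_turn_answer(
    toy_model,
    "example.jpg",  # placeholder path, never opened by the toy model
    "Is there a dog in the image?",
    ["What objects are visible in the image?"],
)
print(result["initial"], "->", result["final"], "| changed:", result["changed"])
```

Answers that flip after re-evaluation are good candidates for human review, since the flip signals that the first response was not grounded in what the model itself reports seeing.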

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

LVLM Arena page (to be released by authors)

Data URLs

Paper lists datasets and splits; specific datasets are public (COCO, ImageNetVC, NoCaps, Flickr30K, etc.)

Risks & Boundaries

Limitations

CIDEr and automatic metrics often fail to reflect true caption quality for instruction-tuned models (Fig.4).

Models are highly sensitive to prompts; zero-shot numbers vary by prompt (Appendix C.1).
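
To see why n-gram metrics can mislead on instruction-tuned captions, consider a simplified overlap score. The sketch below is not CIDEr: real CIDEr adds TF-IDF weighting, uses n-grams up to length 4, and averages over multiple references. It only illustrates the shared mechanism, where a long, accurate description scores lower than a terse one against a short reference. All captions are invented examples.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_cosine(candidate: str, reference: str, max_n: int = 2) -> float:
    """Mean cosine similarity of n-gram count vectors for n = 1..max_n."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        dot = sum(c[g] * r[g] for g in c)  # missing keys count as 0
        norm = (math.sqrt(sum(v * v for v in c.values()))
                * math.sqrt(sum(v * v for v in r.values())))
        scores.append(dot / norm if norm else 0.0)
    return sum(scores) / len(scores)


reference = "a dog runs on the beach"
terse = "a dog runs on a beach"
verbose = ("the image shows a brown dog that is happily running "
           "along a sandy beach near the ocean")

terse_score = ngram_cosine(terse, reference)
verbose_score = ngram_cosine(verbose, reference)
print(f"terse: {terse_score:.3f}  verbose: {verbose_score:.3f}")
```

Both candidate captions describe the scene correctly, yet the verbose one is penalized for its extra words and paraphrases. This is the kind of gap that human checks are meant to catch.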

When Not To Use

Do not rely on LVLM-eHub CIDEr scores alone to certify caption quality for instruction-tuned models.

Avoid using only in-domain benchmark results to predict open-world user satisfaction.

Failure Modes

Object hallucination: models invent objects not present in images.

Overfitting to in-domain VQA/instruction data (high benchmark but low open-world performance).

Core Entities

Models

BLIP2, LLaVA, LLaMA-Adapter V2, MiniGPT-4, mPLUG-Owl, Otter, InstructBLIP, VPGTrans

Metrics

Accuracy, CIDEr, Mean Reciprocal Rank (MRR), entity-level F1, precision, recall, F1-Score, Elo rating

Datasets

ImageNet1K, CIFAR10, Pets37, Flowers102, MSCOCO, NoCaps, Flickr30K, ImageNetVC, VCR, SNLI-VE, DocVQA, TextVQA, OKVQA, Visdial, COCO-Text, IIIT5K, SROIE, FUNSD, Minecraft, VirtualHome, Meta-World, Franka Kitchen

Benchmarks

LVLM-eHub, LVLM Arena, POPE object-hallucination pipeline