Overview
This is a pragmatic, broadly scoped benchmark that combines quantitative tests with a human-judged Arena. It exposes clear failure modes (overfitting, hallucination) and offers a mitigation (multi-turn reasoning). However, the Arena needs ongoing human labeling, and CIDEr-style metrics are brittle.
Citations: 20
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 50%
Novelty: 55%
Why It Matters For Business
Benchmark scores can be misleading: high in-domain numbers can signal overfitting and worse open-world behavior. Before deploying, evaluate models with human-in-the-loop tests and targeted hallucination probes.
Who Should Care
Summary TLDR
This paper builds LVLM-eHub, a public benchmark and online arena that evaluates eight large vision-language models (LVLMs) across six capabilities: visual perception, knowledge, reasoning, commonsense, object hallucination, and embodied intelligence. Quantitative tests use 47+ text-based visual datasets and standard metrics. A crowd-sourced 1v1 Arena collects human judgments on open-world queries. Key findings: heavy in-domain instruction tuning (InstructBLIP) can overfit; moderate instruction tuning often causes object hallucination and breaks metrics like CIDEr; multi-turn reasoning (asking sub-questions, then re-evaluating) reduces hallucination. The platform and pipelines are intended for wider
Problem Statement
There is no single, broad, and practical evaluation that measures how large vision-language models behave across diverse real-world tasks and human-facing use. Existing studies focus on individual capabilities (OCR, hallucination, commonsense) and lack a combined quantitative and human-in-the-loop assessment.
Main Contribution
LVLM-eHub: a unified evaluation hub combining quantitative tests (47+ text-based visual datasets) and an online human Arena.
A zero-shot capability suite covering six multimodal categories: perception, knowledge, reasoning, commonsense, object hallucination, embodied intelligence.
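One way to turn 1v1 Arena votes into a model ranking is an Elo-style rating update. The sketch below is illustrative only: this summary does not state which rating scheme the Arena uses, and the function name `elo_ratings`, the `k` factor, the starting rating, and the `(model_a, model_b, winner)` record layout are all assumptions.

```python
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise human votes.

    battles: iterable of (model_a, model_b, winner) tuples, where winner
    is "a", "b", or "tie". The defaults for k and base are illustrative
    assumptions, not values taken from the paper.
    """
    ratings = defaultdict(lambda: base)
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        delta = k * (outcome_to_score[winner] - expected_a)
        # Zero-sum update: whatever model_a gains, model_b loses.
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)
```

Sequential Elo depends on the order of the votes, so with a small number of battles it may be worth shuffling the vote list several times and averaging the resulting ratings.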
Key Findings
Instruction-tuned models trained on massive in-domain data (InstructBLIP) score highest on many standard benchmarks but generalize poorly in open-world human evaluations.
Moderate instruction-following tuning often increases object hallucination and can make common metrics (CIDEr) unreliable for captions.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Visual Knowledge Acquisition (avg.) | InstructBLIP 0.967 (highest among models) | — | — | OCR, KIE, ImgCap | Average score 0.967 for InstructBLIP | Table 3 |
| Visual Perception (avg.) | InstructBLIP 0.928 (top among LVLMs) | Supervised SOTA | LVLMs far below SOTA (e.g., ImageNet top-1: 91.1% SOTA vs. ~24% LVLM) | ImageNet1K and other perception tasks | InstructBLIP avg. score 0.928; ImageNet top-1 accuracy ~24% for LVLMs vs. 91% for SOTA | Table 2 |
What To Try In 7 Days
Run LVLM-eHub zero-shot suite (or equivalent) on your model to get capability-level baselines.
Probe object hallucination using POPE-like yes/no probes on a representative image set.
Add multi-turn reasoning checks for safety-critical visual queries (ask sub-questions then re-evaluate).
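The yes/no hallucination probe in the list above can be scored with a few lines of code. This is a minimal sketch, assuming each probe is a question of the form "Is there a <object> in the image?" with a ground-truth "yes" or "no" label. The function names and the crude reply normalization are hypothetical, and the reported metrics (accuracy, precision, recall, F1, yes-ratio) are a common choice for such probes rather than a confirmed description of the benchmark's own pipeline.

```python
def normalize_answer(text):
    """Map a free-form model reply to "yes" or "no".

    Crude heuristic (an assumption): anything not starting with "yes"
    is treated as "no".
    """
    return "yes" if text.strip().lower().startswith("yes") else "no"


def score_yes_no_probe(replies, labels):
    """Score model replies against ground-truth "yes"/"no" labels."""
    if len(replies) != len(labels):
        raise ValueError("replies and labels must have the same length")
    if not replies:
        raise ValueError("need at least one probe")
    tp = fp = tn = fn = 0
    for reply, gold in zip(replies, labels):
        pred = normalize_answer(reply)
        gold = gold.strip().lower()
        if pred == "yes" and gold == "yes":
            tp += 1
        elif pred == "yes":
            fp += 1  # model claims an object that is not in the image
        elif gold == "yes":
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the true share of "yes" labels suggests
        # the model tends to affirm objects regardless of the image.
        "yes_ratio": (tp + fp) / total,
    }
```

A balanced probe set (half "yes", half "no" labels) makes the yes-ratio easy to read: values well above 0.5 point to a bias toward affirming absent objects.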
Risks & Boundaries
Limitations
CIDEr and other automatic metrics often fail to reflect true caption quality for instruction-tuned models (Fig. 4).
Models are highly sensitive to prompts; zero-shot numbers vary by prompt (Appendix C.1).
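Because zero-shot numbers vary by prompt, it can help to report the spread across several prompt templates rather than a single score. The sketch below assumes you already have one score per prompt for a given model and dataset; the function name `prompt_sensitivity` and the dictionary layout are hypothetical.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores_by_prompt):
    """Summarize how much a metric moves across prompt templates.

    scores_by_prompt: dict mapping a prompt identifier to the score the
    model obtained with that prompt on the same dataset.
    """
    if not scores_by_prompt:
        raise ValueError("need at least one prompt score")
    scores = list(scores_by_prompt.values())
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std over the prompts tried
        "range": max(scores) - min(scores),
        "best_prompt": max(scores_by_prompt, key=scores_by_prompt.get),
        "worst_prompt": min(scores_by_prompt, key=scores_by_prompt.get),
    }
```

A large range relative to the gap between two models is a sign that their ranking may flip under a different prompt.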
When Not To Use
Do not rely on LVLM-eHub CIDEr scores alone to certify caption quality for instruction-tuned models.
Avoid using only in-domain benchmark results to predict open-world user satisfaction.
Failure Modes
Object hallucination: models invent objects not present in images.
Overfitting to in-domain VQA/instruction data (high benchmark scores but low open-world performance).
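For the object-hallucination failure mode, the multi-turn pattern named in the summary (ask sub-questions, then re-evaluate) can be sketched as a small wrapper around a model call. Everything here is an assumption for illustration: `ask` is a hypothetical callable `(image, prompt, history) -> str` that you would back with your own model, and the re-evaluation prompt wording is not taken from the paper.

```python
def multi_turn_check(ask, image, question, sub_questions):
    """Ask a question, probe with sub-questions, then re-evaluate.

    ask: hypothetical callable (image, prompt, history) -> str, where
    history is a list of (prompt, answer) pairs from earlier turns.
    """
    first_answer = ask(image, question, [])
    history = [(question, first_answer)]
    for sub_question in sub_questions:
        # Pass a copy so the callable cannot mutate our running history.
        answer = ask(image, sub_question, list(history))
        history.append((sub_question, answer))
    review_prompt = (
        "Given your answers above, re-evaluate the original question "
        "and give a final answer: " + question
    )
    final_answer = ask(image, review_prompt, list(history))
    return {
        "first_answer": first_answer,
        "final_answer": final_answer,
        # A changed answer flags the query for closer inspection.
        "changed": first_answer != final_answer,
        "history": history,
    }
```

For safety-critical queries, cases where `changed` is true are natural candidates for human review, since the model contradicted its own first answer.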

