Early comparison shows Google Gemini Pro is a close challenger to GPT-4V on multimodal understanding, with different strengths and common ML

December 19, 2023 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

13

Authors

Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun

Links

Abstract / PDF

Why It Matters For Business

Gemini Pro is a practical, competitive alternative to GPT-4V for many multimodal products; choose the model that matches task needs (cognition/code vs concise multi-domain answers) and test spatial/OCR edge cases before deployment.

Summary TLDR

This paper runs a broad, hands-on comparison of Google Gemini Pro, OpenAI GPT-4V, and an open-source model (Sphinx). Using many curated visual examples plus the MME benchmark, the authors find Gemini is a strong, broadly capable challenger to GPT-4V (Gemini overall 1933.4 vs GPT-4V 1926.6 on MME). GPT-4V leads on higher-level cognition and code reasoning, while Gemini is concise and competitive across domains. Common weaknesses include spatial relations, OCR errors, hallucination, and sensitivity to prompt style. The authors release a tracking repo for follow-up work.

Problem Statement

Can Google’s new multimodal model Gemini Pro match or surpass GPT-4V in visual understanding? The paper compares Gemini, GPT-4V, and an open-source baseline (Sphinx) across many visual tasks to map strengths, failure modes, and practical gaps for real use.

Main Contribution

Large qualitative comparison across four domains: perception, advanced cognition, challenging vision tasks, and expert applications.

Quantitative benchmark using MME (14 sub-tasks) with per-subtask scores and overall comparisons for Gemini, GPT-4V, and Sphinx.

Systematic failure-mode analysis: spatial perception, OCR, hallucination, and prompt robustness, with example-driven evidence and practical notes.

Key Findings

Gemini narrowly outscored GPT-4V on the MME benchmark overall.

Numbers: Gemini 1933.4 vs GPT-4V 1926.6 overall (MME, higher is better)

GPT-4V leads on cognition-heavy sub-tasks, especially code reasoning.

Numbers: GPT-4V code reasoning 170.0 vs Gemini 85.0 (MME subtask)

Both closed-source models struggle with spatial position recognition.

Numbers: Position subtask: Sphinx 153.3, GPT-4V 95.0, Gemini 90.0 (MME)

GPT-4V may refuse real-person identification, causing a zero score on celebrity detection.

Numbers: GPT-4V celebrity subtask score = 0.0 (refusal policy)

OCR and prompt design affect correctness and stability.

Numbers: Multiple qualitative failures in table/chart OCR and contradictory answers under different prompts (Sec. 3.1, Fig. 41-43)

Results

MME overall score

Value: Gemini 1933.4; GPT-4V 1926.6; Sphinx 1870.2

Code reasoning (MME subtask)

Value: GPT-4V 170.0; Gemini 85.0; Sphinx 50.0

Position (spatial) subtask

Value: Sphinx 153.3; GPT-4V 95.0; Gemini 90.0

Celebrity recognition

Value: GPT-4V 0.0 (refusal); Gemini 147.4; Sphinx 177.9
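To make the reported gaps easier to compare, the following minimal sketch tabulates the four MME figures listed in the Results above and reports, for each, the leading model and its margin over the runner-up. The scores are copied from this summary; the names `SCORES`, `leader_and_margin`, and `summarize` are illustrative and not from the paper.

```python
# Scores copied from the Results section of this summary (MME, higher is better).
SCORES = {
    "overall": {"Gemini": 1933.4, "GPT-4V": 1926.6, "Sphinx": 1870.2},
    "code_reasoning": {"Gemini": 85.0, "GPT-4V": 170.0, "Sphinx": 50.0},
    "position": {"Gemini": 90.0, "GPT-4V": 95.0, "Sphinx": 153.3},
    "celebrity": {"Gemini": 147.4, "GPT-4V": 0.0, "Sphinx": 177.9},
}


def leader_and_margin(task_scores):
    """Return (best model, margin over the second-best model) for one task."""
    ranked = sorted(task_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    return best, round(best_score - runner_up_score, 1)


def summarize(scores):
    """Map each task name to its (leader, margin) pair."""
    return {task: leader_and_margin(s) for task, s in scores.items()}


if __name__ == "__main__":
    for task, (model, margin) in summarize(SCORES).items():
        print(f"{task}: {model} leads by {margin}")
```

Read this way, the overall lead is small relative to the subtask swings, which is why per-subtask comparison matters more than the headline total when choosing a model.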

Who Should Care

What To Try In 7 Days

Run the MME benchmark subset relevant to your product to compare Gemini and GPT-4V on your use cases.

Stress-test spatial decisions and OCR paths; add a dedicated vision pipeline for critical spatial/text tasks.

Prototype prompt templates (CoT vs direct) and log instability to decide fallback rules for inconsistent answers.
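The prompt-stability step above can be sketched as a small harness that sends several phrasings of the same question and flags disagreement. This is a minimal sketch under stated assumptions: `ask` is a hypothetical callable wrapping whichever model API you use (it takes a prompt string and returns the answer text), and the 0.75 agreement threshold is an arbitrary placeholder, not a value from the paper.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods before comparing."""
    return " ".join(answer.lower().split()).rstrip(".")


def check_stability(ask, prompts, min_agreement=0.75):
    """Ask the same question under several prompt phrasings and measure agreement.

    ask           -- hypothetical callable: prompt string -> answer string
    prompts       -- different phrasings (e.g. direct vs chain-of-thought) of one question
    min_agreement -- assumed threshold below which a fallback path should be used
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    answers = [normalize(ask(p)) for p in prompts]
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return {
        "answers": answers,          # keep the raw list for logging
        "majority": majority,
        "agreement": agreement,
        "fallback": agreement < min_agreement,
    }


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the output shape.
    def fake_ask(prompt):
        return "Left." if "step by step" in prompt else "right"

    report = check_stability(
        fake_ask,
        ["Which side is the cup on?", "Think step by step: which side is the cup on?"],
    )
    print(report)
```

Exact-match comparison after light normalization only suits short, closed-form answers (yes/no, left/right, a number); free-form answers would need a task-specific comparison.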

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Qualitative samples are illustrative but not exhaustive; curated selection can bias impressions.
  • MME benchmark alignment may favor models trained on similar public datasets (which may benefit Sphinx on perception sub-tasks).
  • Closed-source model internals and training details are unavailable, limiting causal analysis.

When Not To Use

  • Do not rely solely on Gemini or GPT-4V for precise spatial reasoning (left/right) or mission-critical navigation decisions.
  • Avoid using these models alone for medical diagnosis or safety-critical defect detection without specialist pipelines and human review.
  • Do not expect accurate OCR on small or low-resolution text without a dedicated OCR preprocessor.

Failure Modes

  • Hallucination: inventing nonexistent details under pressure or misleading prompts.
  • Prompt sensitivity: different prompt formulations produce contradictory outputs.
  • OCR failures: misread characters yielding wrong numeric/logic answers.
  • Spatial insensitivity: poor left/right and fine relative-position judgments.

Core Entities

Models

  • Gemini Pro
  • GPT-4V
  • Sphinx

Metrics

  • overall score (MME)
  • Accuracy
  • subtask scores (e.g., code reasoning, position)
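For readers unfamiliar with how the subtask scores listed above are scaled, here is a minimal sketch of the scoring rule as described in the MME benchmark's own documentation, which this summary does not restate: each image comes with two yes/no questions, "accuracy" is computed per question, "accuracy+" per image (both questions must be right), and the subtask score is their sum, for a maximum of 200. Treat this as an assumption about MME's protocol; the function name and input format are illustrative.

```python
def mme_subtask_score(results):
    """Compute an MME-style subtask score (assumed rule: accuracy + accuracy+).

    results -- list of (first_question_correct, second_question_correct) booleans,
               one tuple per image. A refusal counts as an incorrect answer.
    """
    if not results:
        raise ValueError("results must contain at least one image")
    n_images = len(results)
    correct_questions = sum(int(a) + int(b) for a, b in results)
    correct_images = sum(1 for a, b in results if a and b)
    accuracy = 100.0 * correct_questions / (2 * n_images)   # per-question, in percent
    accuracy_plus = 100.0 * correct_images / n_images       # per-image, in percent
    return accuracy + accuracy_plus


if __name__ == "__main__":
    # Two images: one fully correct, one with a single correct answer.
    print(mme_subtask_score([(True, True), (True, False)]))
```

Under this rule, a model that declines every question in a subtask scores 0.0, which is consistent with the GPT-4V celebrity result being attributed to refusals.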

Datasets

  • MME benchmark
  • COCO
  • Places
  • Google Landmarks v2
  • MovieNet

Benchmarks

  • MME benchmark