Comprehensive eval finds Gemini Pro close to GPT‑3.5 Turbo on language commonsense, behind GPT‑4 Turbo on language tasks and GPT‑4V on multimodal tasks

December 29, 2023 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.25

Cost Impact Score

0.4

Citation Count

14

Authors

Yuqing Wang, Yun Zhao

Why It Matters For Business

Gemini Pro is close to GPT‑3.5 for language commonsense but behind GPT‑4; pick models based on accuracy needs and multimodal complexity.

Summary TLDR

This paper evaluates Google's Gemini (Pro and Pro Vision) on 12 commonsense datasets (11 language, 1 multimodal) against Llama2-70B, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V. Gemini Pro matches or slightly exceeds GPT‑3.5 Turbo on the language benchmarks (0-shot average ≈79.2% vs 78.2%) but trails GPT‑4 Turbo on the same sets by roughly 7 to 9 percentage points (0-shot 79.2% vs 88.1%; k-shot 82.1% vs 89.5%). On the visual benchmark (VCR), Gemini Pro Vision scores lower than GPT‑4V (Q→A 74 vs 80), with particular weaknesses in social and temporal reasoning and in emotion recognition. Data and code are on GitHub.

Problem Statement

Earlier benchmark results suggested that Gemini was weaker at commonsense reasoning, but those checks covered only a small number of datasets. The paper asks: how well does Gemini (language and vision) handle diverse commonsense tasks when tested across a broader, multi-dataset suite?

Main Contribution

Systematic evaluation of Gemini Pro and Gemini Pro Vision on 12 commonsense datasets (11 language, 1 visual).

Head-to-head comparison with Llama2-70B, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V under 0-shot and few-shot CoT prompts.

Error analysis and manual review of model reasoning that highlight weaknesses in temporal reasoning, social reasoning, and emotion recognition.

Key Findings

Gemini Pro's language-only accuracy is similar to GPT‑3.5 Turbo.

Numbers: average accuracy 79.2% (Gemini Pro) vs 78.2% (GPT‑3.5 Turbo) on 11 language datasets

GPT‑4 Turbo leads by a clear margin on evaluated commonsense benchmarks.

Numbers: GPT‑4 Turbo average accuracy 88.1% (0-shot) vs Gemini Pro 79.2% (0-shot)

Gemini Pro Vision lags GPT‑4V on the VCR visual commonsense benchmark.

Numbers: VCR Q→A: GPT‑4V 80 vs Gemini Pro Vision 74; Q→AR: 56 vs 48

A substantial share of Gemini Pro's explanations are judged logically sound.

Numbers: ≈65.8% of Gemini Pro's sampled justifications labeled as correct reasoning

Common error modes: context misinterpretation and emotion-recognition failures in vision.

Numbers: context misinterpretation ~30.2% of Gemini Pro errors; emotion-recognition errors ~32.6% of Gemini Pro Vision errors

Results

Accuracy

Value: Gemini Pro 79.2% (0-shot avg); GPT‑4 Turbo 88.1% (0-shot avg)

Baseline: GPT‑4 Turbo

Per-model average (k-shot CoT)

Value: Gemini Pro 82.1% (k-shot); GPT‑4 Turbo 89.5% (k-shot)

Baseline: GPT‑4 Turbo

VCR visual commonsense (Q → A, QA → R, Q → AR)

Value: GPT‑4V 80 / 72 / 56; Gemini Pro Vision 74 / 70 / 48

Baseline: GPT‑4V
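
The three VCR columns are linked. In the VCR benchmark, Q→A scores answer selection, QA→R scores rationale selection given the correct answer, and Q→AR credits an example only when both the answer and the rationale are right. The Python sketch below illustrates that scoring on made-up per-example flags; the `VcrResult` class, its field names, and the demo data are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class VcrResult:
    """Per-example outcome. Class and field names are hypothetical."""
    answer_correct: bool     # Q -> A: the model picked the right answer
    rationale_correct: bool  # QA -> R: the model picked the right rationale


def vcr_scores(results):
    """Return (Q->A, QA->R, Q->AR) accuracies as percentages."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    q_a = sum(r.answer_correct for r in results)
    qa_r = sum(r.rationale_correct for r in results)
    # Q->AR is credited only when both steps are right.
    q_ar = sum(r.answer_correct and r.rationale_correct for r in results)
    return tuple(100.0 * c / n for c in (q_a, qa_r, q_ar))


# Illustrative data only: four examples, not real model outputs.
demo = [
    VcrResult(True, True),
    VcrResult(True, False),
    VcrResult(False, True),
    VcrResult(True, True),
]
print(vcr_scores(demo))  # (75.0, 75.0, 50.0)
```

Because Q→AR requires both parts to be correct, it can never exceed the lower of the other two scores, which is consistent with the reported 56 (GPT‑4V) and 48 (Gemini Pro Vision).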

Reasoning justification quality (sampled explanations)

Value: Gemini Pro ≈65.8% of explanations judged logically sound

What To Try In 7 Days

Run 200–500 samples from your own task through Gemini Pro and GPT‑3.5 Turbo to compare cost and accuracy.

If visuals matter, benchmark a representative set of 50–100 VCR-style image questions against GPT‑4V and Gemini Pro Vision.

Add a quick rationale check: sample model explanations and assess whether they are trustworthy for your downstream use.
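
One way to run the side-by-side comparison above is to score both models on the same examples and look at where they disagree, not only at the two averages. The Python sketch below assumes you have already collected each model's predicted option label per example; the function names and the demo labels are hypothetical, and no model API is called.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def paired_summary(preds_a, preds_b, gold):
    """Compare two models on the same examples.

    Returns each model's accuracy plus the number of examples that only
    one of the two models answered correctly.
    """
    only_a = sum(a == g and b != g for a, b, g in zip(preds_a, preds_b, gold))
    only_b = sum(b == g and a != g for a, b, g in zip(preds_a, preds_b, gold))
    return {
        "acc_a": accuracy(preds_a, gold),
        "acc_b": accuracy(preds_b, gold),
        "only_a_correct": only_a,
        "only_b_correct": only_b,
    }


# Illustrative labels only, not real model outputs.
gold = ["A", "C", "B", "D", "A"]
model_a = ["A", "C", "B", "A", "B"]
model_b = ["A", "B", "B", "D", "B"]
print(paired_summary(model_a, model_b, gold))
```

In this toy case both models score 0.6 yet fail on different examples, which is the kind of difference a single average hides. The same per-example view is a natural place to attach the sampled rationale check.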

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to selected datasets and sampled subsets (200 language, 50 visual examples).
  • Results are English-only and may not generalize cross-lingually or culturally.
  • APIs and closed models can change; this is a snapshot of available versions.
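
To get a feel for how much the small sampled subsets limit precision, a normal-approximation confidence interval for an accuracy estimate can be sketched as below. Treating 200 and 50 as the number of examples behind a single accuracy figure is an assumption here, and the reported averages pool several datasets, so read the output as a rough guide to per-subset noise rather than a statement about the paper's results. The function name is hypothetical.

```python
import math


def accuracy_margin(acc, n, z=1.96):
    """Half-width of a normal-approximation interval for an accuracy.

    acc is a fraction in [0, 1], n is the number of evaluated examples,
    and z=1.96 corresponds to a roughly 95% interval.
    """
    if not 0.0 <= acc <= 1.0:
        raise ValueError("acc must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(acc * (1.0 - acc) / n)


# Sample sizes are assumed from the limitation above (200 language, 50 visual).
for label, acc, n in [("language subset", 0.792, 200), ("VCR subset", 0.74, 50)]:
    margin = accuracy_margin(acc, n)
    print(f"{label}: {acc:.1%} +/- {margin:.1%}")
```

Under these assumptions the margin is several percentage points at n = 200 and wider still at n = 50, so gaps of about one point on a single subset should be interpreted cautiously.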

When Not To Use

  • High-stakes decisions requiring top-tier accuracy; prefer GPT‑4-class models.
  • Applications needing reliable image-based emotion recognition without further validation.
  • Temporal or nuanced social reasoning scenarios needing precise context handling.

Failure Modes

  • Context misinterpretation causing wrong inferences (common).
  • Incorrect or missing emotional cues in images (Gemini Pro Vision).
  • Temporal reasoning gaps on ambiguous or underspecified timelines.
  • Model refusals to answer counted as incorrect in evaluations.
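
Since refusals are scored as wrong answers, it can help to report the refusal rate next to accuracy so the two effects are not conflated. The Python sketch below represents a refusal as `None`, which is a hypothetical convention for illustration; how refusals are detected in the paper's code is not described here.

```python
def score_with_refusals(preds, gold):
    """Score predictions where None marks a refusal (hypothetical convention).

    Refusals count as incorrect in "accuracy", matching the evaluation
    rule above, and are also reported separately.
    """
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and equal length")
    n = len(gold)
    correct = sum(p is not None and p == g for p, g in zip(preds, gold))
    refused = sum(p is None for p in preds)
    answered = n - refused
    return {
        "accuracy": correct / n,  # refusals counted as wrong
        "refusal_rate": refused / n,
        # Accuracy over answered items only; None if everything was refused.
        "accuracy_when_answered": correct / answered if answered else None,
    }


# Illustrative data only: one refusal, one wrong answer, two correct.
print(score_with_refusals(["A", None, "C", "A"], ["A", "B", "C", "D"]))
```

Here accuracy is 0.5 overall but about 0.67 on the items the model actually answered, which separates refusals from genuine reasoning errors.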

Core Entities

Models

  • Gemini Pro
  • Gemini Pro Vision
  • GPT-4 Turbo
  • GPT-4V
  • GPT-3.5 Turbo
  • Llama-2-70b-chat

Metrics

  • Accuracy

Datasets

  • CommonsenseQA
  • Cosmos QA
  • αNLI
  • HellaSwag
  • TRAM
  • NumerSense
  • PIQA
  • QASC
  • RiddleSense
  • Social IQa
  • ETHICS
  • VCR

Benchmarks

  • VCR
  • HellaSwag
  • CommonsenseQA
  • TRAM