Comprehensive eval finds Gemini close to GPT‑3.5 on language commonsense, behind GPT‑4 and GPT‑4V on multimodal tasks

December 29, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The evaluation uses multiple public benchmarks and manual rationale checks, so findings are grounded for current APIs but limited by sampling and English-only datasets.

Citations: 14

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 65%

Novelty: 25%

Authors

Yuqing Wang, Yun Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Gemini Pro is close to GPT‑3.5 for language commonsense but behind GPT‑4; pick models based on accuracy needs and multimodal complexity.

Who Should Care

Summary TLDR

This paper evaluates Google's Gemini (Pro and Pro Vision) on 12 commonsense datasets (11 language, 1 multimodal) against Llama2-70B, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V. Gemini Pro matches or slightly exceeds GPT‑3.5 Turbo on language benchmarks (average ≈79.2% vs 78.2%) but trails GPT‑4 Turbo by about 7 to 9 percentage points on the same sets (8.9 points 0-shot, 7.4 points with k-shot CoT). On the visual task (VCR), Gemini Pro Vision scores lower than GPT‑4V (Q→A 74 vs 80), with particular weaknesses in social and temporal reasoning and in emotion recognition. Data and code are on GitHub.

Problem Statement

Earlier benchmark results suggested Gemini was weaker at commonsense reasoning, but those checks used limited data. The paper asks: how well does Gemini (language and vision) handle diverse commonsense tasks when tested across a broader, multi-dataset suite?

Main Contribution

Systematic evaluation of Gemini Pro and Gemini Pro Vision on 12 commonsense datasets (11 language, 1 visual).

Head-to-head comparison with Llama2-70B, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V under 0-shot and few-shot CoT prompts.

Key Findings

Gemini Pro's language-only accuracy is similar to GPT‑3.5 Turbo.

Numbers: Average accuracy, Gemini Pro 79.2% vs GPT‑3.5 Turbo 78.2% on 11 language datasets

Practical Use: For many language-only commonsense tasks, Gemini Pro is a practical alternative to GPT‑3.5 Turbo; prefer GPT‑4 only when higher accuracy is required.

Evidence Ref: Table 2

GPT‑4 Turbo leads by a clear margin on evaluated commonsense benchmarks.

Numbers: GPT‑4 Turbo average accuracy 88.1% (0-shot) vs Gemini Pro 79.2% (0-shot)

Practical Use: If correctness is critical to the task, choose GPT‑4 class models over Gemini among the current-generation APIs.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | Gemini Pro 79.2% (0-shot avg); GPT‑4 Turbo 88.1% (0-shot avg) | GPT‑4 Turbo | −8.9 points vs GPT‑4 Turbo (0-shot) | 11 language datasets (see Table 2) | Table 2 reports per-dataset and averaged accuracy across models | Table 2
Per-model average (k-shot CoT) | Gemini Pro 82.1% (k-shot); GPT‑4 Turbo 89.5% (k-shot) | GPT‑4 Turbo | −7.4 points vs GPT‑4 Turbo (k-shot) | 11 language datasets with CoT prompting | Table 2 k-shot CoT column averages | Table 2
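The deltas above are absolute differences between average accuracies, which are best read as percentage points and not as relative percentages. A minimal sketch of that arithmetic, using only the four averages quoted above; the dictionary layout and function names are illustrative assumptions, not part of the paper's code:

```python
# Average accuracies (in percent) quoted from Table 2 in the results above.
REPORTED = {
    "0-shot": {"Gemini Pro": 79.2, "GPT-4 Turbo": 88.1},
    "k-shot CoT": {"Gemini Pro": 82.1, "GPT-4 Turbo": 89.5},
}


def gap_in_points(setting: str) -> float:
    """Gemini Pro minus GPT-4 Turbo, in percentage points."""
    scores = REPORTED[setting]
    return round(scores["Gemini Pro"] - scores["GPT-4 Turbo"], 1)


def relative_gap(setting: str) -> float:
    """The same gap expressed relative to the GPT-4 Turbo average, in percent."""
    scores = REPORTED[setting]
    baseline = scores["GPT-4 Turbo"]
    return round(100 * (scores["Gemini Pro"] - baseline) / baseline, 1)


if __name__ == "__main__":
    for setting in REPORTED:
        print(setting, gap_in_points(setting), "points;", relative_gap(setting), "% relative")
```

The two views differ noticeably: a gap of 8.9 points is about 10% of the GPT‑4 Turbo average, so it helps to state which one a delta refers to.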

What To Try In 7 Days

Run 200–500 samples from your own task through Gemini Pro and GPT‑3.5 Turbo to compare cost and accuracy.

If visuals matter, benchmark a representative set of 50–100 VCR-style image questions against GPT‑4V and Gemini Pro Vision.

Add a quick rationale check: sample model explanations and assess whether they are trustworthy for your downstream use.
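The accuracy comparison in the steps above can be sketched as a small scoring helper. This is a minimal illustration that assumes you have already collected each model's answers as lists aligned with your gold labels; it makes no API calls, and the function names and toy data are hypothetical. Cost tracking is left out.

```python
from typing import Dict, Sequence


def accuracy(gold: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(gold) != len(preds):
        raise ValueError("gold and preds must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, preds)) / len(gold)


def compare_models(
    gold: Sequence[str], preds_a: Sequence[str], preds_b: Sequence[str]
) -> Dict[str, float]:
    """Score two models on the same items and count where only one is right."""
    acc_a = accuracy(gold, preds_a)
    acc_b = accuracy(gold, preds_b)
    a_only = sum(a == g and b != g for g, a, b in zip(gold, preds_a, preds_b))
    b_only = sum(b == g and a != g for g, a, b in zip(gold, preds_a, preds_b))
    return {
        "acc_a": acc_a,
        "acc_b": acc_b,
        "delta_points": round(100 * (acc_a - acc_b), 1),
        "a_only_correct": a_only,
        "b_only_correct": b_only,
    }


if __name__ == "__main__":
    # Toy multiple-choice data, invented purely for illustration.
    gold = ["A", "B", "C", "D", "A"]
    model_a = ["A", "B", "C", "A", "A"]
    model_b = ["A", "C", "C", "D", "B"]
    print(compare_models(gold, model_a, model_b))
```

The items where only one model is correct are a natural place to start the rationale check, since they show where the two models actually disagree.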

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluation limited to selected datasets and sampled subsets (200 language, 50 visual examples).

Results are English-only and may not generalize cross-lingually or culturally.
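To see how much the sampling limitation can matter, the sketch below computes a 95% Wilson score interval for an accuracy measured on a small sample. It assumes the score is a simple proportion of correct answers over independent examples, which the summary does not state explicitly; the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed on n samples."""
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    z2_over_n = z * z / n
    denom = 1.0 + z2_over_n
    center = (p_hat + z2_over_n / 2.0) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Q→A scores quoted in the summary, treated here as proportions over
    # the 50 visual examples (an assumption made for illustration).
    for name, score in [("Gemini Pro Vision", 0.74), ("GPT-4V", 0.80)]:
        low, high = wilson_interval(score, 50)
        print(f"{name}: {low:.3f} to {high:.3f}")
```

Under these assumptions, a score of 74% on 50 examples is compatible with a true accuracy anywhere from roughly 60% to 84%, so small gaps between models on the visual subset should be read cautiously.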

When Not To Use

High-stakes decisions requiring top-tier accuracy; prefer GPT‑4 class models.

Applications needing reliable image-based emotion recognition without further validation.

Failure Modes

Context misinterpretation causing wrong inferences (common).

Incorrect or missing emotional cues in images (Gemini Pro Vision).

Core Entities

Models

Gemini Pro, Gemini Pro Vision, GPT-4 Turbo, GPT-4V, GPT-3.5 Turbo, Llama-2-70b-chat

Metrics

Accuracy

Datasets

CommonsenseQA, Cosmos QA, αNLI, HellaSWAG, TRAM, NumerSense, PIQA, QASC, RiddleSense, Social IQa, ETHICS, VCR

Benchmarks

VCR, HellaSWAG, CommonsenseQA, TRAM