Overview
Production Readiness
0.65
Novelty Score
0.25
Cost Impact Score
0.4
Citation Count
14
Why It Matters For Business
Gemini Pro is close to GPT‑3.5 for language commonsense but behind GPT‑4; pick models based on accuracy needs and multimodal complexity.
Summary TLDR
This paper evaluates Google's Gemini (Pro and Pro Vision) on 12 commonsense datasets (11 language, 1 multimodal) against Llama-2-70b-chat, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V. Gemini Pro matches or slightly exceeds GPT‑3.5 Turbo on the language benchmarks (average ≈79.2% vs 78.2%) but trails GPT‑4 Turbo by ~8% on the same sets. On the vision task (VCR), Gemini Pro Vision scores lower than GPT‑4V (Q→A 74 vs 80), with particular weaknesses in social and temporal reasoning and in emotion recognition. Data and code are on GitHub.
Problem Statement
Earlier benchmark results suggested that Gemini was weaker at commonsense reasoning, but those checks used limited data. The paper asks: how well does Gemini (language and vision) handle diverse commonsense tasks when tested across a broader, multi-dataset suite?
Main Contribution
Systematic evaluation of Gemini Pro and Gemini Pro Vision on 12 commonsense datasets (11 language, 1 visual).
Head-to-head comparison with Llama-2-70b-chat, GPT‑3.5 Turbo, GPT‑4 Turbo, and GPT‑4V under zero-shot and few-shot chain-of-thought (CoT) prompts.
Error and manual reasoning analysis that highlights weaknesses in temporal, social, and emotion-recognition reasoning.
Key Findings
Gemini Pro's language-only accuracy is similar to GPT‑3.5 Turbo.
GPT‑4 Turbo leads by a clear margin on evaluated commonsense benchmarks.
Gemini Pro Vision lags GPT‑4V on the VCR visual commonsense benchmark.
A substantial share of Gemini Pro's explanations are judged logically sound.
Common error modes: context misinterpretation and emotion-recognition failures in vision.
Results
Accuracy charts (not reproduced here): per-model average (k-shot CoT); VCR visual commonsense (Q → A, QA → R, Q → AR); reasoning justification quality (sampled explanations).
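For readers unfamiliar with the three VCR scores, here is a minimal sketch of how they relate. In VCR, Q → A is answer accuracy, QA → R is rationale accuracy when the correct answer is supplied, and Q → AR counts an item as correct only when both the answer and the rationale are right. The function name and the boolean-list inputs are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, Sequence


def vcr_accuracies(
    answer_correct: Sequence[bool],
    rationale_correct: Sequence[bool],
) -> Dict[str, float]:
    """Compute the three VCR scores from per-item correctness flags.

    answer_correct[i]    -- model picked the right answer for item i (Q -> A)
    rationale_correct[i] -- model picked the right rationale for item i,
                            given the correct answer (QA -> R)
    """
    n = len(answer_correct)
    if n == 0 or n != len(rationale_correct):
        raise ValueError("inputs must be non-empty and equal in length")

    q_a = sum(answer_correct) / n
    qa_r = sum(rationale_correct) / n
    # Q -> AR is a joint score: both the answer and the rationale must be right.
    q_ar = sum(a and r for a, r in zip(answer_correct, rationale_correct)) / n
    return {"Q->A": q_a, "QA->R": qa_r, "Q->AR": q_ar}


scores = vcr_accuracies(
    answer_correct=[True, True, False, True],
    rationale_correct=[True, False, True, True],
)
```

Because Q → AR is a joint score, it can never exceed the lower of the other two.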
Who Should Care
What To Try In 7 Days
Run 200–500 samples from your own task through Gemini Pro and GPT‑3.5 Turbo and compare cost and accuracy.
If visuals matter, benchmark 50–100 representative VCR-style image QA pairs on GPT‑4V and Gemini Pro Vision.
Add a quick rationale check: sample model explanations and assess whether they are trustworthy for your downstream use.
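A side-by-side comparison of two models on the same samples could be sketched as follows. This assumes you have already collected each model's predicted labels as plain strings. `compare_models` and its output keys are hypothetical names, and cost tracking is left out because it depends on your provider's billing.

```python
from typing import Dict, Sequence


def compare_models(
    gold: Sequence[str],
    preds_a: Sequence[str],
    preds_b: Sequence[str],
) -> Dict[str, float]:
    """Score two models on the same items and count where they disagree."""
    if not gold or not (len(gold) == len(preds_a) == len(preds_b)):
        raise ValueError("gold, preds_a, and preds_b must be non-empty and equal in length")

    a_hits = [p == g for p, g in zip(preds_a, gold)]
    b_hits = [p == g for p, g in zip(preds_b, gold)]
    n = len(gold)
    return {
        "accuracy_a": sum(a_hits) / n,
        "accuracy_b": sum(b_hits) / n,
        # Items that only one model gets right are the ones worth reading by hand.
        "only_a_correct": sum(a and not b for a, b in zip(a_hits, b_hits)),
        "only_b_correct": sum(b and not a for a, b in zip(a_hits, b_hits)),
    }


report = compare_models(
    gold=["A", "B", "C", "D"],
    preds_a=["A", "B", "C", "A"],
    preds_b=["A", "C", "C", "D"],
)
```

When the two accuracies are close, the disagreement counts show whether the models fail on the same items or on different ones.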
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to selected datasets and sampled subsets (200 language, 50 visual examples).
- Results are English-only and may not generalize cross-lingually or culturally.
- APIs and closed models can change; this is a snapshot of available versions.
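To get a feel for how much uncertainty the small sample sizes introduce, the sketch below computes a Wilson score interval for an accuracy estimate. This is a standard statistical technique, not something the paper reports. The example counts (37 of 50) are an illustrative assumption chosen to correspond to a 74% score on 50 examples.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Illustrative counts only: 37/50 and 148/200 both correspond to 74% accuracy.
low_50, high_50 = wilson_interval(37, 50)
low_200, high_200 = wilson_interval(148, 200)
```

At the same observed accuracy, the interval for 50 examples is roughly twice as wide as the one for 200, so small gaps between models on the visual subset deserve extra caution.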
When Not To Use
- High-stakes decisions requiring top-tier accuracy; prefer GPT‑4-class models.
- Applications needing reliable image-based emotion recognition without further validation.
- Temporal or nuanced social reasoning scenarios needing precise context handling.
Failure Modes
- Context misinterpretation causing wrong inferences (common).
- Incorrect or missing emotional cues in images (Gemini Pro Vision).
- Temporal reasoning gaps on ambiguous or underspecified timelines.
- Model refusals to answer counted as incorrect in evaluations.
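The refusal rule in the last item can be sketched as a scoring function that treats any response without a parseable choice as wrong. The answer format matched here ("Answer: B" or "The answer is C") is an assumption for illustration. `ANSWER_PATTERN`, `extract_choice`, and `score` are hypothetical names, not the paper's code.

```python
import re
from typing import Dict, Optional, Sequence

# Assumed response format: "Answer: B", "answer (B)", or "The answer is C".
ANSWER_PATTERN = re.compile(r"answer(?:\s+is)?\s*[:\-]?\s*\(?([A-E])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no choice can be parsed."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


def score(responses: Sequence[str], gold: Sequence[str]) -> Dict[str, float]:
    """Accuracy where refusals and unparseable responses count as incorrect."""
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal in length")
    correct = 0
    unanswered = 0
    for response, label in zip(responses, gold):
        choice = extract_choice(response)
        if choice is None:
            unanswered += 1  # no credit: a refusal is scored as a wrong answer
        elif choice == label:
            correct += 1
    return {"accuracy": correct / len(gold), "unanswered": unanswered}


result = score(
    responses=["Answer: B", "I'm sorry, I can't help with that.", "The answer is C."],
    gold=["B", "A", "D"],
)
```

Reporting the unanswered count alongside accuracy makes it possible to tell a model that answers wrongly from one that declines to answer.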
Core Entities
Models
- Gemini Pro
- Gemini Pro Vision
- GPT-4 Turbo
- GPT-4V
- GPT-3.5 Turbo
- Llama-2-70b-chat
Metrics
- Accuracy
Datasets
- CommonsenseQA
- Cosmos QA
- αNLI
- HellaSwag
- TRAM
- NumerSense
- PIQA
- QASC
- RiddleSense
- Social IQa
- ETHICS
- VCR
Benchmarks
- VCR
- HellaSwag
- CommonsenseQA
- TRAM

