Which LLM and reasoning setup solves Raven-style visual puzzles best?

Overview

Decision Snapshot: Needs Validation

The benchmark is practical and informative, but it reports best-of-five runs, lacks variance estimates, and uses a single dataset. These gaps limit immediate production trust; the results remain useful for rapid prototyping and architecture selection.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 60%

Authors

Sinan Urgun, Seçkin Arı

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Model choice and reasoning setup materially change correctness and failure modes. CoT explanations can be misleading; always validate outputs. Coverage drops (refusals) can hide failures and skew metrics.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

This paper benchmarks four LLMs (GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, LLaMA-3.3-70B-q4) across four reasoning architectures on 1,200 RAVEN-FAIR Raven-style problems. Main takeaways: GPT-4.1-Mini achieved the highest peak accuracy (53.9% with embedding-controlled repetition). Chain-of-Thought (CoT) quality did not reliably predict final accuracy. Multi-agent and embedding strategies help some models but can increase numeric errors or coverage drops. Results use best-of-five runs (best-case), not averages.

Problem Statement

Measure how reasoning architecture (single-shot, embedding-repeat, self-reflection, multi-agent) affects LLM ability to solve Raven-style abstract visual puzzles when models must generate answers and render image outputs without being given choices.

Main Contribution

A systematic benchmark of four LLMs across four reasoning architectures on 1,200 RAVEN-FAIR problems with both visual (SSIM/LPIPS) and textual (CoT) evaluation.

Empirical finding that Chain-of-Thought quality often dissociates from answer correctness ('CoT-Accuracy Paradox').

Key Findings

GPT-4.1-Mini achieved the highest peak accuracy among tested models.

Numbers: 53.92% accuracy (embedding-controlled), 46.91% (single-shot)

Practical Use: For cost-sensitive setups, try GPT-4.1-Mini with embedding repetition first; it gave the best cost-performance in these experiments.

Evidence Ref: Tables 1–2

High Chain-of-Thought (CoT) scores do not guarantee correct answers.

Numbers: LLaMA CoT ≈ 8.3 vs accuracy 32.57% (single-shot)

Practical Use: Do not use CoT quality alone as a proxy for correctness; always validate final outputs against task-specific metrics.

Evidence Ref: Table 1, CoT vs Accuracy analysis
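One way to check this dissociation on your own data is to compare CoT scores between correct and incorrect answers. The sketch below assumes each record carries a CoT score (a 0 to 10 scale is assumed here) and a boolean correctness flag. The records, field names, and the helper `cot_gap` are made up for illustration and are not taken from the paper.

```python
from statistics import mean

def cot_gap(records):
    """Mean CoT score of correct answers minus mean CoT score of incorrect ones.

    A gap near zero (or negative) suggests CoT quality carries little
    signal about whether the final answer is right.
    """
    correct = [r["cot"] for r in records if r["correct"]]
    wrong = [r["cot"] for r in records if not r["correct"]]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one incorrect record")
    return mean(correct) - mean(wrong)

# Hypothetical records, NOT results from the paper.
records = [
    {"cot": 8.5, "correct": False},
    {"cot": 8.0, "correct": True},
    {"cot": 9.0, "correct": False},
    {"cot": 8.5, "correct": True},
]

gap = cot_gap(records)
print(f"CoT gap (correct - incorrect): {gap:+.2f}")
```

If the gap stays small across models and architectures, CoT scores are better used for debugging than as an acceptance gate.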

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	GPT-4.1-Mini 53.92% (embedding-controlled)	GPT-4.1-Mini single-shot 46.91%	+7.01 pp	RAVEN-FAIR (n=1200)	Table 2: embedding-based architecture	—
Accuracy	LLaMA-3.3-70B 41.33% (multi-agent)	LLaMA single-shot 32.57%	+8.76 pp	RAVEN-FAIR (n=1200)	Table 4: multi-agent results	—
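The Delta column is a difference in percentage points (pp), not a relative change. The short sketch below recomputes both quantities from the figures in the table above; the helper names `delta_pp` and `relative_gain` are illustrative and not from the paper.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to 2 decimals."""
    return round(value - baseline, 2)

def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent, rounded to 1 decimal."""
    return round(100.0 * (value - baseline) / baseline, 1)

# (value, baseline) pairs copied from the Results table.
rows = {
    "GPT-4.1-Mini (embedding-controlled vs single-shot)": (53.92, 46.91),
    "LLaMA-3.3-70B (multi-agent vs single-shot)": (41.33, 32.57),
}

for name, (value, baseline) in rows.items():
    print(f"{name}: {delta_pp(value, baseline):+.2f} pp, "
          f"{relative_gain(value, baseline):+.1f}% relative")
```

Reporting both avoids a common misreading: a gain of about 7 pp on a 46.91% baseline is a noticeably larger relative change than the pp figure alone suggests.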

What To Try In 7 Days

Run GPT-4.1-Mini on a representative subset with single-shot and embedding-controlled repetition to compare cost vs accuracy.

Instrument coverage and refusal rates when enabling self-reflection; measure how many examples are lost.

Treat CoT as diagnostic, not proof—add a holdout correctness check for outputs (image similarity or task-specific validator).
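The coverage instrumentation suggested above can be sketched as follows. The idea is to report accuracy both on answered examples and on all examples, so that refusals cannot quietly inflate the headline number. The `Outcome` type, the `coverage_report` helper, and the sample data are assumptions for illustration, not part of the paper's code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    answered: bool           # False if the model refused or returned nothing usable
    correct: Optional[bool]  # None when there was no answer to score

def coverage_report(outcomes):
    """Return coverage plus accuracy with and without refused examples."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to report on")
    answered = [o for o in outcomes if o.answered]
    n_correct = sum(1 for o in answered if o.correct)
    return {
        "coverage": len(answered) / total,
        "accuracy_on_answered": n_correct / len(answered) if answered else 0.0,
        "accuracy_on_all": n_correct / total,
    }

# Hypothetical run: 10 puzzles, 2 refusals, 4 correct answers.
outcomes = (
    [Outcome(True, True)] * 4
    + [Outcome(True, False)] * 4
    + [Outcome(False, None)] * 2
)
report = coverage_report(outcomes)
print(report)
```

Comparing `accuracy_on_answered` with `accuracy_on_all` across architectures shows how much of an apparent gain comes from skipping hard examples.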

Agent Features

Tool Use

JSON extraction, generate_visual_panel() tool call

Architectures

single-shot, embedding-controlled repetition, self-reflection, feature-based multi-agent

Collaboration

multi-agent feature specialization (shape/color/position/angle)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/SinanUrgunWork/An-Analysis-of-Architectural-Impact-on-LLM-based-Abstract-Visual-Reasoning/tree/main

Data URLs

https://drive.google.com/drive/folders/1Q_YRu5hCFw2c

Risks & Boundaries

Limitations

Results report best-of-five runs (best-case) rather than averages or confidence intervals.

Coverage variation (refusals) changes sample composition across architectures, confounding comparisons.
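To see why best-of-five reporting matters, it helps to put the best run next to the mean and standard deviation of the same runs. The sketch below does that with the standard library; the per-run accuracies are hypothetical placeholders, since the paper reports only the best run, and `summarize_runs` is an illustrative helper name.

```python
from statistics import mean, stdev

def summarize_runs(accuracies):
    """Summarize repeated runs: best case, average, and sample standard deviation."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "n": len(accuracies),
        "best": max(accuracies),
        "mean": mean(accuracies),
        "stdev": stdev(accuracies),
    }

# Hypothetical accuracies (%) from five runs, NOT figures from the paper.
runs = [50.0, 52.0, 54.0, 51.0, 53.0]
summary = summarize_runs(runs)
print(f"best={summary['best']:.2f}  "
      f"mean={summary['mean']:.2f} +/- {summary['stdev']:.2f}  (n={summary['n']})")
```

When reproducing the benchmark, logging every run and reporting the mean with its spread makes architecture comparisons less sensitive to a single lucky run.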

When Not To Use

Do not assume CoT quality implies correctness—avoid using CoT as the only metric for acceptance.

Avoid self-reflection by default for sensitive pipelines without monitoring coverage and refusal behavior.

Failure Modes

Semantic hallucination: inventing nonexistent patterns (high reported rates).

Numeric misperception: wrong sizes/angles leading to incorrect rendered answers.

Core Entities

Models

GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, LLaMA-3.3-70B-q4

Metrics

Accuracy, Coverage, SSIM, LPIPS, Chain-of-Thought Score, Semantic Hallucination Rate, Numeric Misperception Rate

Datasets

RAVEN-FAIR (n=1200)

Benchmarks

RAVEN-FAIR
