A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

Overview

Decision Snapshot: Needs Validation

The paper shows robust empirical evidence that many models produce fragile reasoning: correct chains appear but are not reliably accessible and break under trivial, irrelevant changes.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 20%

Novelty: 40%

Authors

Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Who Should Care

CTO, ML Engineer, Product Manager, CEO, Founder, Data Scientist

Summary TLDR

The authors introduce a tiny, unambiguous common-sense problem ("Alice has N brothers and M sisters; how many sisters does Alice's brother have?") and many closely related variations. Across hundreds of trials they find that most state-of-the-art LLMs either fail altogether or show wildly different success rates on variations that do not change problem structure. Control tasks show models can parse language and do arithmetic, so failures point to brittle generalization and fragile reasoning. Standard benchmarks (e.g., MMLU, GSM8K) often miss these defects. The paper releases code and raw responses and proposes AIW-style variation testing as a compact stress test for reasoning robustness.
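To make the template concrete, here is a minimal sketch of the prompt and its ground-truth answer. It assumes Alice is female, so each of her brothers has all of Alice's sisters plus Alice herself as sisters. The function names `aiw_prompt` and `aiw_answer` are illustrative and are not taken from the paper's code.

```python
def _plural(count: int, noun: str) -> str:
    """Return e.g. '1 sister' or '3 sisters'."""
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the AIW template from the summary with concrete numbers."""
    return (
        f"Alice has {_plural(n_brothers, 'brother')} and "
        f"{_plural(m_sisters, 'sister')}. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Ground truth, assuming Alice is female: her sisters plus Alice.

    The number of brothers is irrelevant to the answer.
    """
    return m_sisters + 1


# Changing only the numbers leaves the problem structure untouched.
for n, m in [(3, 6), (4, 1), (2, 4)]:
    print(aiw_prompt(n, m), "->", aiw_answer(n, m))
```

Note that the brother count never enters the answer, which is why changing it counts as an irrelevant variation.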

Problem Statement

Do modern LLMs robustly generalize basic, everyday reasoning? The paper asks whether top models can reliably solve a very short, clear common-sense math word problem and behave consistently when only irrelevant numeric details change. If models truly generalize, answers should be stable across these trivial variations.

Main Contribution

Define AIW: a minimal common-sense problem template and multiple natural variations that preserve problem structure.

Systematic evaluation of many SOTA models (closed and open) using repeated trials per variation and three prompt styles.
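The evaluation design (repeated trials for every variation and prompt style) can be sketched as below. This is an illustration under stated assumptions, not the authors' harness: `ask_model` is a hypothetical callable that sends a prompt to a model and returns its final numeric answer, and the three entries in `PROMPT_STYLES` are placeholders, since this summary does not give the paper's exact prompt wordings.

```python
from typing import Callable, Dict, Tuple

# Placeholder prompt styles (assumption: not the paper's actual wording).
PROMPT_STYLES: Dict[str, str] = {
    "plain": "{question}",
    "step_by_step": "{question} Think step by step, then give the final number.",
    "answer_only": "{question} Reply with a single number.",
}


def correct_rate(
    ask_model: Callable[[str], int], prompt: str, truth: int, n_trials: int = 30
) -> float:
    """Fraction of repeated trials in which the model's answer equals truth."""
    hits = sum(1 for _ in range(n_trials) if ask_model(prompt) == truth)
    return hits / n_trials


def evaluate_grid(
    ask_model: Callable[[str], int],
    variations: Dict[str, Tuple[str, int]],
    n_trials: int = 30,
) -> Dict[Tuple[str, str], float]:
    """Correct response rate for every (variation, prompt style) pair."""
    rates = {}
    for var_name, (question, truth) in variations.items():
        for style_name, template in PROMPT_STYLES.items():
            prompt = template.format(question=question)
            rates[(var_name, style_name)] = correct_rate(
                ask_model, prompt, truth, n_trials
            )
    return rates


def always_four(prompt: str) -> int:
    """Stand-in for a real model call: always answers 4."""
    return 4


variations = {
    "v1": ("Alice has 3 brothers and 6 sisters. "
           "How many sisters does Alice's brother have?", 7),
    "v2": ("Alice has 4 brothers and 3 sisters. "
           "How many sisters does Alice's brother have?", 4),
}
rates = evaluate_grid(always_four, variations, n_trials=5)
for key, rate in sorted(rates.items()):
    print(key, rate)
```

Keeping the rates keyed by variation and style, rather than averaging immediately, preserves the per-variation view the paper relies on.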

Key Findings

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models have p_correct < 0.2; GPT-4o p=0.649, Claude 3 Opus p=0.431, many models p≈0

Practical Use: Don't trust high benchmark scores alone; add targeted robustness checks for trivial, unambiguous prompts before deployment.

Evidence Ref: Fig.3, Fig.33, Sec.3.1

Performance swings dramatically across trivial, structure-preserving variations.

Numbers: The correct response rate can swing from roughly 0 to roughly 1 across variations for the same model (e.g., GPT-4 scores near 0 on variation 3 and near 1 on variation 4)

Practical Use: Run many randomized or permuted instances of the same task; a single-instance pass/fail or an averaged score can hide catastrophic brittleness.

Evidence Ref: Fig.1, Fig.4, Sec.3.1
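A short sketch of why an averaged score can hide this kind of swing: summarize per-variation rates by their spread as well as their mean. The rates in `example_rates` are made-up illustrative values, not results from the paper, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, pstdev
from typing import Dict


def summarize(rates: Dict[str, float]) -> Dict[str, float]:
    """Report the mean alongside min, max, range, and standard deviation."""
    values = list(rates.values())
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": pstdev(values),
    }


# Illustrative values only (assumption): one variation nearly always fails.
example_rates = {"v1": 0.9, "v2": 0.6, "v3": 0.05, "v4": 0.95}
summary = summarize(example_rates)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A moderate mean combined with a range close to 1 is the signature to look for: the model can solve the task, yet fails almost completely on at least one structurally identical variation.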

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
correct response rate (averaged over AIW variations 1-4 and prompts)	GPT-4o = 0.649	—	—	AIW variations 1-4 (avg)	Fig.3, Sec.3.1	—
correct response rate (averaged over AIW variations 1-4 and prompts)	Claude 3 Opus = 0.431	—	—	AIW variations 1-4 (avg)	Fig.3, Sec.3.1	—

What To Try In 7 Days

Run an AIW-style test suite (simple templates plus many number permutations) against your production model.

Add control tasks to confirm failures are high-level (not parsing or arithmetic).

Treat fluent confidence as unreliable; add deterministic verification or a secondary verifier for critical numeric outputs.
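The first and third steps above can be sketched together: generate many number permutations of the template, then check each free-text response deterministically instead of trusting how confident it sounds. This is a minimal sketch under assumptions: `build_suite`, `extract_final_number`, and `verify` are hypothetical helpers, and taking the last integer in a response as the final answer is a simple heuristic that real outputs may defeat.

```python
import itertools
import re
from typing import List, Optional, Tuple


def build_suite(max_brothers: int = 5, max_sisters: int = 5) -> List[Tuple[str, int]]:
    """All (prompt, ground truth) pairs for counts from 2 up to the maxima."""
    suite = []
    counts = itertools.product(range(2, max_brothers + 1), range(2, max_sisters + 1))
    for n, m in counts:
        prompt = (
            f"Alice has {n} brothers and {m} sisters. "
            "How many sisters does Alice's brother have?"
        )
        suite.append((prompt, m + 1))  # sisters of a brother: M plus Alice
    return suite


def extract_final_number(response: str) -> Optional[int]:
    """Heuristic (assumption): treat the last integer in the text as the answer."""
    numbers = re.findall(r"\d+", response)
    return int(numbers[-1]) if numbers else None


def verify(response: str, truth: int) -> bool:
    """Deterministic check that ignores fluency and looks only at the number."""
    return extract_final_number(response) == truth


suite = build_suite(3, 4)
print(len(suite), "test prompts")
print(verify("Each brother has 6 + 1 = 7 sisters.", 7))
print(verify("Clearly the answer is 6.", 7))
```

For production use, asking the model to emit the answer in a fixed, machine-readable format would make the extraction step more reliable than the last-integer heuristic.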

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

AIW repo (referenced in paper; raw responses and code released)

Data URLs

AIW repo (collected raw response data referenced in paper)

Risks & Boundaries

Limitations

Possible training/test leakage for public AIW original instances (authors note evidence for fine-tuned models benefiting on some AIW variants).

AIW family is narrow: focuses on relational counting templates and small-number variations, not broad cognitive tasks.

When Not To Use

As the only evaluation of model capability: AIW is a focused stress test, not a comprehensive benchmark for all tasks.

To evaluate tokenization or low-level parsing: the control tasks already show those are not the issue.

Failure Modes

Fragile generalization: correct solution appears inconsistently across trivial variations.

Overconfidence and confabulation: fluent but wrong explanations that mislead users.

Core Entities

Models

GPT-4o, GPT-4, GPT-4-0613, GPT-4-turbo, GPT-3.5-turbo, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3.1 405B, Llama 3 70B, Llama 3 8B, Llama 2 70B, Mistral-7B, Mixtral, Qwen 2.5 72B, Command R+, Dbrx Instruct, DeepSeek R1, o1-preview, o1-mini, NuminaMath-7B

Metrics

correct response rate; unified robustness score R; frequency distribution of numeric outputs; variance across variations

Datasets

AIW (Alice in Wonderland) variations; AIW Light control problems; AIW+, AIW Ext, AIW Friends, AIW Colleague Circles

Benchmarks

MMLU, GSM8K, ARC, HellaSwag, MATH, AIME
