A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

June 4, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper shows robust empirical evidence that many models produce fragile reasoning: correct chains appear but are not reliably accessible and break under trivial, irrelevant changes.

Citations: 20

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 20%

Novelty: 40%

Authors

Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev

Links

Abstract / PDF / Code / Data

Why It Matters For Business

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Who Should Care

Summary TLDR

The authors introduce a tiny, unambiguous common-sense problem ("Alice has N brothers and M sisters; how many sisters does Alice's brother have?") and many closely related variations. Across hundreds of trials they find that most state-of-the-art LLMs either fail altogether or show wildly different success rates on variations that do not change problem structure. Control tasks show models can parse language and do arithmetic, so failures point to brittle generalization and fragile reasoning. Standard benchmarks (e.g., MMLU, GSM8K) often miss these defects. The paper releases code and raw responses and proposes AIW-style variation testing as a compact stress test for reasoning robustness.
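To make the expected answer concrete, here is a minimal sketch of the template and its ground truth. It assumes Alice herself counts as a sister of each brother, so the answer is M + 1. The function names `aiw_prompt` and `aiw_answer` are illustrative and are not taken from the authors' repository.

```python
def _count(k: int, noun: str) -> str:
    """Render a count with simple English pluralization, e.g. '1 sister', '3 sisters'."""
    return f"{k} {noun}" + ("" if k == 1 else "s")


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the AIW template with concrete counts (wording follows the summary above)."""
    return (
        f"Alice has {_count(n_brothers, 'brother')} and {_count(m_sisters, 'sister')}. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Ground truth: each brother's sisters are Alice's sisters plus Alice herself.

    The number of brothers is irrelevant to the answer; it only has to be >= 1
    for the question to make sense.
    """
    if n_brothers < 1:
        raise ValueError("the question presupposes at least one brother")
    return m_sisters + 1


print(aiw_prompt(3, 6))
print(aiw_answer(3, 6))  # 6 sisters + Alice = 7
```

Because the brother count never enters the answer, changing N is exactly the kind of irrelevant variation the paper uses to probe stability.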

Problem Statement

Do modern LLMs robustly generalize basic, everyday reasoning? The paper asks whether top models can reliably solve a very short, clear common-sense math word problem and behave consistently when only irrelevant numeric details change. If models truly generalize, answers should be stable across these trivial variations.

Main Contribution

Define AIW: a minimal common-sense problem template and multiple natural variations that preserve problem structure.

Systematic evaluation of many SOTA models (closed and open) using repeated trials per variation and three prompt styles.
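The repeated-trial evaluation can be sketched as a per-variation correct response rate. This is an illustration only: the `(variation, predicted, expected)` record layout is an assumption, not the format of the authors' released data, and the sample trials are made up.

```python
from collections import defaultdict


def correct_rate_by_variation(trials):
    """Return {variation: fraction of trials whose predicted answer matched}.

    `trials` is an iterable of (variation_id, predicted, expected) tuples;
    this record layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # variation -> [n_correct, n_total]
    for variation, predicted, expected in trials:
        counts[variation][1] += 1
        if predicted == expected:
            counts[variation][0] += 1
    return {v: correct / total for v, (correct, total) in counts.items()}


# Made-up trials for illustration; not results from the paper.
trials = [
    ("v1", 2, 2), ("v1", 1, 2),
    ("v2", 7, 7), ("v2", 7, 7),
]
rates = correct_rate_by_variation(trials)
print(rates)
```

Keeping the rate per variation, rather than pooling all trials, is what lets inconsistencies between variations show up at all.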

Key Findings

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models p_correct < 0.2; GPT‑4o p=0.649, Claude 3 Opus p=0.431, many models p≈0

Practical Use: Don't trust high benchmark scores alone; add targeted robustness checks for trivial, unambiguous prompts before deployment.

Evidence Ref: Fig.3, Fig.33, Sec.3.1

Performance swings dramatically across trivial, structure-preserving variations.

Numbers: Correct rate can vary from roughly 0 to 1 across variations for the same model (e.g., GPT‑4 is near 0 on variation 3 and near 1 on variation 4)

Practical Use: Run many randomized or permuted instances of the same task; a single-instance pass/fail or an averaged score can hide catastrophic brittleness.

Evidence Ref: Fig.1, Fig.4, Sec.3.1
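A small sketch of why an averaged score can mislead: report the worst variation and the spread alongside the mean. The rates below are invented for illustration and are not figures from the paper. This summary is also not the paper's unified robustness score R, whose definition is not given here.

```python
from statistics import mean


def robustness_summary(rates):
    """Summarize per-variation correct rates: mean, min, max, and max - min spread."""
    values = list(rates.values())
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "spread": max(values) - min(values),
    }


# Hypothetical per-variation rates; a mid-range mean hides a total failure on v3.
illustrative = {"v1": 0.5, "v2": 0.75, "v3": 0.0, "v4": 1.0}
summary = robustness_summary(illustrative)
print(summary)
```

In this made-up case the mean looks moderate while one variation never succeeds, which is the pattern the finding above warns about.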

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
correct response rate (averaged over AIW variations 1-4 and prompts) | GPT-4o = 0.649 | - | - | AIW variations 1-4 (avg) | - | Fig.3, Sec.3.1
correct response rate (averaged over AIW variations 1-4 and prompts) | Claude 3 Opus = 0.431 | - | - | AIW variations 1-4 (avg) | - | Fig.3, Sec.3.1

What To Try In 7 Days

Run an AIW-style test suite (simple templates plus many number permutations) against your production model.

Add control tasks to confirm failures are high-level (not parsing or arithmetic).

Treat fluent confidence as unreliable; add deterministic verification or a secondary verifier for critical numeric outputs.
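The steps above can be sketched as a small harness: generate number permutations of the template, query the model, and check the final number deterministically. Everything here is an assumption for illustration. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's text, and taking the last integer in the reply is a crude extraction rule rather than the authors' parsing method.

```python
import itertools
import re


def _count(k, noun):
    """Render a count with simple pluralization, e.g. '1 sister', '3 sisters'."""
    return f"{k} {noun}" + ("" if k == 1 else "s")


def build_suite(max_count=4):
    """All (brothers, sisters) pairs in 1..max_count, each with its expected answer."""
    suite = []
    for n, m in itertools.product(range(1, max_count + 1), repeat=2):
        prompt = (
            f"Alice has {_count(n, 'brother')} and {_count(m, 'sister')}. "
            "How many sisters does Alice's brother have?"
        )
        suite.append((prompt, m + 1))  # Alice's sisters plus Alice herself
    return suite


def last_integer(text):
    """Return the last integer in the text, or None if there is none.

    Crude on purpose: a reply that ends with an unrelated number will be
    misread, so asking the model for a fixed final-answer format helps.
    """
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def run_suite(ask_model, suite):
    """Return a list of (prompt, expected, got) for every instance answered wrongly."""
    failures = []
    for prompt, expected in suite:
        got = last_integer(ask_model(prompt))
        if got != expected:
            failures.append((prompt, expected, got))
    return failures
```

A typical use is to call `run_suite` several times per model, since the paper's point is that success on one instance or one run says little about the others.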

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Code URLs

AIW repo (referenced in paper; raw responses and code released)

Data URLs

AIW repo (collected raw response data referenced in paper)

Risks & Boundaries

Limitations

Possible training/test leakage for public AIW original instances (authors note evidence for fine-tuned models benefiting on some AIW variants).

AIW family is narrow: focuses on relational counting templates and small-number variations, not broad cognitive tasks.

When Not To Use

As the only evaluation for model capability: AIW is a focused stress test, not a comprehensive benchmark for all tasks.

To evaluate tokenization/low-level parsing — control tasks already show those are not the issue.

Failure Modes

Fragile generalization: correct solution appears inconsistently across trivial variations.

Overconfidence and confabulation: fluent but wrong explanations that mislead users.

Core Entities

Models

GPT-4o, GPT-4, GPT-4-0613, GPT-4-turbo, GPT-3.5-turbo, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3.1 405B, Llama 3 70B, Llama 3 8B, Llama 2 70B, Mistral-7B, Mixtral, Qwen 2.5 72B, Command R+, Dbrx Instruct, DeepSeek R1, o1-preview, o1-mini, NuminaMath-7B

Metrics

correct response rate, unified robustness score R, frequency distribution of numeric outputs, variance across variations

Datasets

AIW (Alice in Wonderland) variations, AIW Light control problems, AIW+, AIW Ext, AIW Friends, AIW Colleague Circles

Benchmarks

MMLU, GSM8K, ARC, HellaSwag, MATH, AIME