A tiny common-sense math prompt exposes dramatic, inconsistent reasoning in many SOTA LLMs

June 4, 2024 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.4

Cost Impact Score

0.2

Citation Count

20

Authors

Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev

Links

Abstract / PDF

Why It Matters For Business

High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.

Summary TLDR

The authors introduce a tiny, unambiguous common-sense problem ("Alice has N brothers and M sisters; how many sisters does Alice's brother have?") and many closely related variations. Across hundreds of trials they find that most state-of-the-art LLMs either fail altogether or show wildly different success rates on variations that do not change problem structure. Control tasks show models can parse language and do arithmetic, so failures point to brittle generalization and fragile reasoning. Standard benchmarks (e.g., MMLU, GSM8K) often miss these defects. The paper releases code and raw responses and proposes AIW-style variation testing as a compact stress test for reasoning robustness.

Problem Statement

Do modern LLMs robustly generalize basic, everyday reasoning? The paper asks whether top models can reliably solve a very short, clear common-sense math word problem and behave consistently when only the numbers change and the problem structure stays the same. If models truly generalize, their correct rate should be stable across these trivial variations.
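As a minimal sketch of what "structure-preserving variations" means here, the snippet below fills the AIW template with different numbers and computes the expected answer. The template wording, the function names (`aiw_answer`, `make_variations`), and the chosen numbers are illustrative assumptions, not the paper's exact prompts. The ground truth follows from the problem itself: each brother's sisters are Alice's M sisters plus Alice, so the answer is M + 1 and N plays no role.

```python
from itertools import product

# Hypothetical template wording; the paper's exact prompt text may differ.
TEMPLATE = (
    "Alice has {n} brothers and she also has {m} sisters. "
    "How many sisters does Alice's brother have?"
)

def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Expected answer: Alice's M sisters plus Alice herself.

    The number of brothers does not affect the result.
    """
    return m_sisters + 1

def make_variations(brothers, sisters):
    """Build one prompt per (N, M) pair, each with its ground-truth answer."""
    return [
        {"n": n, "m": m,
         "prompt": TEMPLATE.format(n=n, m=m),
         "answer": aiw_answer(n, m)}
        for n, m in product(brothers, sisters)
    ]

# Illustrative numbers only (not the paper's variation set).
variations = make_variations(brothers=[2, 3, 4], sisters=[2, 5, 6])
```

Every prompt in `variations` has the same structure, so a model that has actually generalized the relation should score about equally well on all of them.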

Main Contribution

Define AIW: a minimal common-sense problem template and multiple natural variations that preserve problem structure.

Systematic evaluation of many SOTA models (closed and open) using repeated trials per variation and three prompt styles.

Control experiments (AIW Light family) that rule out low-level failures like tokenization or arithmetic.

Introduce a simple unified robustness score R that penalizes both low accuracy and uneven performance across variations.

Release code and raw response data to reproduce experiments.

Key Findings

Most SOTA models fail or perform inconsistently on a simple common-sense problem.

Numbers: Majority of models have p_correct < 0.2; GPT-4o p = 0.649, Claude 3 Opus p = 0.431, many models p ≈ 0

Performance swings dramatically across trivial, structure-preserving variations.

Numbers: Correct rate can vary from ~0 to ~1 across variations for the same model (e.g., GPT-4 is near 0 on variation 3 and near 1 on variation 4)

Failures are not due to low-level parsing or basic math.

Numbers: On AIW Light control tasks, most models reach near 1.0 correct across variations

Wrong answers are often accompanied by confident, plausible-sounding explanations (confabulations).

Results

correct response rate (averaged over AIW variations 1-4 and prompts)

Value: GPT-4o = 0.649

correct response rate (averaged over AIW variations 1-4 and prompts)

Value: Claude 3 Opus = 0.431

correct response rate (averaged over AIW variations 1-4 and prompts)

Value: Llama 2 70B Chat = 0.30

control-task correctness (AIW Light problems)

Value: Many models ≈ 1.0 correct across variations

unified robustness score R (example best)

Value: o1-preview ≈ 0.9 (others < 0.5)

Who Should Care

What To Try In 7 Days

Run an AIW-style test suite (simple templates plus many number permutations) against your production model.

Add control tasks to confirm failures are high-level (not parsing or arithmetic).

Treat fluent confidence as unreliable; add deterministic verification or a secondary verifier for critical numeric outputs.
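The three steps above can be sketched as a small harness: repeat each prompt several times, parse the numeric answer deterministically, and compare it with a known ground truth. Everything here is a hypothetical illustration. `ask_model` stands in for whatever client calls your production model, `stub_model` is a fake that mimics a common failure pattern, the "take the last integer" parsing rule is an assumption about the response format, and the control question is an example of a low-level check, not one of the paper's AIW Light prompts.

```python
import re
from collections import Counter
from typing import Callable, Optional

def extract_integer(response: str) -> Optional[int]:
    """Deterministic parser: return the last integer in the text, or None."""
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None

def run_suite(ask_model: Callable[[str], str], cases: list, trials: int = 5) -> dict:
    """Ask each prompt `trials` times and report the correct rate per case."""
    results = {}
    for case in cases:
        answers = [extract_integer(ask_model(case["prompt"])) for _ in range(trials)]
        correct = sum(a == case["answer"] for a in answers)
        results[case["name"]] = {
            "correct_rate": correct / trials,
            "answers": Counter(answers),  # distribution of parsed outputs
        }
    return results

PREFIX = "Alice has 3 brothers and she also has 6 sisters. "
cases = [
    # Main problem: the expected answer is M + 1 = 7.
    {"name": "aiw_n3_m6",
     "prompt": PREFIX + "How many sisters does Alice's brother have?",
     "answer": 7},
    # Hypothetical control: only needs parsing and addition (3 + 6 = 9).
    {"name": "control_total_siblings",
     "prompt": PREFIX + "How many siblings does Alice have in total?",
     "answer": 9},
]

def stub_model(prompt: str) -> str:
    """Fake model: fluent, passes the control, and forgets to count Alice."""
    if "brother have" in prompt:
        return "Alice has 6 sisters, so her brother has 6 sisters. Final answer: 6"
    return "3 brothers plus 6 sisters. Final answer: 9"

results = run_suite(stub_model, cases, trials=5)
```

With the stub, the control case passes and the main case fails, which is the signature the paper describes: parsing and arithmetic are intact, but the relational step is wrong. Replacing `stub_model` with a real client call is the only change needed to test an actual model.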

Reproducibility

Code Urls

  • AIW repo (referenced in paper; raw responses and code released)

Data Urls

  • AIW repo (collected raw response data referenced in paper)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible training/test leakage for public AIW original instances (authors note evidence for fine-tuned models benefiting on some AIW variants).
  • AIW family is narrow: focuses on relational counting templates and small-number variations, not broad cognitive tasks.
  • Some claims (e.g., the frequency of confabulation) are reported only qualitatively; not all failure modes have exact rates.

When Not To Use

  • As the only evaluation for model capability — AIW is a focused stress test, not a comprehensive benchmark for all tasks.
  • To evaluate tokenization/low-level parsing — control tasks already show those are not the issue.
  • For multimodal or long-horizon planning capabilities; AIW targets short, symbolic/common-sense reasoning.

Failure Modes

  • Fragile generalization: correct solution appears inconsistently across trivial variations.
  • Overconfidence and confabulation: fluent but wrong explanations that mislead users.
  • Outlier-driven averages: single-variation successes can mask overall brittleness.
  • Resistance to self-correction: models often fail to revise wrong answers when prompted.

Core Entities

Models

  • GPT-4o
  • GPT-4
  • GPT-4-0613
  • GPT-4-turbo
  • GPT-3.5-turbo
  • Claude 3 Opus
  • Claude 3.5 Sonnet
  • Llama 3.1 405B
  • Llama 3 70B
  • Llama 3 8B
  • Llama 2 70B
  • Mistral-7B
  • Mixtral
  • Qwen 2.5 72B
  • Command R+
  • Dbrx Instruct
  • DeepSeek R1
  • o1-preview
  • o1-mini
  • NuminaMath-7B

Metrics

  • correct response rate
  • unified robustness score R
  • frequency distribution of numeric outputs
  • variance across variations
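A rough sketch of how the first, third, and fourth metrics could be computed from parsed model answers follows. The function names and the toy data are made up for illustration and are not results from the paper. The unified robustness score R is deliberately left out, because this summary does not give its formula.

```python
from collections import Counter
from statistics import mean, pvariance

def correct_rate(answers, expected):
    """Fraction of trials whose parsed answer equals the expected one."""
    return sum(a == expected for a in answers) / len(answers)

def summarize(trials_by_variation):
    """Summarize trials given as {name: (list_of_answers, expected_answer)}."""
    rates = {
        name: correct_rate(answers, expected)
        for name, (answers, expected) in trials_by_variation.items()
    }
    rate_values = list(rates.values())
    return {
        "per_variation": rates,
        "mean_rate": mean(rate_values),
        # Population variance of the per-variation correct rates.
        "variance_across_variations": pvariance(rate_values),
        # Frequency distribution of numeric outputs for each variation.
        "answer_counts": {
            name: Counter(answers)
            for name, (answers, _) in trials_by_variation.items()
        },
    }

# Toy data, invented for illustration: always right on one variation,
# always wrong on the other.
toy = {
    "variation_a": ([7, 7, 7, 7], 7),
    "variation_b": ([4, 4, 4, 4], 5),
}
summary = summarize(toy)
```

In the toy case the mean correct rate is 0.5, which looks moderate, while the per-variation rates are 1.0 and 0.0. Reporting the spread alongside the mean is what surfaces that kind of swing.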

Datasets

  • AIW (Alice in Wonderland) variations
  • AIW Light control problems
  • AIW+, AIW Ext, AIW Friends, AIW Colleague Circles

Benchmarks

  • MMLU
  • GSM8K
  • ARC
  • HellaSwag
  • MATH
  • AIME