Overview
The paper presents robust empirical evidence that many models reason fragilely: correct reasoning chains appear, but they cannot be elicited reliably and they break under trivial, irrelevant changes to the problem.
Citations: 20
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 20%
Production readiness: 20%
Novelty: 40%
Why It Matters For Business
High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.
Summary TLDR
The authors introduce a tiny, unambiguous common-sense problem ("Alice has N brothers and M sisters; how many sisters does Alice's brother have?") and many closely related variations. Across hundreds of trials they find that most state-of-the-art LLMs either fail altogether or show wildly different success rates on variations that do not change problem structure. Control tasks show models can parse language and do arithmetic, so failures point to brittle generalization and fragile reasoning. Standard benchmarks (e.g., MMLU, GSM8K) often miss these defects. The paper releases code and raw responses and proposes AIW-style variation testing as a compact stress test for reasoning robustness.
Problem Statement
Do modern LLMs robustly generalize basic, everyday reasoning? The paper asks whether top models can reliably solve a very short, clear common-sense math word problem and behave consistently when only irrelevant numeric details change. If models truly generalize, answers should be stable across these trivial variations.
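To make the problem concrete, here is a minimal Python sketch of the template and its ground truth. The names `AIW_TEMPLATE`, `aiw_prompt`, `aiw_answer`, and `aiw_variations` are hypothetical, and the wording is a paraphrase rather than the authors' exact prompt. It assumes the usual reading of the problem: Alice is female and all siblings share the same family, so each brother has Alice's M sisters plus Alice herself.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Hypothetical template paraphrasing the problem statement.
AIW_TEMPLATE = (
    "Alice has {n} brothers and {m} sisters. "
    "How many sisters does Alice's brother have?"
)


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the template with concrete sibling counts."""
    return AIW_TEMPLATE.format(n=n_brothers, m=m_sisters)


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Ground truth under the stated assumption: M sisters plus Alice.

    The number of brothers is irrelevant to the answer, which is
    exactly what makes it a structure-preserving distractor.
    """
    return m_sisters + 1


def aiw_variations(
    brother_counts: Sequence[int], sister_counts: Sequence[int]
) -> List[Tuple[str, int]]:
    """Build (prompt, expected answer) pairs for every number combination."""
    return [
        (aiw_prompt(n, m), aiw_answer(n, m))
        for n, m in product(brother_counts, sister_counts)
    ]
```

Because only the numbers change, every pair returned by `aiw_variations` has the same logical structure, so a model that truly generalizes should score about the same on all of them.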
Main Contribution
Define AIW: a minimal common-sense problem template and multiple natural variations that preserve problem structure.
Systematically evaluate many SOTA models (closed and open) using repeated trials per variation and three prompt styles.
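The evaluation protocol described above (repeated trials per variation, crossed with prompt styles) can be sketched as follows. This is not the authors' released harness: the three entries in `PROMPT_STYLES` are hypothetical placeholders, `model` stands for any callable that maps a prompt string to a response string, and taking the last integer in the response is a crude answer-extraction heuristic chosen only for illustration.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical prompt styles; the paper's actual three styles are not reproduced here.
PROMPT_STYLES: Dict[str, str] = {
    "plain": "{problem}",
    "step_by_step": "{problem} Think step by step, then state the final number.",
    "answer_only": "{problem} Reply with the final number only.",
}


def extract_final_int(response: str) -> Optional[int]:
    """Return the last integer mentioned in a response, or None if there is none."""
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None


def correct_response_rate(
    model: Callable[[str], str], prompt: str, expected: int, trials: int
) -> float:
    """Fraction of `trials` separate calls whose final integer equals `expected`."""
    hits = sum(extract_final_int(model(prompt)) == expected for _ in range(trials))
    return hits / trials


def evaluate(
    model: Callable[[str], str],
    variations: Dict[str, Tuple[str, int]],
    trials: int = 30,
) -> Dict[Tuple[str, str], float]:
    """Correct response rate for every (variation, prompt style) pair."""
    return {
        (var_name, style_name): correct_response_rate(
            model, style.format(problem=problem), expected, trials
        )
        for var_name, (problem, expected) in variations.items()
        for style_name, style in PROMPT_STYLES.items()
    }
```

Repeating each prompt matters because sampled outputs vary from call to call; a single trial per variation cannot distinguish a model that is right 10% of the time from one that is right 90% of the time.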
Key Findings
Most SOTA models fail or perform inconsistently on a simple common-sense problem.
Performance swings dramatically across trivial, structure-preserving variations.
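One way to quantify such swings is to summarize how far per-variation correct response rates spread for a single model. The sketch below is an illustrative metric, not one defined in the paper; `variation_spread` is a hypothetical helper, and the rates in the usage example are made-up numbers, not reported results.

```python
from statistics import mean, pstdev
from typing import Dict


def variation_spread(rates: Dict[str, float]) -> Dict[str, float]:
    """Summarize how much correct response rates differ across variations.

    `rates` maps a variation name to a rate in [0, 1]. A robust model
    should show a small range and standard deviation, because the
    variations share the same problem structure.
    """
    values = list(rates.values())
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": pstdev(values),
    }


# Made-up rates for illustration only.
example = variation_spread({"var1": 0.9, "var2": 0.1, "var3": 0.5})
```

Reporting the range alongside the mean keeps an average from hiding a model that solves one variation almost always and another almost never.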
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| correct response rate (averaged over AIW variations 1-4 and prompts) | GPT-4o = 0.649 | — | — | AIW variations 1-4 (avg) | Fig. 3, Sec. 3.1 | — |
| correct response rate (averaged over AIW variations 1-4 and prompts) | Claude 3 Opus = 0.431 | — | — | AIW variations 1-4 (avg) | Fig. 3, Sec. 3.1 | — |
What To Try In 7 Days
Run an AIW-style test suite (simple templates plus many number permutations) against your production model.
Add control tasks to confirm failures are high-level (not parsing or arithmetic).
Treat fluent confidence as unreliable; add deterministic verification or a secondary verifier for critical numeric outputs.
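The control-task and verification steps in the list above could look roughly like this. Everything here is an assumption made for illustration: the two control prompts are hypothetical and are not the paper's control tasks, and `verified_answer` only works when the correct value can be recomputed independently, as it can for templated problems.

```python
import re
from typing import Dict, Optional, Tuple


def control_tasks(n_brothers: int, m_sisters: int) -> Dict[str, Tuple[str, int]]:
    """Hypothetical control prompts with known answers, reusing the AIW entities."""
    return {
        "parsing": (
            f"Alice has {n_brothers} brothers and {m_sisters} sisters. "
            "How many brothers does Alice have?",
            n_brothers,
        ),
        "arithmetic": (f"What is {m_sisters} + 1?", m_sisters + 1),
    }


def verified_answer(response: str, expected: int) -> Optional[int]:
    """Return the model's final integer only if it matches the recomputed value."""
    numbers = re.findall(r"-?\d+", response)
    if not numbers:
        return None
    answer = int(numbers[-1])
    return answer if answer == expected else None


def diagnose(main_correct: bool, controls_correct: Dict[str, bool]) -> str:
    """Classify a result using the control tasks as a low-level sanity check."""
    if main_correct:
        return "pass"
    failed = sorted(name for name, ok in controls_correct.items() if not ok)
    if failed:
        return "low-level failure: " + ", ".join(failed)
    return "high-level reasoning failure"
```

The verifier deliberately ignores how confident or fluent the explanation sounds: an answer is accepted only when it agrees with a value computed outside the model.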
Risks & Boundaries
Limitations
Possible train/test leakage for the publicly available original AIW instances (the authors note evidence that fine-tuned models benefit on some AIW variants).
AIW family is narrow: focuses on relational counting templates and small-number variations, not broad cognitive tasks.
When Not To Use
As the only evaluation for model capability — AIW is a focused stress test, not a comprehensive benchmark for all tasks.
To evaluate tokenization/low-level parsing — control tasks already show those are not the issue.
Failure Modes
Fragile generalization: correct solution appears inconsistently across trivial variations.
Overconfidence and confabulation: fluent but wrong explanations that mislead users.

