Overview
Production Readiness
0.2
Novelty Score
0.4
Cost Impact Score
0.2
Citation Count
20
Why It Matters For Business
High benchmark scores can hide brittle model behavior. Simple checks with structure-preserving variations catch failures that matter for reliability, safety, and customer trust.
Summary TLDR
The authors introduce a tiny, unambiguous common-sense problem ("Alice has N brothers and M sisters; how many sisters does Alice's brother have?") and many closely related variations. Across hundreds of trials they find that most state-of-the-art LLMs either fail altogether or show wildly different success rates on variations that do not change problem structure. Control tasks show models can parse language and do arithmetic, so failures point to brittle generalization and fragile reasoning. Standard benchmarks (e.g., MMLU, GSM8K) often miss these defects. The paper releases code and raw responses and proposes AIW-style variation testing as a compact stress test for reasoning robustness.
Problem Statement
Do modern LLMs robustly generalize basic, everyday reasoning? The paper asks whether top models can reliably solve a very short, clear common-sense math word problem and behave consistently when only the numbers change and the problem structure stays the same. If models truly generalize, success rates should be stable across these trivial variations.
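To make the task concrete, here is a minimal Python sketch of the problem template and its ground truth. The template wording follows the problem as quoted in this summary; the exact prompt text used in the paper may differ, and the function names are hypothetical.

```python
from itertools import product

# Wording taken from the problem as quoted in this summary (assumption:
# the paper's actual prompts may phrase it differently).
TEMPLATE = (
    "Alice has {n} brothers and {m} sisters. "
    "How many sisters does Alice's brother have?"
)


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the template with concrete numbers."""
    return TEMPLATE.format(n=n_brothers, m=m_sisters)


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Each brother's sisters are Alice's sisters plus Alice herself.

    The number of brothers does not affect the answer.
    """
    return m_sisters + 1


def aiw_variations(brother_counts, sister_counts):
    """Return (prompt, expected_answer) pairs for every number combination."""
    return [
        (aiw_prompt(n, m), aiw_answer(n, m))
        for n, m in product(brother_counts, sister_counts)
    ]


if __name__ == "__main__":
    # Counts start at 2 so the plural wording in the template stays correct.
    for prompt, expected in aiw_variations([2, 3], [2, 4, 6]):
        print(expected, "<-", prompt)
```

Every variation produced this way has the same structure, so a model that has actually generalized the reasoning should succeed at a similar rate on all of them.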
Main Contribution
Define AIW: a minimal common-sense problem template and multiple natural variations that preserve problem structure.
Systematically evaluate many SOTA models (closed and open) using repeated trials per variation and three prompt styles.
Run control experiments (AIW Light family) that rule out low-level failures like tokenization or arithmetic.
Introduce a simple unified robustness score R that penalizes both low accuracy and uneven performance across variations.
Release code and raw response data to reproduce experiments.
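This summary does not give the formula for R, so the following Python sketch is not the paper's definition. It is a hypothetical stand-in (mean correct rate minus the spread across variations) that only illustrates the idea of penalizing both low accuracy and uneven performance. The function name and the example rates are made up.

```python
from statistics import mean, pstdev


def illustrative_robustness(rates_per_variation):
    """Hypothetical score, NOT the paper's R.

    Takes one correct response rate (0..1) per variation and subtracts
    the population standard deviation from the mean, clamped at 0.
    Low accuracy lowers the mean; uneven performance raises the spread.
    """
    rates = list(rates_per_variation)
    return max(0.0, mean(rates) - pstdev(rates))


if __name__ == "__main__":
    # Made-up rates with the same average but different consistency.
    even = [0.8, 0.8, 0.8, 0.8]
    uneven = [1.0, 1.0, 0.6, 0.6]
    print(illustrative_robustness(even))
    print(illustrative_robustness(uneven))
```

Two models with the same average accuracy get different scores here, which is the property the summary attributes to R.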
Key Findings
Most SOTA models fail or perform inconsistently on a simple common-sense problem.
Performance swings dramatically across trivial, structure-preserving variations.
Failures are not due to low-level parsing or basic math.
Wrong answers are often accompanied by confident, plausible-sounding explanations (confabulations).
Results
correct response rate (averaged over AIW variations 1-4 and prompts)
control-task correctness (AIW Light problems)
unified robustness score R (example best)
Who Should Care
What To Try In 7 Days
Run an AIW-style test suite (simple templates plus many number permutations) against your production model.
Add control tasks to confirm failures are high-level (not parsing or arithmetic).
Treat fluent confidence as unreliable; add deterministic verification or a secondary verifier for critical numeric outputs.
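The steps above can be sketched as a small harness. This is a minimal Python sketch under stated assumptions: `ask_model` is a hypothetical callable that sends a prompt to your model and returns its text, and the prompt is assumed to instruct the model to end with a line of the form "### Answer: <number>". Neither is taken from the paper's released code.

```python
import re
from typing import Callable, Optional

# Assumed final-answer format; adjust to whatever your prompt requests.
ANSWER_RE = re.compile(r"###\s*Answer:\s*(-?\d+)")


def extract_answer(response: str) -> Optional[int]:
    """Return the last number given in the assumed answer format, if any."""
    matches = ANSWER_RE.findall(response)
    return int(matches[-1]) if matches else None


def correct_response_rate(
    ask_model: Callable[[str], str],
    prompt: str,
    expected: int,
    trials: int = 30,
) -> float:
    """Query the model repeatedly and score only the parsed final answer.

    The explanation text is ignored on purpose: a fluent rationale does
    not count as evidence that the answer is right.
    """
    hits = sum(
        extract_answer(ask_model(prompt)) == expected for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Stub standing in for a real model call (hypothetical output).
    def stub_model(prompt: str) -> str:
        return "Alice's brother has the same sisters as Alice. ### Answer: 6"

    prompt = (
        "Alice has 3 brothers and 6 sisters. "
        "How many sisters does Alice's brother have?"
    )
    print(correct_response_rate(stub_model, prompt, expected=7, trials=5))
```

Running the same function over control prompts with known answers helps separate high-level reasoning failures from parsing or arithmetic problems.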
Reproducibility
Code Urls
- AIW repo (referenced in paper; raw responses and code released)
Data Urls
- AIW repo (collected raw response data referenced in paper)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Possible training/test leakage for public AIW original instances (authors note evidence for fine-tuned models benefiting on some AIW variants).
- AIW family is narrow: focuses on relational counting templates and small-number variations, not broad cognitive tasks.
- Some claims (e.g., the frequency of confabulation) are reported only qualitatively; not all failure modes have exact rates.
When Not To Use
- As the only evaluation for model capability: AIW is a focused stress test, not a comprehensive benchmark for all tasks.
- To evaluate tokenization or low-level parsing: the control tasks already show those are not where models fail.
- For multimodal or long-horizon planning capabilities; AIW targets short, symbolic/common-sense reasoning.
Failure Modes
- Fragile generalization: correct solution appears inconsistently across trivial variations.
- Overconfidence and confabulation: fluent but wrong explanations that mislead users.
- Outlier-driven averages: single-variation successes can mask overall brittleness.
- Resistance to self-correction: models often fail to revise wrong answers when prompted.
Core Entities
Models
- GPT-4o
- GPT-4
- GPT-4-0613
- GPT-4-turbo
- GPT-3.5-turbo
- Claude 3 Opus
- Claude 3.5 Sonnet
- Llama 3.1 405B
- Llama 3 70B
- Llama 3 8B
- Llama 2 70B
- Mistral-7B
- Mixtral
- Qwen 2.5 72B
- Command R+
- Dbrx Instruct
- DeepSeek R1
- o1-preview
- o1-mini
- NuminaMath-7B
Metrics
- correct response rate
- unified robustness score R
- frequency distribution of numeric outputs
- variance across variations
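As a rough illustration of the last two metrics, the Python sketch below tallies the numeric outputs for one variation and computes the variance of correct response rates across variations. The variation labels and answer lists are made-up data, and the function names are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def output_distribution(parsed_answers):
    """Count how often each numeric answer appears for one variation."""
    return Counter(parsed_answers)


def rates_and_variance(answers_by_variation, expected_by_variation):
    """Return per-variation correct rates and their population variance."""
    rates = {
        name: sum(a == expected_by_variation[name] for a in answers)
        / len(answers)
        for name, answers in answers_by_variation.items()
    }
    return rates, pvariance(list(rates.values()))


if __name__ == "__main__":
    # Made-up parsed answers from four trials on two variations.
    answers = {"v1": [7, 7, 6, 7], "v2": [6, 6, 6, 7]}
    expected = {"v1": 7, "v2": 7}
    print(output_distribution(answers["v2"]))
    print(rates_and_variance(answers, expected))
```

A distribution concentrated on one wrong number suggests a systematic reasoning error rather than noise, while a high variance across variations signals the brittleness the paper describes.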
Datasets
- AIW (Alice in Wonderland) variations
- AIW Light control problems
- AIW+, AIW Ext, AIW Friends, AIW Colleague Circles
Benchmarks
- MMLU
- GSM8K
- ARC
- HellaSwag
- MATH
- AIME

