Relying on LLMs' apparent commonsense reasoning can be risky: models may fail when a task is changed in small, realistic ways, and can then produce misleading outputs in user-facing scenarios.
Key finding
When an opaque container is made transparent, so that the agent can see what is inside, GPT-3.5 attributes the wrong belief to the agent about the container's contents.
Numbers (Variation 1A): P(chocolate) = 95% vs. P(popcorn) = 0%.

