Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.3
Citation Count
7
Why It Matters For Business
If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.
Summary TLDR
This paper documents a reproducible failure mode in multimodal LLMs: when prompted to generate an image with an explicit exclusion (a prompt using 'no' or another negation), popular models often include the excluded element anyway. The authors tested GPT-4, Gemini, Copilot and Llama-3 on five negation prompts in English, Hindi and French, ran each prompt five times, and measured percent incorrect and semantic entropy. They found frequent errors, plus mismatches between the model's textual claim and the generated image. Friedman and Wilcoxon tests showed trends between models, but none reached significance at p<0.05. As a mitigation direction, the authors suggest a negation-aware feedback loop between text and image generation.
Problem Statement
Modern multimodal LLMs frequently fail to follow simple exclusion instructions in image-generation prompts (phrases using 'no' or negation), producing images that still contain the excluded items and sometimes reporting in text that the exclusion was followed when the image contradicts it.
Main Contribution
Identified and named the 'NO Syndrome': a pattern in which multimodal LLMs ignore explicit negation constraints in image prompts.
Ran a systematic test: five negation prompts, each run five times, on GPT-4, Gemini and Copilot, with additional checks on Llama-3, in English, Hindi and French.
Quantified failures with per-prompt incorrect counts and semantic entropy, and ran Friedman and Wilcoxon tests to compare models.
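The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each run has been manually labeled as correct or incorrect, and it approximates semantic entropy as Shannon entropy over hand-assigned response clusters. The function names, cluster labels and sample data are all hypothetical.

```python
import math
from collections import Counter


def percent_incorrect(outcomes):
    """outcomes: list of bools, True when the image contained the excluded item."""
    if not outcomes:
        raise ValueError("need at least one run")
    return 100.0 * sum(outcomes) / len(outcomes)


def cluster_entropy(labels):
    """Shannon entropy in bits over the frequencies of response clusters.

    Assumption: responses were already grouped into meaning-equivalent
    clusters; the summary does not say how the paper did this grouping.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical labels for one prompt run five times (not data from the paper).
runs_incorrect = [True, True, False, True, True]
run_clusters = ["tusks_visible", "tusks_visible", "no_tusks",
                "tusks_visible", "small_tusks"]

print(percent_incorrect(runs_incorrect))
print(round(cluster_entropy(run_clusters), 3))
```

An entropy of zero means every run fell into the same cluster; higher values mean the model's behavior varied more from run to run.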
Key Findings
For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.
Popular LLMs (GPT-4, Copilot) frequently produced incorrect images on negation prompts.
Textual confirmation and image output often disagreed: the model's text said the exclusion was followed while the image contained the excluded element.
Model-level differences trended but were not statistically significant at p<0.05.
Copilot showed higher response variability (entropy) and more consistent incorrect responses for some languages.
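To make the model comparison concrete, here is a rough sketch of the Friedman statistic, treating each prompt as a block and each model as a treatment. The summary does not say which software the authors used; this version uses only the standard library, omits the tie correction that statistics packages typically apply, and runs on made-up counts. In practice a library routine such as `scipy.stats.friedmanchisquare` (and `scipy.stats.wilcoxon` for the pairwise follow-up) would be the more usual choice.

```python
def average_ranks(values):
    """Rank values ascending; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """blocks: one row per prompt, one column per model (no tie correction)."""
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical incorrect counts out of five runs (rows: prompts, columns: models).
# These numbers are placeholders, not results from the paper.
hypothetical_counts = [
    [5, 4, 5],
    [3, 2, 4],
    [4, 4, 5],
    [2, 1, 3],
    [5, 5, 5],
]
print(friedman_statistic(hypothetical_counts))
```

The statistic is compared against a chi-square distribution with (number of models - 1) degrees of freedom. With only five prompts that approximation is coarse, which fits the finding that differences trended without reaching p<0.05.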
Results
Incorrect image counts per prompt (GPT-4)
Incorrect image counts per prompt (Copilot)
Friedman test (models on English prompts)
Wilcoxon signed-rank (GPT-4 vs Copilot)
Who Should Care
What To Try In 7 Days
Run a small negation test-suite (your critical exclusion cases) and record failure rates.
Add a visual verification step: run a lightweight vision detector to check for excluded items in generated images.
If failures occur, implement a rule-based fallback (reject or re-prompt) and log examples for model vendor feedback.
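The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `generate` and `detect` are hypothetical callables standing in for whatever image model and vision detector you use, and the stubs at the bottom exist only to make the example runnable. The key design choice is that the decision relies on the detector's labels, never on the model's own text confirmation.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class ExclusionResult:
    image: Optional[bytes]          # None when every attempt violated the exclusion
    attempts: int
    violations: List[Set[str]] = field(default_factory=list)  # log for vendor feedback


def generate_with_exclusion_check(
    prompt: str,
    excluded: Iterable[str],
    generate: Callable[[str], bytes],      # hypothetical image generator
    detect: Callable[[bytes], Set[str]],   # hypothetical detector returning labels
    max_attempts: int = 3,
) -> ExclusionResult:
    """Re-prompt until the detector finds no excluded label, else reject."""
    excluded_set = {e.lower() for e in excluded}
    violations: List[Set[str]] = []
    for attempt in range(1, max_attempts + 1):
        image = generate(prompt)
        found = {label.lower() for label in detect(image)} & excluded_set
        if not found:
            return ExclusionResult(image, attempt, violations)
        violations.append(found)
    return ExclusionResult(None, max_attempts, violations)


# Stub generator and detector, for illustration only.
_outputs = iter([b"with-tusks", b"with-tusks", b"clean"])


def fake_generate(prompt: str) -> bytes:
    return next(_outputs)


def fake_detect(image: bytes) -> Set[str]:
    return {"Tusks", "elephant"} if image == b"with-tusks" else {"elephant"}


result = generate_with_exclusion_check(
    "Generate an image of an elephant with no tusks",
    ["tusks"],
    fake_generate,
    fake_detect,
)
print(result.attempts, result.image is not None, result.violations)
```

Running the same function over your critical exclusion prompts and counting how often `violations` is non-empty gives the failure rate for the small test suite.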
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only a few LLMs were tested (GPT-4, Gemini, Copilot, Llama-3), so results may not generalize to other or future models.
- Limited languages (English, Hindi, French) and a small number of prompts and runs.
- No causal analysis; the paper does not identify why the failure happens or which subsystem (text vs image) causes it.
- Model updates after the study may change outcomes.
When Not To Use
- Do not rely solely on LLM image outputs for exclusion-sensitive tasks (legal, safety, brand compliance).
- Avoid using the model's text confirmation as a guarantee the image obeys constraints.
Failure Modes
- Images include items that were explicitly excluded by the prompt.
- Textual output claims the exclusion was followed while the image contradicts that claim.
- Language support gaps: some models refused or failed on non-English prompts.
Core Entities
Models
- GPT-4
- Gemini
- Copilot
- Llama-3
Metrics
- percent incorrect
- semantic entropy
Context Entities
Metrics
- Friedman test p-value
- Wilcoxon signed-rank p-value

