Overview
The paper documents an important failure pattern with clear examples and counts, but the limited model set, small sample size and lack of root-cause analysis lower production readiness and evidence strength.
Citations: 7
Evidence Strength: 0.50
Confidence: 0.75
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/4
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 30%
Production readiness: 40%
Novelty: 40%
Why It Matters For Business
If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.
Summary TLDR
This paper documents a reproducible failure mode in multimodal LLMs: when a prompt asks for an image with an explicit exclusion (phrased with 'no' or another negation), popular models often include the excluded element anyway. The authors tested GPT-4, Gemini, Copilot and Llama-3 on five negation prompts in English, Hindi and French, ran each prompt five times, and measured the percentage of incorrect images and semantic entropy. They found frequent errors, plus mismatches between the model's textual claim and the image it actually generated. Statistical tests showed trends but not strong significance. As a mitigation direction, the authors suggest a negation-aware feedback loop between text and image generation.
Problem Statement
Modern multimodal LLMs frequently fail to follow simple exclusion instructions in image-generation prompts (phrases using 'no' or negation), producing images that still contain the excluded items and sometimes reporting in text that the exclusion was followed when the image contradicts it.
Main Contribution
Identified and named the 'NO Syndrome': a pattern where multimodal LLMs ignore explicit negation constraints in image prompts.
Systematic test: five negation prompts, each run five times on GPT-4, Gemini, and Copilot, plus additional checks with Llama-3, in English, Hindi, and French.
Key Findings
For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.
Popular LLMs frequently produced incorrect images on negation prompts: on the English prompts in Table 1, GPT-4 was incorrect in 19 of 25 runs and Copilot in 20 of 25.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Incorrect image counts per prompt (GPT-4) | Q1:4/5, Q2:5/5, Q3:5/5, Q4:3/5, Q5:2/5 | — | — | English prompts (Table 1) | Table 1; Section 3.1 | Table 1 |
| Incorrect image counts per prompt (Copilot) | Q1:4/5, Q2:5/5, Q3:4/5, Q4:3/5, Q5:4/5 | — | — | English prompts (Table 1) | Table 1; Section 3.1 | Table 1 |
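
The per-prompt counts above can be rolled up into one failure rate per model. The following is a minimal sketch that only restates the table's numbers; the variable and function names are illustrative and do not come from the paper.

```python
# Incorrect-image counts out of 5 runs per prompt, copied from the Results table.
RUNS_PER_PROMPT = 5
incorrect = {
    "GPT-4":   {"Q1": 4, "Q2": 5, "Q3": 5, "Q4": 3, "Q5": 2},
    "Copilot": {"Q1": 4, "Q2": 5, "Q3": 4, "Q4": 3, "Q5": 4},
}


def failure_rate(counts, runs_per_prompt=RUNS_PER_PROMPT):
    """Share of all runs (prompts x runs per prompt) that produced an incorrect image."""
    total_runs = runs_per_prompt * len(counts)
    return sum(counts.values()) / total_runs


rates = {model: failure_rate(counts) for model, counts in incorrect.items()}
for model, rate in rates.items():
    print(f"{model}: {rate:.0%} of runs incorrect")
```

With only 25 runs per model, these rates are coarse estimates, which is consistent with the limitations noted for the study.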
What To Try In 7 Days
Run a small negation test-suite (your critical exclusion cases) and record failure rates.
Add a visual verification step: run a lightweight vision detector to check for excluded items in generated images.
If failures occur, implement a rule-based fallback (reject or re-prompt) and log examples for model vendor feedback.
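
The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the paper's method: `generate` and `detect` are hypothetical callables that you would back with your own image model and vision detector, and the retry limit of 3 is an arbitrary choice.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class NegationCase:
    prompt: str          # e.g. "an elephant with no tusks"
    excluded: Set[str]   # labels that must NOT appear in the image


@dataclass
class Outcome:
    image: Optional[bytes]   # None means every attempt was rejected
    attempts: int
    violations: List[Set[str]] = field(default_factory=list)  # one entry per failed attempt


def generate_with_check(
    case: NegationCase,
    generate: Callable[[str], bytes],      # hypothetical wrapper around your image model
    detect: Callable[[bytes], Set[str]],   # hypothetical wrapper around your vision detector
    max_attempts: int = 3,                 # assumed retry budget
) -> Outcome:
    """Re-prompt until the detector finds no excluded label, else reject."""
    violations: List[Set[str]] = []
    for attempt in range(1, max_attempts + 1):
        image = generate(case.prompt)
        found = detect(image) & case.excluded
        if not found:
            return Outcome(image, attempt, violations)
        violations.append(found)  # keep the evidence for vendor feedback
    return Outcome(None, max_attempts, violations)


def failure_rate(outcomes: List[Outcome]) -> float:
    """Fraction of cases where no attempt passed verification."""
    return sum(o.image is None for o in outcomes) / len(outcomes)
```

The check relies on the detector's labels rather than on the model's own text confirmation, since the paper reports that the confirmation can contradict the image. The detector can itself miss items, so a passing check reduces risk but does not guarantee compliance.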
Risks & Boundaries
Limitations
Only a few LLMs tested (GPT-4, Gemini, Copilot, Llama-3) so results may not generalize to other or future models.
Limited languages (English, Hindi, French) and a small number of prompts and runs.
When Not To Use
Do not rely solely on LLM image outputs for exclusion-sensitive tasks (legal, safety, brand compliance).
Avoid using the model's text confirmation as a guarantee that the image obeys constraints.
Failure Modes
Images include items that were explicitly excluded by the prompt.
Textual output claims the exclusion was followed while the image contradicts that claim.
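
When logging test runs, it may help to separate these two failure modes, because a failure that the model's text also denies is harder to catch downstream. A minimal sketch follows; the category names are illustrative, and both boolean inputs are assumed to come from your own checks (for example, a review of the text reply and a vision detector on the image).

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_run(text_claims_compliance: bool, excluded_item_detected: bool) -> str:
    """Label one run by whether the image and the text reply agree."""
    if excluded_item_detected:
        # The image violates the exclusion; check whether the text admits it.
        return "silent_failure" if text_claims_compliance else "acknowledged_failure"
    return "ok"


def tally(runs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count outcomes over (text_claims_compliance, excluded_item_detected) pairs."""
    return Counter(classify_run(claims, detected) for claims, detected in runs)
```

A high share of `silent_failure` runs would indicate that the text reply cannot serve as a verification signal for that prompt.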

