Many top multimodal LLMs ignore explicit 'no' constraints and still draw the excluded object

Overview

Decision Snapshot: Needs Validation

The paper documents an important failure pattern with clear examples and counts, but the limited model set, small sample size and lack of root-cause analysis lower production readiness and evidence strength.

Citations: 7

Evidence Strength: 0.50

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain

Why It Matters For Business

If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

This paper documents a reproducible failure mode in multimodal LLMs: when a prompt asks for an image with an explicit exclusion (using 'no' or another negation), popular models often draw the excluded element anyway. The authors tested GPT-4, Gemini, Copilot and Llama-3 on five negation prompts in English, Hindi and French, running each prompt five times. They measured the percentage of incorrect images and semantic entropy, and found frequent errors as well as mismatches between the model's textual claim and the generated image. Statistical tests showed trends but no strong significance. As a mitigation direction, the authors suggest a negation-aware feedback loop between text and image generation.
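This summary does not spell out how the paper computes semantic entropy. As a rough illustration only, and assuming the metric behaves like Shannon entropy over the distinct outcome categories seen across repeated runs of one prompt, a sketch could look like this (the function name and the category labels are hypothetical, not taken from the paper):

```python
import math
from collections import Counter

def outcome_entropy(outcomes):
    """Shannon entropy (in bits) of the outcome categories from repeated runs.

    ASSUMPTION: this stands in for the paper's semantic entropy; the paper's
    exact definition is not given in this summary.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    counts = Counter(outcomes)
    # Sum -p * log2(p) over each distinct category.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Five runs of one prompt, each labelled by a reviewer (labels are illustrative).
consistent_runs = ["excluded_item_present"] * 5
mixed_runs = ["excluded_item_present"] * 4 + ["excluded_item_absent"]

print(outcome_entropy(consistent_runs))  # zero: every run behaved the same way
print(outcome_entropy(mixed_runs))       # above zero: behaviour varied across runs
```

Under this reading, low entropy combined with a high error rate would mean the model fails consistently rather than at random.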

Problem Statement

Modern multimodal LLMs frequently fail to follow simple exclusion instructions in image-generation prompts (phrases using 'no' or negation), producing images that still contain the excluded items and sometimes reporting in text that the exclusion was followed when the image contradicts it.

Main Contribution

Identified and named the 'NO Syndrome'—a pattern where multimodal LLMs ignore explicit negation constraints in image prompts.

Systematic test: five negation prompts run five times across GPT-4, Gemini, Copilot, plus additional checks with Llama-3 in English, Hindi, and French.

Key Findings

For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.

Numbers: 0/5 correct across tested runs and languages (Section 3.6; Table 1)

Practical Use: Do not trust LLM image outputs to enforce critical exclusions; add verification or fallback when exclusion is required.

Evidence Ref: Section 3.6; Table 1

Popular LLMs (GPT-4, Copilot) frequently produced incorrect images on negation prompts.

Numbers: GPT-4 produced 4–5 incorrect images out of 5 on several queries (Table 1)

Practical Use: Expect high failure rates on many 'no' prompts; test your specific exclusion cases before deployment.

Evidence Ref: Table 1; Section 3.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Incorrect image counts per prompt (GPT-4)	Q1:4/5, Q2:5/5, Q3:5/5, Q4:3/5, Q5:2/5	—	—	English prompts (Table 1)	Table 1; Section 3.1	Table 1
Incorrect image counts per prompt (Copilot)	Q1:4/5, Q2:5/5, Q3:4/5, Q4:3/5, Q5:4/5	—	—	English prompts (Table 1)	Table 1; Section 3.1	Table 1
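
To compare the two models at a glance, the per-prompt counts in the table can be pooled into one percentage per model. The pooled figures are derived here from the table rows and are not reported in this summary; a minimal sketch:

```python
# Incorrect-image counts out of 5 runs per prompt, copied from the table above.
RUNS_PER_PROMPT = 5
incorrect_counts = {
    "GPT-4":   {"Q1": 4, "Q2": 5, "Q3": 5, "Q4": 3, "Q5": 2},
    "Copilot": {"Q1": 4, "Q2": 5, "Q3": 4, "Q4": 3, "Q5": 4},
}

def percent_incorrect(counts, runs_per_prompt=RUNS_PER_PROMPT):
    """Share of incorrect images over all prompts, as a percentage."""
    total_runs = runs_per_prompt * len(counts)
    return 100.0 * sum(counts.values()) / total_runs

for model, counts in incorrect_counts.items():
    print(f"{model}: {percent_incorrect(counts):.0f}% incorrect")
```

With the table's counts this prints 76% for GPT-4 (19 of 25 runs) and 80% for Copilot (20 of 25 runs). At 25 runs per model these rates are coarse and should be read as indicative only.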

What To Try In 7 Days

Run a small negation test-suite (your critical exclusion cases) and record failure rates.

Add a visual verification step: run a lightweight vision detector to check excluded items in generated images.

If failures occur, implement a rule-based fallback (reject or re-prompt) and log examples for model vendor feedback.
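
The three steps above can be sketched as a single generate, verify, retry loop. This is a minimal sketch under stated assumptions, not a reference implementation: `generate_image` and `detect_labels` are hypothetical callables that you would wire to your own image model and vision detector, and the stubs at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class ExclusionResult:
    accepted: bool   # True if some attempt contained no excluded item
    attempts: int    # number of generations performed
    violations: List[Set[str]] = field(default_factory=list)  # hits per failed attempt, for logging

def generate_with_exclusions(
    prompt: str,
    excluded: Set[str],
    generate_image: Callable[[str], bytes],      # hypothetical wrapper around your image model
    detect_labels: Callable[[bytes], Set[str]],  # hypothetical wrapper around your vision detector
    max_attempts: int = 3,
) -> ExclusionResult:
    """Re-prompt until the detector finds no excluded item, else reject."""
    violations: List[Set[str]] = []
    for attempt in range(1, max_attempts + 1):
        image = generate_image(prompt)
        found = detect_labels(image) & excluded
        if not found:
            return ExclusionResult(True, attempt, violations)
        violations.append(found)  # keep the evidence for vendor feedback
    return ExclusionResult(False, max_attempts, violations)

# Stubs standing in for real services (assumptions for demonstration only).
def stub_generator(prompt: str) -> bytes:
    return b"fake-image-bytes"

def stub_detector(image: bytes) -> Set[str]:
    return {"elephant", "tusks"}  # simulates a model that always draws tusks

result = generate_with_exclusions(
    "Generate an image of an elephant with no tusks",
    excluded={"tusks"},
    generate_image=stub_generator,
    detect_labels=stub_detector,
)
print(result.accepted, result.attempts, len(result.violations))
```

Running the same loop over your own list of critical exclusion prompts and counting rejected results gives the failure rate the first step asks for. The detector's own error rate bounds how much this check can be trusted, so it is worth spot-checking by hand.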

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Only a few LLMs tested (GPT-4, Gemini, Copilot, Llama-3) so results may not generalize to other or future models.

Limited languages (English, Hindi, French) and a small number of prompts and runs.

When Not To Use

Do not rely solely on LLM image outputs for exclusion-sensitive tasks (legal, safety, brand compliance).

Avoid using the model's text confirmation as a guarantee the image obeys constraints.

Failure Modes

Images include items that were explicitly excluded by the prompt.

Textual output claims the exclusion was followed while the image contradicts that claim.
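
When logging failures, it can help to keep these two modes apart, since a false textual claim is harder for downstream users to notice. A small sketch, assuming you already have a set of detected labels for the image and a boolean for whether the model's text claimed compliance (how that boolean is obtained, for example by manual review, is left open; all names here are hypothetical):

```python
def classify_outcome(claims_compliance: bool, detected: set, excluded: set) -> str:
    """Label one generation as 'ok', 'violation', or 'false_claim'."""
    violated = bool(detected & excluded)
    if violated and claims_compliance:
        return "false_claim"  # text says the exclusion held, the image disagrees
    if violated:
        return "violation"    # image contains an excluded item, no claim made
    return "ok"

print(classify_outcome(True, {"elephant", "tusks"}, {"tusks"}))
print(classify_outcome(False, {"elephant", "tusks"}, {"tusks"}))
print(classify_outcome(True, {"elephant"}, {"tusks"}))
```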
