Many top multimodal LLMs ignore explicit 'no' constraints and still draw the excluded object

August 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper documents an important failure pattern with clear examples and counts, but the limited model set, small sample size and lack of root-cause analysis lower production readiness and evidence strength.

Citations: 7

Evidence Strength: 0.50

Confidence: 0.75

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 40%

Authors

Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain

Why It Matters For Business

If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.

Summary TLDR

This paper documents a reproducible failure mode for multimodal LLMs: when prompted to generate images with explicit exclusions (prompts using 'no' or another negation), popular models often include the excluded element. The authors tested GPT-4, Gemini, Copilot and Llama-3 on five negation prompts in English, Hindi and French, ran each prompt five times, and measured percent incorrect and semantic entropy. They found frequent errors, plus mismatches between the model's textual claim and the image it actually generated. Statistical tests showed trends but not strong significance. The authors suggest adding a negation-aware feedback loop between text and image generation as a mitigation direction.

Problem Statement

Modern multimodal LLMs frequently fail to follow simple exclusion instructions in image-generation prompts (phrases using 'no' or negation), producing images that still contain the excluded items and sometimes reporting in text that the exclusion was followed when the image contradicts it.

Main Contribution

Identified and named the 'NO Syndrome'—a pattern where multimodal LLMs ignore explicit negation constraints in image prompts.

Systematic test: five negation prompts run five times across GPT-4, Gemini, Copilot, plus additional checks with Llama-3 in English, Hindi, and French.

Key Findings

For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.

Numbers: 0/5 correct across tested runs and languages (Section 3.6; Table 1)

Practical Use: Do not trust LLM image outputs to enforce critical exclusions; add verification or fallback when exclusion is required.

Evidence Ref: Section 3.6; Table 1

Popular LLMs (GPT-4, Copilot) frequently produced incorrect images on negation prompts.

Numbers: GPT-4 produced 4 to 5 incorrect images out of 5 on several queries (Table 1)

Practical Use: Expect high failure rates on many 'no' prompts; test your specific exclusion cases before deployment.

Evidence Ref: Table 1; Section 3.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Incorrect image counts per prompt (GPT-4) | Q1: 4/5, Q2: 5/5, Q3: 5/5, Q4: 3/5, Q5: 2/5 | - | - | English prompts (Table 1) | Table 1; Section 3.1 | Table 1
Incorrect image counts per prompt (Copilot) | Q1: 4/5, Q2: 5/5, Q3: 4/5, Q4: 3/5, Q5: 4/5 | - | - | English prompts (Table 1) | Table 1; Section 3.1 | Table 1
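
To compare the two rows at a glance, the per-prompt counts can be rolled up into a single percent-incorrect figure per model. The short sketch below does that arithmetic on the English-prompt counts listed above; the aggregate percentages are derived here for illustration and are not figures quoted from the paper.

```python
# Incorrect-image counts out of 5 runs per prompt (Q1..Q5), copied from the rows above.
RUNS_PER_PROMPT = 5
incorrect_counts = {
    "GPT-4": [4, 5, 5, 3, 2],
    "Copilot": [4, 5, 4, 3, 4],
}


def percent_incorrect(counts, runs_per_prompt=RUNS_PER_PROMPT):
    """Return the overall percentage of incorrect images across all prompts."""
    total_runs = runs_per_prompt * len(counts)
    return 100.0 * sum(counts) / total_runs


for model, counts in incorrect_counts.items():
    total = RUNS_PER_PROMPT * len(counts)
    print(f"{model}: {sum(counts)}/{total} incorrect ({percent_incorrect(counts):.0f}%)")
# GPT-4: 19/25 incorrect (76%)
# Copilot: 20/25 incorrect (80%)
```

With only 25 runs per model, these percentages are coarse; a single changed outcome moves the figure by 4 points.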

What To Try In 7 Days

Run a small negation test-suite (your critical exclusion cases) and record failure rates.

Add a visual verification step: run a lightweight vision detector to check excluded items in generated images.

If failures occur, implement a rule-based fallback (reject or re-prompt) and log examples for model vendor feedback.
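
The three steps above can be sketched as one small harness. This is a minimal illustration under stated assumptions, not a vendor API: `generate` and `detect` are hypothetical callables you would wrap around your own image model and vision detector, and `ExclusionCase`, `CaseResult` and `run_case` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Hypothetical interfaces: wrap your own image model and vision detector.
GenerateFn = Callable[[str], bytes]      # prompt -> image bytes
DetectFn = Callable[[bytes], Set[str]]   # image bytes -> labels found in the image


@dataclass
class ExclusionCase:
    prompt: str    # e.g. "Generate an image of an elephant with no tusks"
    excluded: str  # label that must NOT be detected in the output


@dataclass
class CaseResult:
    runs: int = 0
    first_try_failures: int = 0  # excluded item present in the first image
    rejected: int = 0            # still present after all re-prompts

    @property
    def failure_rate(self) -> float:
        return self.first_try_failures / self.runs if self.runs else 0.0


def run_case(case: ExclusionCase, generate: GenerateFn, detect: DetectFn,
             runs: int = 5, max_reprompts: int = 1) -> CaseResult:
    """Generate `runs` images, verify each one, and re-prompt on violations."""
    result = CaseResult()
    for _ in range(runs):
        result.runs += 1
        violated = case.excluded in detect(generate(case.prompt))
        if violated:
            result.first_try_failures += 1
        attempts = 0
        while violated and attempts < max_reprompts:
            # Rule-based fallback: regenerate and verify again.
            violated = case.excluded in detect(generate(case.prompt))
            attempts += 1
        if violated:
            result.rejected += 1  # reject; log this example for vendor feedback
    return result


if __name__ == "__main__":
    # Stub model that always draws the excluded item, to exercise the harness.
    def always_tusks(prompt: str) -> bytes:
        return b"elephant tusks"

    def toy_detector(image: bytes) -> Set[str]:
        return {"tusks"} if b"tusks" in image else set()

    case = ExclusionCase("Generate an image of an elephant with no tusks", "tusks")
    res = run_case(case, always_tusks, toy_detector)
    print(f"failure rate: {res.failure_rate:.0%}, rejected: {res.rejected}/{res.runs}")
```

The harness counts first-try failures separately from rejections so that the recorded failure rate reflects the model's raw behavior rather than the behavior after the fallback. The detector is itself fallible, so its accuracy on your excluded items is worth checking before trusting the numbers.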

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Only a few LLMs tested (GPT-4, Gemini, Copilot, Llama-3) so results may not generalize to other or future models.

Limited languages (English, Hindi, French) and a small number of prompts and runs.

When Not To Use

Do not rely solely on LLM image outputs for exclusion-sensitive tasks (legal, safety, brand compliance).

Avoid using the model's text confirmation as a guarantee the image obeys constraints.

Failure Modes

Images include items that were explicitly excluded by the prompt.

Textual output claims the exclusion was followed while the image contradicts that claim.
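
When logging results, it can help to keep the two failure modes apart, since the second one (a confident text claim over a non-compliant image) is the harder one to catch. A minimal sketch, assuming you already have a boolean from a detector and a boolean for whether the model's text asserts compliance; the `Outcome` names are invented for this example.

```python
from enum import Enum


class Outcome(Enum):
    OK = "image respects the exclusion"
    VISIBLE_FAILURE = "excluded item present; text does not claim success"
    SILENT_FAILURE = "excluded item present; text claims success"


def classify(excluded_detected: bool, text_claims_compliance: bool) -> Outcome:
    """Label one generation by what the image shows versus what the text says."""
    if not excluded_detected:
        return Outcome.OK
    if text_claims_compliance:
        return Outcome.SILENT_FAILURE
    return Outcome.VISIBLE_FAILURE


print(classify(excluded_detected=True, text_claims_compliance=True).name)
# SILENT_FAILURE
```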

Core Entities

Models

GPT-4, Gemini, Copilot, Llama-3

Metrics

percent incorrect, semantic entropy
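
This summary does not state how the paper computes semantic entropy. As a rough stand-in, the sketch below computes Shannon entropy over outcome labels from repeated runs of one prompt: 0 bits when every run behaves the same way, higher when outcomes are mixed. The function name and the labels are assumptions made for illustration, and the paper's own definition may differ.

```python
import math
from collections import Counter


def outcome_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of outcome labels."""
    n = len(labels)
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Five runs of one prompt, labelled by whether the excluded item appeared.
mixed_runs = ["incorrect", "incorrect", "incorrect", "incorrect", "correct"]
uniform_runs = ["incorrect"] * 5

print(f"mixed:   {outcome_entropy(mixed_runs):.3f} bits")
print(f"uniform: {outcome_entropy(uniform_runs):.3f} bits")
```

Low entropy alongside a high percent incorrect would indicate a model that fails consistently rather than erratically, which is why the two metrics are usually read together.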

Context Entities

Metrics

Friedman test p-value, Wilcoxon signed-rank p-value