Many top multimodal LLMs ignore explicit 'no' constraints and still draw the excluded object

August 27, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.3

Citation Count

7

Authors

Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain

Why It Matters For Business

If your product depends on images that must exclude certain content (safety, branding, legal), current multimodal LLMs can silently fail and even claim they succeeded; add verification or blocklisting before shipping.

Summary TLDR

This paper documents a reproducible failure mode for multimodal LLMs: when prompted to generate images with explicit exclusions (prompts using 'no' or other negation), popular models often include the excluded element anyway. The authors tested GPT-4, Gemini, Copilot, and Llama-3 on five negation prompts in English, Hindi, and French, ran each prompt five times, and measured percent incorrect and semantic entropy. They found frequent errors, plus mismatches between the model's textual claim and the image it actually generated. Statistical tests showed differences between models that trended toward, but did not reach, significance at p<0.05. The authors suggest a negation-aware feedback loop between text and image generation as a mitigation direction.

Problem Statement

Modern multimodal LLMs frequently fail to follow simple exclusion instructions in image-generation prompts (phrases using 'no' or negation), producing images that still contain the excluded items and sometimes reporting in text that the exclusion was followed when the image contradicts it.

Main Contribution

Identified and named the 'NO Syndrome': a pattern where multimodal LLMs ignore explicit negation constraints in image prompts.

Systematic test: five negation prompts, each run five times on GPT-4, Gemini, and Copilot, with additional checks on Llama-3, in English, Hindi, and French.

Quantified failures with per-prompt incorrect counts and semantic entropy, and ran Friedman and Wilcoxon tests to compare models.

Key Findings

For the prompt 'Generate an image of an elephant with no tusks', no model produced a correct image in any tested run or language.

Numbers: 0/5 correct across tested runs and languages (Section 3.6; Table 1)

Popular LLMs (GPT-4, Copilot) frequently produced incorrect images on negation prompts.

Numbers: GPT-4 produced 4–5 incorrect images out of 5 on several queries (Table 1)

Textual confirmation and image output often disagreed: the model's text said the exclusion was followed while the image contained the excluded element.

Differences between models showed a trend but did not reach statistical significance at the 0.05 level.

Numbers: Friedman p = 0.056 (English); Wilcoxon p = 0.084 (GPT-4 vs Copilot)

Copilot showed higher response variability (entropy) in English and more consistently incorrect responses in some languages.

Numbers: Entropy patterns reported in Figure 4 (Copilot higher in English)
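As a rough illustration of the entropy metric, the sketch below computes Shannon entropy over outcome labels from repeated runs of a single prompt. It assumes each response has already been assigned to a meaning class (the labels here are hypothetical); the paper's exact grouping procedure is not reproduced.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the class distribution over repeated runs."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical labels for five runs of one prompt (not data from the paper).
consistent_runs = ["tusks"] * 5
mixed_runs = ["tusks", "tusks", "tusks", "tusks", "no_tusks"]

print(label_entropy(consistent_runs))       # zero entropy: every run fell in one class
print(round(label_entropy(mixed_runs), 3))  # higher: runs disagree with each other
```

Note that low entropy only means the model is consistent, not that it is correct: five identical wrong images also score zero.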

Results

Incorrect image counts per prompt (GPT-4)

Value: Q1: 4/5, Q2: 5/5, Q3: 5/5, Q4: 3/5, Q5: 2/5

Incorrect image counts per prompt (Copilot)

Value: Q1: 4/5, Q2: 5/5, Q3: 4/5, Q4: 3/5, Q5: 4/5
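The per-prompt counts above can be rolled up into an overall percent incorrect. The aggregate below is simple arithmetic on the listed counts (five runs per prompt), not a figure quoted from the paper.

```python
RUNS_PER_PROMPT = 5

# Incorrect-image counts per prompt (Q1..Q5), as listed in the Results above.
incorrect_counts = {
    "GPT-4": [4, 5, 5, 3, 2],
    "Copilot": [4, 5, 4, 3, 4],
}

def percent_incorrect(counts, runs_per_prompt=RUNS_PER_PROMPT):
    """Share of incorrect images across all prompts, as a percentage."""
    return 100.0 * sum(counts) / (runs_per_prompt * len(counts))

for model, counts in incorrect_counts.items():
    print(f"{model}: {percent_incorrect(counts):.0f}% incorrect")
```

On these counts, that works out to 19/25 (76%) for GPT-4 and 20/25 (80%) for Copilot.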

Friedman test (models on English prompts)

Value: p = 0.056

Baseline: significance threshold 0.05

Wilcoxon signed-rank (GPT-4 vs Copilot)

Value: p = 0.084

Baseline: significance threshold 0.05
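For readers who want to run a Friedman-style comparison on their own failure counts, here is a minimal pure-Python sketch for exactly three models. The counts are hypothetical placeholders, not the paper's data. The p-value uses the chi-square approximation with 2 degrees of freedom, whose survival function is exp(-Q/2); with only a handful of prompts this approximation is rough, and a statistics library such as SciPy (`scipy.stats.friedmanchisquare`) would be the usual choice in practice.

```python
import math

def average_ranks(values):
    """Rank values ascending, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_three_models(rows):
    """Friedman statistic and p-value; each row is one prompt, one value per model."""
    k = 3
    n = len(rows)
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        row = list(row)
        if len(row) != k:
            raise ValueError("each row needs exactly three values")
        ranks = average_ranks(row)
        for j in range(k):
            rank_sums[j] += ranks[j]
        for value in set(row):
            t = row.count(value)
            tie_term += t * (t * t - 1)  # zero when the value is untied
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all values tied within every prompt; test undefined")
    q = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    q /= correction
    p_value = math.exp(-q / 2)  # chi-square survival function for df = 2
    return q, p_value

# Hypothetical incorrect counts per prompt for three models (placeholders only).
rows = [(4, 1, 4), (5, 2, 5), (5, 1, 4), (3, 0, 3), (2, 1, 4)]
q, p = friedman_three_models(rows)
print(f"Q = {q:.3f}, p = {p:.4f}")
```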

Who Should Care

What To Try In 7 Days

Run a small negation test-suite (your critical exclusion cases) and record failure rates.

Add a visual verification step: run a lightweight vision detector to check excluded items in generated images.

If failures occur, implement a rule-based fallback (reject or re-prompt) and log examples for model vendor feedback.
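The three steps above could be wired together roughly as follows. This is a minimal sketch, not a vendor API: `generate_image` and `detect_labels` are hypothetical callables you would back with your own image model and vision detector, and the stubs exist only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

@dataclass
class ExclusionResult:
    image: Optional[bytes]  # accepted image, or None if every attempt failed
    attempts: int
    violations: List[Set[str]] = field(default_factory=list)  # one entry per failed attempt

def generate_with_exclusions(
    prompt: str,
    excluded: Set[str],
    generate_image: Callable[[str], bytes],      # hypothetical: prompt -> image bytes
    detect_labels: Callable[[bytes], Set[str]],  # hypothetical: image -> detected labels
    max_attempts: int = 3,
) -> ExclusionResult:
    """Generate, verify with a detector, then retry or reject on violation."""
    violations: List[Set[str]] = []
    for attempt in range(1, max_attempts + 1):
        image = generate_image(prompt)
        found = detect_labels(image) & excluded
        if not found:
            return ExclusionResult(image, attempt, violations)
        violations.append(found)  # keep examples for vendor feedback
    return ExclusionResult(None, max_attempts, violations)

# Stubs standing in for a real model and detector (assumptions for the demo).
def fake_generate(prompt: str) -> bytes:
    return b"image-with-tusks"

def fake_detect(image: bytes) -> Set[str]:
    return {"elephant", "tusks"}

result = generate_with_exclusions(
    "Generate an image of an elephant with no tusks",
    excluded={"tusks"},
    generate_image=fake_generate,
    detect_labels=fake_detect,
)
print(result.image is None, result.attempts, result.violations)
```

The check deliberately relies on an independent detector rather than the model's own text reply, since the text can claim compliance while the image violates the constraint.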

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Only a few LLMs tested (GPT-4, Gemini, Copilot, Llama-3) so results may not generalize to other or future models.
  • Limited languages (English, Hindi, French) and a small number of prompts and runs.
  • No causal analysis; the paper does not identify why the failure happens or which subsystem (text vs image) causes it.
  • Model updates after the study may change outcomes.

When Not To Use

  • Do not rely solely on LLM image outputs for exclusion-sensitive tasks (legal, safety, brand compliance).
  • Avoid using the model's text confirmation as a guarantee the image obeys constraints.

Failure Modes

  • Images include items that were explicitly excluded by the prompt.
  • Textual output claims the exclusion was followed while the image contradicts that claim.
  • Language support gaps: some models refused or failed on non-English prompts.
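When logging test runs, it can help to tag each one with the failure mode it exhibits. The sketch below is one possible scheme; the enum names and the three boolean inputs are assumptions about what a test harness might record, not something defined in the paper.

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    REFUSED = "refused_or_no_image"  # e.g. unsupported prompt language
    VIOLATION = "excluded_item_present"
    FALSE_CONFIRMATION = "text_claims_success_but_image_violates"

def classify_run(image_returned: bool,
                 excluded_item_detected: bool,
                 text_claims_compliance: bool) -> Outcome:
    """Map one logged run to a failure mode (hypothetical harness fields)."""
    if not image_returned:
        return Outcome.REFUSED
    if excluded_item_detected:
        if text_claims_compliance:
            return Outcome.FALSE_CONFIRMATION
        return Outcome.VIOLATION
    return Outcome.OK

# Example: image came back, detector found the excluded item, text said it complied.
print(classify_run(True, True, True).value)
```

Separating false confirmations from plain violations makes it easier to see how often the model's text reply would have misled a reviewer.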

Core Entities

Models

  • GPT-4
  • Gemini
  • Copilot
  • Llama-3

Metrics

  • percent incorrect
  • semantic entropy

Context Entities

Metrics

  • Friedman test p-value
  • Wilcoxon signed-rank p-value