Metamorphic tests show many LLM agents give different answers to the same problem when phrased differently

Overview

Decision Snapshot: Ready For Pilot

The study offers a clear, implementable test suite and reports statistically significant differences across models; however, the small problem corpus and single-shot sampling limit how far the results generalize.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Why It Matters For Business

Models that score well on standard tests may behave unpredictably when users reword requests; testing for semantic invariance prevents surprising errors in customer-facing or safety-critical agents.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist

Summary TLDR

The paper introduces a metamorphic testing framework that measures whether LLM-based reasoning agents give consistent outputs when a problem statement is rewritten in semantically equivalent ways. Across 7 models and 19 multi-step problems, smaller models sometimes produce more stable reasoning than larger ones. Contrastive prompts (which add plausible distractors) consistently degrade reasoning. The work exposes robustness patterns that standard accuracy benchmarks do not show and offers a practical test suite for evaluating agent reliability.

Problem Statement

LLM agents can change answers when the same problem is reworded. Standard accuracy benchmarks use fixed phrasing and miss this instability. The paper asks: how stable are reasoning agents under semantic-preserving rewrites, and which models or architectures are most vulnerable?

Main Contribution

A metamorphic testing framework with eight semantic-preserving transformations (identity, paraphrase, reorder facts, expand, contract, academic framing, business framing, contrastive).

A multi-model study across seven foundation models from four families (Hermes, Qwen3, DeepSeek-R1, gpt-oss) using 19 multi-step problems in eight domains.
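As a rough illustration of how the eight transformations could be organized into a test suite, here is a minimal Python sketch. The transformation names come from the list above; everything else (the `MetamorphicCase` record, the `Rewriter` signature, the `build_suite` helper, and the toy `tagging_rewriter`) is hypothetical and not taken from the paper. In practice the rewriter would be an LLM call or a hand-written rewrite.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Transformation names as listed in the summary above.
TRANSFORMATIONS: List[str] = [
    "identity", "paraphrase", "reorder_facts", "expand",
    "contract", "academic_framing", "business_framing", "contrastive",
]


@dataclass(frozen=True)
class MetamorphicCase:
    """One (problem, transformation) pair to send to a model. Hypothetical record."""
    problem_id: str
    transformation: str
    prompt: str


# Assumed signature: (transformation name, original text) -> rewritten text.
Rewriter = Callable[[str, str], str]


def build_suite(problems: Dict[str, str], rewrite: Rewriter) -> List[MetamorphicCase]:
    """Expand every problem into one case per transformation."""
    cases: List[MetamorphicCase] = []
    for problem_id, text in problems.items():
        for name in TRANSFORMATIONS:
            # Identity keeps the original wording and serves as the baseline.
            prompt = text if name == "identity" else rewrite(name, text)
            cases.append(MetamorphicCase(problem_id, name, prompt))
    return cases


def tagging_rewriter(name: str, text: str) -> str:
    """Toy stand-in for a real rewriter: it only tags the prompt."""
    return f"[{name}] {text}"


demo = build_suite(
    {"p1": "A train travels 120 km in 2 hours. What is its average speed?"},
    tagging_rewriter,
)
```

With 19 problems, this layout would yield 19 × 8 = 152 cases per model, which matches the single inference per problem-transformation pair noted under Limitations.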

Key Findings

A smaller model showed the highest robustness: Qwen3-30B-A3B had the best stability and trace similarity across transformations.

Numbers: Stability 79.6%; MAD 0.049; semantic similarity 0.914

Practical Use: Prefer model robustness metrics, not just size or raw score, when picking agents for variable-input tasks; test candidate models with semantic-preserving rewrites.

Evidence Ref: Table 3, Sec. 5

Contrastive (distractor) transformations consistently reduced scores across all models.

Numbers: Mean ∆ ranged from −0.088 (Qwen3-30B) to −0.449 (gpt-oss-120b)

Practical Use: Avoid exposing agents to contrastive or distractor-rich prompts in safety-critical paths; add checks or ensemble voting when input may contain misleading alternatives.

Evidence Ref: Finding 4, Fig. 4, Sec. 5
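The ensemble voting mentioned above could look something like the following sketch. It is one plausible reading, not the paper's method: it assumes answers are short strings that can be compared after trimming and lowercasing, and the `majority_answer` name and `min_agreement` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_answer(answers: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the most common normalized answer, or None when agreement is too low.

    `answers` might come from several models, or from one model run on
    several rewrites of the same input. Returning None signals that the
    case should be escalated (for example, to human review).
    """
    if not answers:
        return None
    # Naive normalization; real answers usually need task-specific parsing.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    # Require a strict majority above the threshold.
    return top if count / len(normalized) > min_agreement else None
```

Returning `None` instead of a best guess keeps low-agreement cases visible, which is the point of adding a check on inputs that may contain misleading alternatives.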

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Qwen3-30B-A3B: Stability Rate	79.6%	—	—	All transformations, 19 problems	Highest reported stability across models	Table 3
Qwen3-30B-A3B: MAD	0.049	—	—	All transformations	Lowest mean absolute score change	Table 3

What To Try In 7 Days

Run the paper's 8 metamorphic transformations on your top candidate model for a small set of representative tasks.

Measure Stability Rate and MAD; shortlist models with low MAD before deployment.

Add a lightweight filter that flags contrastive or distractor-rich inputs for human review or ensemble handling.
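For the measurement step, a minimal Python sketch of the two metrics follows. It assumes each delta is the transformed-prompt score minus the original-prompt score, and that a case counts as stable when |∆| < 0.05, as stated under Metrics. The function names are hypothetical and the scores in the example are made up for illustration; they are not results from the paper.

```python
from statistics import mean
from typing import Dict, List

STABILITY_THRESHOLD = 0.05  # a case is stable when |delta| < 0.05


def deltas(baseline: float, transformed: List[float]) -> List[float]:
    """Score change of each transformed prompt relative to the original prompt."""
    return [score - baseline for score in transformed]


def mad(ds: List[float]) -> float:
    """Mean Absolute Delta: average size of the score change, ignoring sign."""
    return mean(abs(d) for d in ds)


def stability_rate(ds: List[float]) -> float:
    """Fraction of cases whose score change stays under the threshold."""
    return sum(abs(d) < STABILITY_THRESHOLD for d in ds) / len(ds)


def rank_by_mad(model_deltas: Dict[str, List[float]]) -> List[str]:
    """Order candidate models from most to least stable (lowest MAD first)."""
    return sorted(model_deltas, key=lambda name: mad(model_deltas[name]))


# Illustrative scores only: one problem, original score 0.80, four rewrites.
example = deltas(0.80, [0.80, 0.78, 0.50, 0.82])
example_mad = mad(example)              # about 0.085
example_rate = stability_rate(example)  # 0.75 (three of four rewrites are stable)
```

Reporting both numbers is useful because they can disagree: a model with a high Stability Rate can still have a large MAD if the few unstable cases shift by a lot.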

Agent Features

Memory

Short-term reasoning traces (step sequences)

Planning

Chain-of-thought style step decomposition

Frameworks

Metamorphic testing (semantic invariance)

Is Agentic

Yes

Architectures

Dense Transformer (Hermes, gpt-oss), MoE

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Only 19 problems across eight domains; coverage is limited.

Single inference per problem-transformation pair; does not capture sampling variability.

When Not To Use

When you need absolute accuracy metrics across large public benchmarks (this complements, not replaces, such benchmarks).

When input rewrites intentionally change problem semantics (contrastive was included as a stress test only).

Failure Modes

Contrastive/distractor prompts causing major answer shifts.

Fact reordering breaking models that rely on presentation order.

Core Entities

Models

Hermes-4-70B, Hermes-4-405B, Qwen3-30B-A3B, Qwen3-235B-A22B, DeepSeek-R1-0528, gpt-oss-20b, gpt-oss-120b

Metrics

Solution-level semantic similarity, Score delta (∆), Mean Absolute Delta (MAD), Stability Rate (|∆|<0.05), Reasoning trace similarity, Accuracy
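The summary gives only the metric names and the 0.05 threshold, so the following definitions are a plausible reading rather than the paper's exact formulas. They assume $s_{p,t}$ is the score of problem $p$ under transformation $t$, $s_{p,0}$ is the score on the original wording, and $N$ is the number of problem-transformation pairs. The sign convention (transformed minus original) is inferred from the negative deltas reported for the contrastive transformation.

```latex
\Delta_{p,t} = s_{p,t} - s_{p,0}

\mathrm{MAD} = \frac{1}{N} \sum_{p,t} \left| \Delta_{p,t} \right|

\mathrm{Stability\ Rate} = \frac{1}{N} \sum_{p,t} \mathbf{1}\!\left[ \left| \Delta_{p,t} \right| < 0.05 \right]
```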

Datasets

19 multi-step reasoning problems corpus (Physics, Math, Chemistry, Economics, Statistics, Biology, C

Benchmarks

Metamorphic invariance suite (8 transformations)
