Metamorphic tests show many LLM agents give different answers to the same problem when phrased differently

March 13, 2026 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The study offers a clear, implementable test suite and reports statistically significant differences across models. However, the small problem corpus and single-shot sampling limit how far the results generalize.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 70%

Authors

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Why It Matters For Business

Models that score well on standard tests may behave unpredictably when users reword requests; testing for semantic invariance prevents surprising errors in customer-facing or safety-critical agents.

Summary TLDR

The paper introduces a metamorphic testing framework that measures whether LLM-based reasoning agents give consistent outputs when a problem statement is rewritten in semantically equivalent ways. Across 7 models and 19 multi-step problems, smaller models sometimes produce more stable reasoning than larger ones. Contrastive prompts (which add plausible distractors) consistently degrade reasoning. The work exposes robustness patterns that standard accuracy benchmarks do not reveal and offers a practical test suite for evaluating agent reliability.

Problem Statement

LLM agents can change answers when the same problem is reworded. Standard accuracy benchmarks use fixed phrasing and miss this instability. The paper asks: how stable are reasoning agents under semantic-preserving rewrites, and which models or architectures are most vulnerable?

Main Contribution

A metamorphic testing framework with eight semantic-preserving transformations (identity, paraphrase, reorder facts, expand, contract, academic framing, business framing, contrastive).

A multi-model study across seven foundation models from four families (Hermes, Qwen3, DeepSeek-R1, gpt-oss) using 19 multi-step problems in eight domains.
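The framework described above can be sketched as a small harness. This is a minimal illustration, not the authors' implementation: the names `Transformation`, `build_variants`, and `score_deltas` are hypothetical, the rewriter functions are assumed to be supplied by the caller (for example, an LLM-based paraphraser), and treating the identity variant as the baseline for the score delta is an assumption.

```python
from enum import Enum
from typing import Callable, Dict


class Transformation(Enum):
    """The eight transformations named in the summary."""
    IDENTITY = "identity"
    PARAPHRASE = "paraphrase"
    REORDER_FACTS = "reorder facts"
    EXPAND = "expand"
    CONTRACT = "contract"
    ACADEMIC_FRAMING = "academic framing"
    BUSINESS_FRAMING = "business framing"
    CONTRASTIVE = "contrastive"


# Hypothetical type: a rewriter maps a problem statement to a rewritten one.
Rewriter = Callable[[str], str]


def build_variants(
    problem: str, rewriters: Dict[Transformation, Rewriter]
) -> Dict[Transformation, str]:
    """Return the original problem plus one rewritten variant per rewriter."""
    variants = {Transformation.IDENTITY: problem}
    for transformation, rewrite in rewriters.items():
        if transformation is not Transformation.IDENTITY:
            variants[transformation] = rewrite(problem)
    return variants


def score_deltas(
    variants: Dict[Transformation, str], score_fn: Callable[[str], float]
) -> Dict[Transformation, float]:
    """Score each variant and subtract the identity score (assumed baseline)."""
    baseline = score_fn(variants[Transformation.IDENTITY])
    return {
        t: score_fn(text) - baseline
        for t, text in variants.items()
        if t is not Transformation.IDENTITY
    }
```

In practice `score_fn` would call the agent under test and grade its answer; here it is left abstract so the harness stays independent of any particular model API.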

Key Findings

A smaller model showed the highest robustness: Qwen3-30B-A3B had the best stability and trace similarity across transformations.

Numbers: Stability 79.6%; MAD 0.049; semantic similarity 0.914

Practical Use: Prefer model robustness metrics, not just size or raw score, when picking agents for variable-input tasks; test candidate models with semantic-preserving rewrites.

Evidence Ref: Table 3, Sec. 5

Contrastive (distractor) transformations consistently reduced scores across all models.

Numbers: Mean ∆ from −0.088 (Qwen3-30B) to −0.449 (gpt-oss-120b)

Practical Use: Avoid exposing agents to contrastive or distractor-rich prompts in safety-critical paths; add checks or ensemble voting when input may contain misleading alternatives.

Evidence Ref: Finding 4, Fig. 4, Sec. 5
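One way to read the ensemble-voting suggestion is a simple majority vote over answers collected from several runs or rewrites of the same request. The sketch below is an assumption about how that could look, not something the paper specifies: the function name `majority_answer`, the strict-majority threshold, and the whitespace and case normalization are all illustrative choices.

```python
from collections import Counter
from typing import List, Optional


def majority_answer(
    answers: List[str], min_agreement: float = 0.5
) -> Optional[str]:
    """Return the most common answer if it wins a strict majority.

    Returns None when there is no clear winner, which a caller could
    treat as a signal to escalate to human review.
    """
    if not answers:
        return None
    # Light normalization (assumed) so trivially different strings match.
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) > min_agreement:
        return answer
    return None
```

Exact string matching only suits short final answers; free-form responses would need a semantic comparison instead.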

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Qwen3-30B-A3B: Stability Rate | 79.6% | - | - | All transformations, 19 problems | Highest reported stability across models | Table 3
Qwen3-30B-A3B: MAD | 0.049 | - | - | All transformations | Lowest mean absolute score change | Table 3

What To Try In 7 Days

Run the paper's 8 metamorphic transformations on your top candidate model for a small set of representative tasks.

Measure Stability Rate and MAD; shortlist models with low MAD before deployment.

Add a lightweight filter that flags contrastive or distractor-rich inputs for human review or ensemble handling.
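The two metrics named in the steps above can be computed from a list of score deltas. The 0.05 stability threshold comes from the metric list on this page; everything else here is an assumption, including the function names and the choice to pool one delta per problem-transformation pair.

```python
from typing import Sequence


def mean_absolute_delta(deltas: Sequence[float]) -> float:
    """Mean Absolute Delta (MAD): average magnitude of the score change."""
    if not deltas:
        raise ValueError("deltas must not be empty")
    return sum(abs(d) for d in deltas) / len(deltas)


def stability_rate(deltas: Sequence[float], threshold: float = 0.05) -> float:
    """Fraction of deltas whose magnitude is strictly below the threshold."""
    if not deltas:
        raise ValueError("deltas must not be empty")
    stable = sum(1 for d in deltas if abs(d) < threshold)
    return stable / len(deltas)


# Illustrative values only, not results from the paper.
example_deltas = [0.0, 0.01, -0.2, 0.04]
print(f"MAD: {mean_absolute_delta(example_deltas):.4f}")
print(f"Stability Rate: {stability_rate(example_deltas):.2%}")
```

A lower MAD and a higher Stability Rate both indicate a model whose scores move less under rewrites.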

Agent Features

Memory
Short-term reasoning traces (step sequences)
Planning
Chain-of-thought style step decomposition
Frameworks
Metamorphic testing (semantic invariance)
Is Agentic

Yes

Architectures
Dense Transformer (Hermes, gpt-oss)
MoE

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only 19 problems across eight domains; coverage is limited.

Single inference per problem-transformation pair; does not capture sampling variability.
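A pilot can partly address the single-inference limitation by repeating each problem-transformation pair and reporting the spread of the resulting deltas. The sketch below assumes a caller-supplied `run_once` function (hypothetical) that performs one inference and returns one score delta; the default of five samples is arbitrary.

```python
import statistics
from typing import Callable, Tuple


def repeated_delta(
    run_once: Callable[[], float], n_samples: int = 5
) -> Tuple[float, float]:
    """Run the same problem-transformation pair several times.

    Returns (mean delta, sample standard deviation), so sampling noise
    can be told apart from a genuine sensitivity to the rewrite.
    """
    if n_samples < 2:
        raise ValueError("need at least 2 samples to estimate spread")
    deltas = [run_once() for _ in range(n_samples)]
    return statistics.mean(deltas), statistics.stdev(deltas)
```

If the standard deviation is comparable to the mean delta, the observed instability may be sampling noise rather than an effect of the transformation.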

When Not To Use

When you need absolute accuracy metrics across large public benchmarks (this complements, not replaces, such benchmarks).

When input rewrites intentionally change problem semantics (contrastive was included as a stress test only).

Failure Modes

Contrastive/distractor prompts causing major answer shifts.

Fact reordering breaking models that rely on presentation order.

Core Entities

Models

Hermes-4-70B, Hermes-4-405B, Qwen3-30B-A3B, Qwen3-235B-A22B, DeepSeek-R1-0528, gpt-oss-20b, gpt-oss-120b

Metrics

Solution-level semantic similarity
Score delta (∆)
Mean Absolute Delta (MAD)
Stability Rate (|∆| < 0.05)
Reasoning trace similarity
Accuracy

Datasets

19 multi-step reasoning problems corpus (Physics, Math, Chemistry, Economics, Statistics, Biology, C

Benchmarks

Metamorphic invariance suite (8 transformations)