Metamorphic tests show many LLM agents give different answers to the same problem when phrased differently

March 13, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

0

Authors

I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

Why It Matters For Business

Models that score well on standard tests may behave unpredictably when users reword requests; testing for semantic invariance prevents surprising errors in customer-facing or safety-critical agents.

Summary TLDR

The paper introduces a metamorphic testing framework that measures whether LLM-based reasoning agents give consistent outputs when a problem statement is rewritten in semantically equivalent ways. Across 7 models and 19 multi-step problems, smaller models sometimes produce more stable reasoning than larger ones. Contrastive prompts (which add plausible distractors) consistently break reasoning. The work exposes robustness patterns that standard accuracy benchmarks do not reveal and offers a practical test suite for evaluating agent reliability.

Problem Statement

LLM agents can change answers when the same problem is reworded. Standard accuracy benchmarks use fixed phrasing and miss this instability. The paper asks: how stable are reasoning agents under semantic-preserving rewrites, and which models or architectures are most vulnerable?

Main Contribution

A metamorphic testing framework with eight semantic-preserving transformations (identity, paraphrase, reorder facts, expand, contract, academic framing, business framing, contrastive).

A multi-model study across seven foundation models from four families (Hermes, Qwen3, DeepSeek-R1, gpt-oss) using 19 multi-step problems in eight domains.

Metrics for solution-level and trace-level invariance: semantic similarity, score delta, step accuracy, and stability rate.

Empirical findings showing scale does not predict robustness and that contrastive framing universally degrades reasoning.
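The semantic similarity metric listed above compares a model's answer on the original phrasing with its answer on a rewritten phrasing. The summary does not say how the paper computes it, so the sketch below swaps in a plain bag-of-words cosine similarity as a crude lexical stand-in; an embedding-based measure would likely be closer to what the authors intend. The function names `tokens` and `cosine_similarity` are illustrative, not taken from the paper.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(answer_a, answer_b):
    """Bag-of-words cosine similarity in [0, 1].

    Assumption: this is a lexical stand-in, not the paper's metric.
    """
    counts_a = Counter(tokens(answer_a))
    counts_b = Counter(tokens(answer_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty answer shares nothing with any other answer
    return dot / (norm_a * norm_b)


same = cosine_similarity("acceleration is 5", "Acceleration is 5")
disjoint = cosine_similarity("acceleration is 5", "velocity equals nine")
```

Lexical overlap underrates paraphrased answers that use different words, which is why this should be read only as a placeholder for a real semantic measure.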

Key Findings

A smaller model showed the highest robustness: Qwen3-30B-A3B had the best stability and trace similarity across transformations.

Numbers: Stability 79.6%; MAD 0.049; semantic similarity 0.914

Contrastive (distractor) transformations consistently reduced scores across all models.

Numbers: Mean ∆ ranged from −0.088 (Qwen3-30B) to −0.449 (gpt-oss-120b)

Model families show distinct vulnerability signatures: Hermes is vulnerable to contrastive framing, DeepSeek to fact reordering, and gpt-oss is unstable across many metamorphic relations (MRs).

Numbers: Hermes contrastive ∆ = −0.126 (70B); DeepSeek reorder ∆ = −0.171; gpt-oss contrastive ∆ = −0.449

Raw performance ranking (accuracy) differs from robustness ranking; larger size did not guarantee stability.

Numbers: Hermes-4-70B score 0.667 but MAD 0.086; Qwen3-30B score 0.514 but MAD 0.049

Results

Qwen3-30B-A3B: Stability Rate

Value: 79.6%

Qwen3-30B-A3B: MAD

Value: 0.049

gpt-oss-120b: Contrastive mean delta

Value: −0.449

Hermes-4-70B: Overall score

Value: 0.667

Semantic similarity range by model

Value: Qwen3-30B 0.914; gpt-oss-20b 0.527

What To Try In 7 Days

Run the paper's 8 metamorphic transformations on your top candidate model for a small set of representative tasks.

Measure Stability Rate and MAD; shortlist models with low MAD before deployment.

Add a lightweight filter that flags contrastive or distractor-rich inputs for human review or ensemble handling.
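The first step above can be prototyped with a small harness that rewrites each problem, re-runs the model, and records the mean score delta per transformation. The sketch below is a minimal illustration under stated assumptions: the problem dictionary layout, `render`, `grade`, and the deliberately order-sensitive `toy_solve` are all hypothetical, and only two of the eight transformations (identity and a simple reversal as one form of fact reordering) are shown. In practice `solve` would wrap a call to the model under test.

```python
import re
from statistics import mean


def identity(problem):
    """Baseline transformation: leave the problem unchanged."""
    return problem


def reorder_facts(problem):
    """Reverse the order of the stated facts; the question is untouched."""
    return {**problem, "facts": list(reversed(problem["facts"]))}


# Hypothetical registry; the other six transformations would be added here.
TRANSFORMS = {"identity": identity, "reorder": reorder_facts}


def render(problem):
    """Turn a structured problem into the prompt text sent to the model."""
    return " ".join(problem["facts"]) + " " + problem["question"]


def grade(problem, answer):
    """Score 1.0 for a numerically correct answer, else 0.0."""
    return 1.0 if abs(answer - problem["answer"]) < 1e-9 else 0.0


def mean_delta_by_transform(problems, solve, grade_fn):
    """Return {transformation name: mean score delta vs. the original phrasing}."""
    report = {}
    for name, transform in TRANSFORMS.items():
        deltas = []
        for p in problems:
            base = grade_fn(p, solve(render(p)))
            variant = grade_fn(p, solve(render(transform(p))))
            deltas.append(variant - base)
        report[name] = mean(deltas)
    return report


def toy_solve(prompt):
    """Stand-in for a model call: divides the second number by the first,
    so it silently depends on the order in which facts are presented."""
    nums = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", prompt)]
    return nums[1] / nums[0]


problems = [{
    "facts": ["The mass is 2 kg.", "The force is 10 N."],
    "question": "What is the acceleration?",
    "answer": 5.0,
}]
report = mean_delta_by_transform(problems, toy_solve, grade)
print(report)
```

The toy solver answers correctly on the original phrasing and fails once the facts are reversed, which is the kind of order dependence the reorder transformation is meant to surface.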

Agent Features

Memory

  • Short-term reasoning traces (step sequences)

Planning

  • Chain-of-thought style step decomposition

Frameworks

  • Metamorphic testing (semantic invariance)

Is Agentic

true

Architectures

  • Dense Transformer (Hermes, gpt-oss)
  • MoE

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only 19 problems across eight domains; coverage is limited.
  • Single inference per problem-transformation pair; does not capture sampling variability.
  • Some transformations were generated with LLM assistance, which may introduce stylistic bias.
  • Models accessed via a platform (Nebius); exact prompts and infra may affect reproducibility.

When Not To Use

  • When you need absolute accuracy metrics across large public benchmarks (this complements, not replaces, such benchmarks).
  • When input rewrites intentionally change problem semantics (contrastive was included as a stress test only).

Failure Modes

  • Contrastive/distractor prompts causing major answer shifts.
  • Fact reordering breaking models that rely on presentation order.
  • Verbosity overload where extra context degrades attention and reasoning.

Core Entities

Models

  • Hermes-4-70B
  • Hermes-4-405B
  • Qwen3-30B-A3B
  • Qwen3-235B-A22B
  • DeepSeek-R1-0528
  • gpt-oss-20b
  • gpt-oss-120b

Metrics

  • Solution-level semantic similarity
  • Score delta (∆)
  • Mean Absolute Delta (MAD)
  • Stability Rate (|∆|<0.05)
  • Reasoning trace similarity
  • Accuracy
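The score-based metrics above can be computed from paired scores on the original and transformed phrasings. The sketch below assumes that ∆ is the transformed score minus the original score and that MAD is the mean of |∆|; the 0.05 threshold comes from the Stability Rate entry in the list. The paper's exact definitions may differ, and the scores shown are made up for illustration.

```python
from statistics import mean

STABILITY_THRESHOLD = 0.05  # from the Stability Rate definition (|∆| < 0.05)


def score_deltas(baseline, transformed):
    """Per-item delta: transformed score minus baseline score (assumed sign)."""
    if len(baseline) != len(transformed):
        raise ValueError("score lists must have the same length")
    return [t - b for b, t in zip(baseline, transformed)]


def mean_absolute_delta(deltas):
    """MAD: average magnitude of the score change, ignoring direction."""
    return mean(abs(d) for d in deltas)


def stability_rate(deltas, threshold=STABILITY_THRESHOLD):
    """Fraction of items whose score moved by less than the threshold."""
    stable = sum(1 for d in deltas if abs(d) < threshold)
    return stable / len(deltas)


# Illustrative scores only, not results from the paper.
baseline = [0.80, 0.60, 0.90, 0.50]
transformed = [0.78, 0.30, 0.90, 0.58]

deltas = score_deltas(baseline, transformed)
mad = mean_absolute_delta(deltas)
rate = stability_rate(deltas)
print(f"MAD={mad:.3f}, stability rate={rate:.1%}")
```

MAD treats gains and losses alike, so a model whose score rises on some rewrites and falls on others still registers as unstable even if its mean ∆ is near zero.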

Datasets

  • 19 multi-step reasoning problems corpus (Physics, Math, Chemistry, Economics, Statistics, Biology, C

Benchmarks

  • Metamorphic invariance suite (8 transformations)