Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Models that score well on standard tests may behave unpredictably when users reword requests; testing for semantic invariance helps catch these failures before they reach customer-facing or safety-critical agents.
Summary TLDR
The paper introduces a metamorphic testing framework that measures whether LLM-based reasoning agents give consistent outputs when a problem statement is rewritten in a semantically equivalent form. Across 7 models and 19 multi-step problems, smaller models sometimes reason more stably than larger ones. Contrastive prompts (which add plausible distractors) degrade reasoning in every model tested. The work exposes robustness patterns that standard accuracy benchmarks miss and offers a practical test suite for evaluating agent reliability.
Problem Statement
LLM agents can change answers when the same problem is reworded. Standard accuracy benchmarks use fixed phrasing and miss this instability. The paper asks: how stable are reasoning agents under semantic-preserving rewrites, and which models or architectures are most vulnerable?
Main Contribution
A metamorphic testing framework with eight semantic-preserving transformations (identity, paraphrase, reorder facts, expand, contract, academic framing, business framing, contrastive).
A multi-model study across seven foundation models from four families (Hermes, Qwen3, DeepSeek-R1, gpt-oss) using 19 multi-step problems in eight domains.
Metrics for solution-level and trace-level invariance: semantic similarity, score delta, step accuracy, and stability rate.
Empirical findings showing scale does not predict robustness and that contrastive framing universally degrades reasoning.
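A minimal sketch of how such a metamorphic check could be wired up, not the paper's implementation. It covers only three of the eight transformations, and everything named here (`Problem`, `toy_agent`, `score`, the train example and its distractor) is hypothetical. The toy agent multiplies every number it sees, so it is invariant to fact reordering but breaks when a distractor adds an extra number.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem container: a tuple of facts plus a question."""
    facts: Tuple[str, ...]
    question: str

    def render(self) -> str:
        return " ".join(self.facts + (self.question,))


def identity(p: Problem) -> Problem:
    return p


def reorder_facts(p: Problem) -> Problem:
    # Reverse the fact order; the question stays last.
    return Problem(tuple(reversed(p.facts)), p.question)


def contrastive(p: Problem) -> Problem:
    # Append a plausible but irrelevant fact (illustrative distractor).
    distractor = "A second train nearby travels at 90 km/h."
    return Problem(p.facts + (distractor,), p.question)


TRANSFORMS: Dict[str, Callable[[Problem], Problem]] = {
    "identity": identity,
    "reorder_facts": reorder_facts,
    "contrastive": contrastive,
}


def toy_agent(prompt: str) -> int:
    """Stand-in for a real model call: multiplies every integer in the prompt."""
    product = 1
    for number in re.findall(r"\d+", prompt):
        product *= int(number)
    return product


def score(answer: int, expected: int) -> float:
    """Assumed scorer: 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if answer == expected else 0.0


def score_deltas(agent: Callable[[str], int], p: Problem, expected: int) -> Dict[str, float]:
    """Score delta per transformation: transformed score minus baseline score."""
    baseline = score(agent(p.render()), expected)
    return {
        name: score(agent(transform(p).render()), expected) - baseline
        for name, transform in TRANSFORMS.items()
    }


problem = Problem(
    facts=("The train travels at 60 km/h.", "The trip lasts 2 hours."),
    question="How far does the train travel, in km?",
)
deltas = score_deltas(toy_agent, problem, expected=120)
print(deltas)
```

In a real run, `toy_agent` would be replaced by a call to the model under test and `score` by whatever grading the evaluation uses; the per-transformation deltas are what the stability metrics aggregate.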
Key Findings
A smaller model showed the highest robustness: Qwen3-30B-A3B had the best stability and trace similarity across transformations.
Contrastive (distractor) transformations consistently reduced scores across all models.
Model families show distinct vulnerability signatures: Hermes is vulnerable to contrastive framing, DeepSeek to fact reordering, and gpt-oss is unstable across many metamorphic relations (MRs).
Raw performance ranking (accuracy) differs from robustness ranking; larger size did not guarantee stability.
Results
Qwen3-30B-A3B: Stability Rate
Qwen3-30B-A3B: MAD
gpt-oss-120b: Contrastive mean delta
Hermes-4-70B: Overall score
Semantic similarity range by model
Who Should Care
What To Try In 7 Days
Run the paper's 8 metamorphic transformations on your top candidate model for a small set of representative tasks.
Measure Stability Rate and Mean Absolute Delta (MAD); shortlist models with low MAD before deployment.
Add a lightweight filter that flags contrastive or distractor-rich inputs for human review or ensemble handling.
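The measurement and shortlisting steps above could look roughly like this. It is a sketch under assumptions: the model names and score deltas are made-up placeholders, and the `max_mad` cutoff of 0.05 is an arbitrary illustrative choice, not a value from the paper. Only the |Δ| < 0.05 stability criterion comes from the source.

```python
from statistics import mean
from typing import Dict, List

STABILITY_THRESHOLD = 0.05  # |delta| < 0.05 counts as stable


def mad(deltas: List[float]) -> float:
    """Mean Absolute Delta over all problem-transformation pairs."""
    return mean(abs(d) for d in deltas)


def stability_rate(deltas: List[float], threshold: float = STABILITY_THRESHOLD) -> float:
    """Fraction of pairs whose absolute score delta stays below the threshold."""
    return mean(1.0 if abs(d) < threshold else 0.0 for d in deltas)


def shortlist(per_model: Dict[str, List[float]], max_mad: float) -> List[str]:
    """Models whose MAD is at most max_mad, ordered from most to least stable."""
    kept = [m for m, deltas in per_model.items() if mad(deltas) <= max_mad]
    return sorted(kept, key=lambda m: mad(per_model[m]))


# Placeholder deltas for two hypothetical models (not results from the paper).
per_model = {
    "model_a": [0.0, 0.02, -0.03, -0.10],
    "model_b": [0.0, -0.20, 0.10, -0.30],
}

for name, deltas in per_model.items():
    print(name, round(mad(deltas), 4), stability_rate(deltas))

print(shortlist(per_model, max_mad=0.05))
```

Reporting both numbers is useful because they can disagree: a model can have a high Stability Rate and still carry a large MAD if a few transformations cause big swings.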
Agent Features
Memory
- Short-term reasoning traces (step sequences)
Planning
- Chain-of-thought style step decomposition
Frameworks
- Metamorphic testing (semantic invariance)
Is Agentic
true
Architectures
- Dense Transformer (Hermes)
- MoE (Qwen3, DeepSeek-R1, gpt-oss)
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only 19 problems across eight domains; coverage is limited.
- Single inference per problem-transformation pair; does not capture sampling variability.
- Some transformations were generated with LLM assistance, which may introduce stylistic bias.
- Models were accessed through a hosted platform (Nebius); exact prompts and infrastructure details may affect reproducibility.
When Not To Use
- When you need absolute accuracy metrics across large public benchmarks (this complements, not replaces, such benchmarks).
- When input rewrites intentionally change problem semantics (contrastive was included as a stress test only).
Failure Modes
- Contrastive/distractor prompts causing major answer shifts.
- Fact reordering breaking models that rely on presentation order.
- Verbosity overload where extra context degrades attention and reasoning.
Core Entities
Models
- Hermes-4-70B
- Hermes-4-405B
- Qwen3-30B-A3B
- Qwen3-235B-A22B
- DeepSeek-R1-0528
- gpt-oss-20b
- gpt-oss-120b
Metrics
- Solution-level semantic similarity
- Score delta (∆)
- Mean Absolute Delta (MAD)
- Stability Rate (|∆|<0.05)
- Reasoning trace similarity
- Accuracy
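This summary names the delta-based metrics without giving formulas. One plausible reading, assuming Δ_i is the transformed score minus the baseline score for the i-th of N problem-transformation pairs (the symbols s, T_i, x, and N are notation introduced here, not taken from the paper):

```latex
\Delta_i = s\bigl(T_i(x)\bigr) - s(x), \qquad
\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N} \lvert \Delta_i \rvert, \qquad
\text{Stability Rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl[\lvert \Delta_i \rvert < 0.05\bigr]
```

Under this reading, a lower MAD and a higher Stability Rate both indicate a more invariant model.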
Datasets
- 19 multi-step reasoning problems corpus (Physics, Math, Chemistry, Economics, Statistics, Biology, C
Benchmarks
- Metamorphic invariance suite (8 transformations)

