Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

May 19, 2025 · 7 min

Overview

Production Readiness

1

Novelty Score

1

Cost Impact Score

1

Citation Count

0

Authors

Narek Maloyan, Bislan Ashinov, Dmitry Namiot

Why It Matters For Business

Automated LLM judging can be biased by short adversarial suffixes, meaning model selection, moderation, or automated annotation pipelines may be unreliable without safeguards.

Summary TLDR

This paper finds that LLMs used as automatic evaluators (LLM-as-a-Judge) can be manipulated by appending short adversarial suffixes to candidate answers. The authors formalize two attacks: the Comparative Undermining Attack (CUA), which targets the final decision, and the Justification Manipulation Attack (JMA), which targets the model's reasoning. They use Greedy Coordinate Gradient (GCG) to craft the suffixes. Evaluated on MT-Bench pairwise data with two 3B open models (Qwen2.5-3B-Instruct and Falcon3-3B-Instruct), CUA reaches a 31–32% Attack Success Rate (ASR) and JMA 15–17%. Simple heuristics and random text score much lower (1–5%). The study highlights a significant risk for automated evaluation.

Problem Statement

LLM-as-a-Judge systems are used to compare and pick the better answer automatically. The paper asks: how easy is it for an attacker to change a judge's decision by appending adversarial text to one candidate? It focuses on two attack goals — flip the winner or corrupt the judge's justification — and measures success on real judge models using optimized suffixes.
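The attack surface described above can be illustrated with a minimal sketch of how a pairwise judge prompt might be assembled. The template wording, the `Pair` class, and the `build_prompt` helper are hypothetical names introduced for illustration; the paper's actual judge prompt is not reproduced in this summary.

```python
from dataclasses import dataclass

# Hypothetical judge prompt; an assumption, not the template used in the paper.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question "
    "and reply with 'A' or 'B'.\n\n"
    "Question: {question}\n\n"
    "Answer A: {answer_a}\n\n"
    "Answer B: {answer_b}\n\n"
    "Verdict:"
)


@dataclass
class Pair:
    question: str
    answer_a: str
    answer_b: str


def build_prompt(pair: Pair, suffix: str = "", target: str = "B") -> str:
    """Render the judge prompt, appending `suffix` to the targeted answer."""
    if target not in ("A", "B"):
        raise ValueError("target must be 'A' or 'B'")
    a, b = pair.answer_a, pair.answer_b
    if suffix:
        # The attacker controls only the text of one candidate answer.
        if target == "A":
            a = f"{a} {suffix}"
        else:
            b = f"{b} {suffix}"
    return JUDGE_TEMPLATE.format(question=pair.question, answer_a=a, answer_b=b)
```

The point of the sketch is that the suffix travels inside the candidate text, so the judge sees it as part of the answer unless the pipeline separates or sanitizes it.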

Main Contribution

Formalized two attack types on LLM judges: Comparative Undermining Attack (CUA) and Justification Manipulation Attack (JMA).

Adapted the Greedy Coordinate Gradient (GCG) token-level optimizer to craft adversarial suffixes that are appended to one answer.

Evaluated attacks on MT-Bench human pairwise judgments using two open-source 3B instruction-tuned models: Qwen2.5-3B-Instruct and Falcon3-3B-Instruct.

Compared optimized attacks against several controls: Random-Suffix, Token-Shuffle, and Hard Prompt, and against the JudgeDeceiver universal-template method.

Quantified effectiveness using Attack Success Rate (ASR) and demonstrated CUA as the most effective method (>30% ASR).
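As a rough illustration of the ASR metric, the sketch below counts how often a suffix flips a judge's verdict. It assumes ASR is measured only over pairs where the judge initially prefers the un-attacked answer; this summary does not state the paper's exact denominator. The `toy_judge` function is a stand-in for a real LLM judge, used only so the example runs without a model.

```python
from typing import Callable, Iterable, Tuple

# (question, answer_a, answer_b) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def attack_success_rate(
    pairs: Iterable[Tuple[str, str, str]], judge: Judge, suffix: str
) -> float:
    """Fraction of initially-lost comparisons that the suffix flips to B."""
    eligible = 0
    flipped = 0
    for question, a, b in pairs:
        if judge(question, a, b) != "A":
            continue  # assumption: only count pairs the judge first awards to A
        eligible += 1
        if judge(question, a, f"{b} {suffix}") == "B":
            flipped += 1
    return flipped / eligible if eligible else 0.0


def toy_judge(question: str, a: str, b: str) -> str:
    """Stand-in judge: prefers the longer answer, but is fooled by 'MAGIC'."""
    if "MAGIC" in b:
        return "B"
    return "A" if len(a) >= len(b) else "B"
```

Replacing `toy_judge` with a call into a real judge model gives a per-suffix ASR that can be compared across attack methods and controls.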

Key Findings

Optimized decision-targeting suffixes (CUA) flip judge choices frequently.

Numbers: CUA ASR: Qwen 31.2%, Falcon 32.4%

Manipulating the judge's generated reasoning helps but is weaker than direct decision targeting.

Numbers: JMA ASR: Qwen 15.2%, Falcon 16.7%

Simple heuristics and random text have minimal effect.

Numbers: Random 1.2–1.5%, Hard Prompt 5.1–5.4%

Universal template attacks (JudgeDeceiver) reach a moderate ASR without per-instance optimization, but trail CUA.

Numbers: JudgeDeceiver ASR: Qwen 22.8%, Falcon 24.1%

Token order matters: shuffled attack tokens lose most power.

Numbers: Token-Shuffle ASR: Qwen 2.8%, Falcon 3.1%
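The Random-Suffix and Token-Shuffle controls can be sketched as below. This is one plausible construction, not necessarily the paper's: it assumes the random control samples tokens uniformly from a vocabulary and the shuffle control permutes the tokens of an already optimized suffix. The function names and the toy tokens are hypothetical.

```python
import random
from typing import List, Sequence


def random_suffix(vocab: Sequence[str], length: int, rng: random.Random) -> List[str]:
    """Control 1: sample `length` tokens uniformly from the vocabulary."""
    return [rng.choice(vocab) for _ in range(length)]


def token_shuffle(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Control 2: same tokens as the optimized suffix, in a random order."""
    shuffled = list(tokens)  # copy so the optimized suffix is left untouched
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)
    optimized = ["tok1", "tok2", "tok3", "tok4"]  # placeholder tokens
    print(token_shuffle(optimized, rng))
    print(random_suffix(["a", "b", "c"], 4, rng))
```

Because the shuffled control keeps the exact token multiset, a large drop in ASR relative to the optimized suffix points to token order, not token identity, as the source of the attack's effect.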

Results

ASR by method

Method          Qwen2.5-3B-Instruct   Falcon3-3B-Instruct
Random-Suffix   1.2%                  1.5%
Token-Shuffle   2.8%                  3.1%
Hard Prompt     5.1%                  5.4%
JMA             15.2%                 16.7%
JudgeDeceiver   22.8%                 24.1%
CUA             31.2%                 32.4%

Baseline for CUA: Hard Prompt

Who Should Care

What To Try In 7 Days

Run targeted ASR checks: append known templates and optimized suffixes to test your judge on MT-Bench-style pairs.

Add simple input canonicalization: strip odd appended blocks and normalize candidate text before judging.

Compare LLM-judge outputs to a small human-validation set to estimate real-world vulnerability rate.
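A first-pass canonicalization step might look like the sketch below. It covers only the normalization part (Unicode folding, removal of invisible characters, whitespace collapsing, and a length cap); it does not detect optimized suffixes built from ordinary tokens, and the paper itself evaluates no defenses. The `canonicalize` name and the `max_chars` default are assumptions chosen for illustration.

```python
import re
import unicodedata


def canonicalize(text: str, max_chars: int = 4000) -> str:
    """Normalize a candidate answer before it is shown to the judge."""
    # Fold compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (Unicode category C*), keeping
    # newlines and tabs, so zero-width or NUL characters cannot hide text.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces/tabs and long runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Cap the length so an appended block cannot grow without bound.
    return text.strip()[:max_chars]
```

Applying the same function to both candidates keeps the comparison symmetric; re-running the ASR check before and after canonicalization gives a rough sense of how much, if anything, the step helps.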

Reproducibility

Data URLs

  • MT-Bench (LMSYS) referenced in paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only two 3B open-source judge models were evaluated; larger/closed models may behave differently.
  • Attacks are limited to appending fixed-length suffixes; other attack vectors (e.g., input permutation) were not explored.
  • No defenses were implemented or evaluated; recommendations are high-level.

When Not To Use

  • Do not rely solely on LLM-as-a-Judge for high-stakes decisions without human oversight or input sanitization.
  • Avoid using conclusions here to claim robustness of larger proprietary models without direct testing.

Failure Modes

  • Attacks may transfer differently to larger or differently fine-tuned judges.
  • Detection based only on token presence may miss optimized, ordered suffixes.
  • Paper does not evaluate adaptive attackers who try to evade proposed controls.

Core Entities

Models

  • Qwen2.5-3B-Instruct
  • Falcon3-3B-Instruct

Metrics

  • Attack Success Rate (ASR)

Datasets

  • MT-Bench Human Judgments (LMSYS)

Benchmarks

  • MT-Bench