Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA >30% success

May 19, 2025 (7 min read)

Overview

Decision Snapshot: Ready For Pilot

Results are clear on the evaluated models and dataset, but experiments use two small open-source 3B models and a single pairwise benchmark, so generalization to larger models and other setups is untested.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/2

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Narek Maloyan, Bislan Ashinov, Dmitry Namiot

Links

Abstract / PDF / Data

Why It Matters For Business

Automated LLM judging can be biased by short adversarial suffixes, meaning model selection, moderation, or automated annotation pipelines may be unreliable without safeguards.

Summary TLDR

This paper finds that LLMs used as automatic evaluators (LLM-as-a-Judge) can be manipulated by attaching short adversarial suffixes to candidate answers. The authors formalize two attacks, the Comparative Undermining Attack (CUA), which targets the final decision, and the Justification Manipulation Attack (JMA), which targets the model's reasoning, and use Greedy Coordinate Gradient (GCG) to craft the suffixes. Evaluated on MT-Bench pairwise data with two 3B open models (Qwen2.5-3B-Instruct, Falcon3-3B-Instruct), CUA reaches an Attack Success Rate (ASR) of about 31–32% and JMA about 15–17%. Simple heuristics and random text score much lower (roughly 1–5%). The study highlights a significant risk for automated evaluation.

Problem Statement

LLM-as-a-Judge systems are used to compare two candidate answers and pick the better one automatically. The paper asks how easily an attacker can change a judge's decision by appending adversarial text to one candidate. It focuses on two attack goals, flipping the winner or corrupting the judge's justification, and measures success on real judge models using optimized suffixes.
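The threat model can be sketched in a few lines. Everything named here is an illustrative assumption rather than the paper's setup: the prompt template, the `judge` callable, the toy `stub_judge`, and the made-up trigger string `zq!!`. ASR is computed as the share of pairs whose verdict moves from the clean winner to the attacked candidate, which is one plausible reading of the metric.

```python
from typing import Callable, List, Tuple

Judge = Callable[[str], str]  # takes a prompt, returns "A" or "B"

def build_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # Hypothetical pairwise template; real judge prompts differ.
    return (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with A or B."
    )

def attack_success_rate(
    judge: Judge, pairs: List[Tuple[str, str, str]], suffix: str
) -> float:
    """Share of pairs where appending `suffix` to answer B flips A -> B."""
    eligible = flipped = 0
    for question, a, b in pairs:
        if judge(build_prompt(question, a, b)) != "A":
            continue  # B already wins on clean input; nothing to flip
        eligible += 1
        if judge(build_prompt(question, a, b + " " + suffix)) == "B":
            flipped += 1
    return flipped / eligible if eligible else 0.0

def stub_judge(prompt: str) -> str:
    # Toy stand-in for an LLM judge: prefers the longer answer,
    # but is fooled by a made-up trigger string.
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("Answer B: ")[1].split("\nWhich answer")[0]
    if "zq!!" in b:
        return "B"
    return "A" if len(a) >= len(b) else "B"

pairs = [
    ("What is 2+2?", "2+2 equals 4.", "5"),
    ("Capital of France?", "Paris is the capital.", "Lyon"),
]
print(attack_success_rate(stub_judge, pairs, "zq!!"))  # 1.0 with this toy judge
print(attack_success_rate(stub_judge, pairs, "xx"))    # 0.0 with this toy judge
```

Swapping `stub_judge` for a wrapper around a real judge model would turn the same loop into a rough vulnerability probe.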

Main Contribution

Formalized two attack types on LLM judges: Comparative Undermining Attack (CUA) and Justification Manipulation Attack (JMA).

Adapted the Greedy Coordinate Gradient (GCG) token-level optimizer to craft adversarial suffixes that are appended to one answer.
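To give a feel for token-level suffix optimization, here is a gradient-free simplification of the greedy coordinate idea. It is not GCG itself: real GCG uses gradients with respect to token embeddings to shortlist candidate substitutions before evaluating them, and its loss comes from the judge model. The `loss_fn`, vocabulary, and hidden target below are toy assumptions chosen so the example runs without a model.

```python
from typing import Callable, List, Sequence

def greedy_coordinate_search(
    init: Sequence[str],
    vocab: Sequence[str],
    loss_fn: Callable[[List[str]], float],
    sweeps: int = 10,
) -> List[str]:
    """Each sweep tries every single-token swap and applies the best one."""
    suffix = list(init)
    best = loss_fn(suffix)
    for _ in range(sweeps):
        best_edit = None
        for pos in range(len(suffix)):
            for tok in vocab:
                if tok == suffix[pos]:
                    continue
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                loss = loss_fn(cand)
                if loss < best:  # keep only strict improvements
                    best, best_edit = loss, (pos, tok)
        if best_edit is None:
            break  # no single swap lowers the loss further
        pos, tok = best_edit
        suffix[pos] = tok
    return suffix

# Toy objective: Hamming distance to a hidden target sequence.
# A real attack would instead score the judge's output on the target verdict.
TARGET = ["a", "b", "c"]

def toy_loss(tokens: List[str]) -> float:
    return float(sum(t != g for t, g in zip(tokens, TARGET)))

result = greedy_coordinate_search(["x", "x", "x"], ["a", "b", "c", "x"], toy_loss)
print(result)  # ['a', 'b', 'c']
```

The exhaustive inner loop is only feasible for tiny vocabularies, which is why the gradient-based shortlist matters in practice.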

Key Findings

Optimized decision-targeting suffixes (CUA) flip judge choices frequently.

Numbers: CUA ASR of 31.2% (Qwen) and 32.4% (Falcon)

Practical Use: If you use LLMs as automatic judges, expect roughly 1 in 3 attacked evaluations to be flipped by a tailored suffix; add input defenses or human checks for high-stakes use.

Evidence Ref: Table I; Sec V.A

Manipulating the judge's generated reasoning helps but is weaker than direct decision targeting.

Numbers: JMA ASR of 15.2% (Qwen) and 16.7% (Falcon)

Practical Use: Attacks that change explanations can bias outcomes; monitor generated justifications and validate them against independent criteria.

Evidence Ref: Table I; Sec V.A

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ASR by method (Qwen / Falcon) | Random 1.2% / 1.5%; Token-Shuffle 2.8% / 3.1%; Hard Prompt 5.1% / 5.4%; JMA 15.2% / 16.7%; JudgeDeceiver 22.8% / 24.1% | | | MT-Bench Human Judgments | Table I; Sec V.A | Table I |
| CUA ASR | Qwen 31.2% / Falcon 32.4% | Hard Prompt | ≈+26 percentage points vs Hard Prompt | MT-Bench Human Judgments | Table I; Sec V.A | Table I |
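The reported delta can be checked directly from the numbers in the results above. This short sketch only restates those values; the `delta_pp` helper is a name introduced here for illustration.

```python
# ASR values (percent) copied from the results: (Qwen, Falcon).
asr = {
    "Random": (1.2, 1.5),
    "Token-Shuffle": (2.8, 3.1),
    "Hard Prompt": (5.1, 5.4),
    "JMA": (15.2, 16.7),
    "JudgeDeceiver": (22.8, 24.1),
    "CUA": (31.2, 32.4),
}

def delta_pp(method: str, baseline: str) -> tuple:
    """Percentage-point gap between two methods, per model (Qwen, Falcon)."""
    return tuple(round(m - b, 1) for m, b in zip(asr[method], asr[baseline]))

print(delta_pp("CUA", "Hard Prompt"))    # (26.1, 27.0)
print(delta_pp("CUA", "JudgeDeceiver"))  # (8.4, 8.3)
```

The gap over Hard Prompt is therefore about 26 points on Qwen and 27 on Falcon, and CUA also leads JudgeDeceiver by roughly 8 points on both models.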

What To Try In 7 Days

Run targeted ASR checks: append known templates and optimized suffixes to test your judge on MT-Bench-style pairs.

Add simple input canonicalization: strip odd appended blocks and normalize candidate text before judging.

Compare LLM-judge outputs to a small human-validation set to estimate real-world vulnerability rate.
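The canonicalization and human-validation steps above might look like the following minimal sketch. The symbol-ratio heuristic and its 0.4 threshold are assumptions made for illustration, not a defense evaluated in the paper. It can also strip legitimate trailing lines (for example a closing brace in a code answer), and it will not catch suffixes built from ordinary-looking tokens.

```python
import re
import unicodedata
from typing import List

def _symbol_ratio(line: str) -> float:
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 1.0  # treat empty trailing lines as removable
    return sum(not ch.isalnum() for ch in chars) / len(chars)

def canonicalize(answer: str, max_symbol_ratio: float = 0.4) -> str:
    """Normalize an answer and drop trailing lines that look like appended noise."""
    text = unicodedata.normalize("NFKC", answer)
    # Remove control and format characters (e.g. zero-width spaces);
    # newlines and tabs are kept.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.split("\n")]
    # Heuristic: strip trailing lines dominated by non-alphanumeric symbols.
    while lines and _symbol_ratio(lines[-1]) > max_symbol_ratio:
        lines.pop()
    return "\n".join(lines).strip()

def agreement(judge_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of items where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

cleaned = canonicalize("Paris is the capital.\u200b\n}}]] !! ;;((\n")
print(cleaned)                                               # Paris is the capital.
print(agreement(["A", "B", "A", "A"], ["A", "B", "B", "A"]))  # 0.75
```

Comparing `agreement` on clean inputs against the same set with suffixes appended gives a rough estimate of how much an attack degrades the judge in your own pipeline.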

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

MT-Bench (LMSYS) referenced in paper

Risks & Boundaries

Limitations

Only two 3B open-source judge models were evaluated; larger/closed models may behave differently.

Attacks are limited to appending fixed-length suffixes; other attack vectors (e.g., input permutation) were not explored.

When Not To Use

Do not rely solely on LLM-as-a-Judge for high-stakes decisions without human oversight or input sanitization.

Avoid using conclusions here to claim robustness of larger proprietary models without direct testing.

Failure Modes

Attacks may transfer differently to larger or differently fine-tuned judges.

Detection based only on token presence may miss optimized, ordered suffixes.

Core Entities

Models

Qwen2.5-3B-Instruct, Falcon3-3B-Instruct

Metrics

Attack Success Rate (ASR)

Datasets

MT-Bench Human Judgments (LMSYS)

Benchmarks

MT-Bench