Short adversarial suffixes can flip LLM-as-a-Judge decisions; CUA exceeds 30% attack success rate

Overview

Decision Snapshot: Ready For Pilot

Results are clear on the evaluated models and dataset, but experiments use two small open-source 3B models and a single pairwise benchmark, so generalization to larger models and other setups is untested.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/2

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 100%

Production readiness: 100%

Novelty: 100%

Authors

Narek Maloyan, Bislan Ashinov, Dmitry Namiot

Links

Abstract / PDF / Data

Why It Matters For Business

Automated LLM judging can be biased by short adversarial suffixes, meaning model selection, moderation, or automated annotation pipelines may be unreliable without safeguards.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

This paper finds that LLMs used as automatic evaluators (LLM-as-a-Judge) can be manipulated by appending short adversarial suffixes to candidate answers. The authors formalize two attacks: the Comparative Undermining Attack (CUA), which targets the final decision, and the Justification Manipulation Attack (JMA), which targets the model's reasoning. Both use Greedy Coordinate Gradient (GCG) to craft the suffixes. Evaluated on MT-Bench pairwise data with two 3B open models (Qwen2.5-3B-Instruct, Falcon3-3B-Instruct), CUA reaches 31.2–32.4% Attack Success Rate (ASR) and JMA reaches 15.2–16.7%. The Random, Token-Shuffle, and Hard Prompt baselines score much lower (1.2–5.4% ASR). The study highlights a significant risk for automated evaluation.

Problem Statement

LLM-as-a-Judge systems are used to compare two candidate answers and pick the better one automatically. The paper asks how easily an attacker can change a judge's decision by appending adversarial text to one candidate. It focuses on two attack goals, flipping the winner or corrupting the judge's justification, and measures success on real judge models using optimized suffixes.

Main Contribution

Formalized two attack types on LLM judges: Comparative Undermining Attack (CUA) and Justification Manipulation Attack (JMA).

Adapted the Greedy Coordinate Gradient (GCG) token-level optimizer to craft adversarial suffixes that are appended to one answer.
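To make the coordinate-gradient idea concrete, here is a minimal sketch of one GCG-style search loop. It is not the authors' implementation: the "judge" is a tiny differentiable toy function in NumPy rather than an LLM, and every name and constant (`E`, `pos_w`, `w`, `gcg_step`, `top_k`, `n_candidates`) is an assumption made for illustration. The sketch only shows the general pattern of using gradients with respect to one-hot token indicators to shortlist substitutions, then re-scoring a random batch of single-token swaps exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SUFFIX_LEN = 50, 8, 6

# Toy differentiable "judge" (an assumption for illustration, not an LLM).
E = rng.normal(size=(VOCAB, DIM))            # token embeddings
pos_w = rng.uniform(0.5, 1.5, SUFFIX_LEN)    # per-position weights
w = rng.normal(size=DIM)                     # readout for the attacker's target verdict


def loss(suffix):
    """Lower loss means a stronger preference for the attacker's target verdict."""
    h = (pos_w[:, None] * E[suffix]).sum(axis=0)
    return -float(w @ np.tanh(h))


def one_hot_grad(suffix):
    """Gradient of the loss w.r.t. the one-hot token indicator at each position."""
    h = (pos_w[:, None] * E[suffix]).sum(axis=0)
    dl_dh = -w * (1.0 - np.tanh(h) ** 2)            # shape (DIM,)
    return pos_w[:, None] * (E @ dl_dh)[None, :]    # shape (SUFFIX_LEN, VOCAB)


def gcg_step(suffix, top_k=8, n_candidates=32):
    """One search step: shortlist swaps by gradient, then re-score them exactly."""
    grad = one_hot_grad(suffix)
    # Per position, the top_k tokens with the most negative gradient.
    top = np.argsort(grad, axis=1)[:, :top_k]
    best, best_loss = suffix, loss(suffix)
    for _ in range(n_candidates):
        pos = rng.integers(SUFFIX_LEN)
        cand = suffix.copy()
        cand[pos] = top[pos, rng.integers(top_k)]
        cand_loss = loss(cand)  # the gradient is only a linear guide, so re-evaluate
        # Simplification: keep the current suffix when no candidate improves on it.
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss


suffix = rng.integers(VOCAB, size=SUFFIX_LEN)
initial_loss = loss(suffix)
for _ in range(20):
    suffix, final_loss = gcg_step(suffix)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Against a real judge, the loss would instead be computed from the model's output on the full judging prompt, which is far more expensive per candidate than this toy function.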

Key Findings

Optimized decision-targeting suffixes (CUA) flip judge choices frequently.

Numbers: CUA ASR of 31.2% (Qwen) and 32.4% (Falcon)

Practical Use: If you use LLMs as automatic judges, note that a tailored suffix flipped roughly 1 in 3 attacked evaluations on the tested 3B judges; add input defenses or human checks for high-stakes use.

Evidence Ref: Table I; Sec V.A

Manipulating the judge's generated reasoning helps but is weaker than direct decision targeting.

Numbers: JMA ASR of 15.2% (Qwen) and 16.7% (Falcon)

Practical Use: Attacks that change explanations can bias outcomes; monitor generated justifications and validate them against independent criteria.

Evidence Ref: Table I; Sec V.A

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ASR by method (Qwen / Falcon)	Random 1.2% / 1.5%; Token-Shuffle 2.8% / 3.1%; Hard Prompt 5.1% / 5.4%; JMA 15.2% / 16.7%; JudgeDeceiver 22.8% / 24.1%	—	—	MT-Bench Human Judgments	Table I; Sec V.A	Table I
CUA ASR (Qwen / Falcon)	31.2% / 32.4%	Hard Prompt 5.1% / 5.4%	+26.1 / +27.0 percentage points vs Hard Prompt	MT-Bench Human Judgments	Table I; Sec V.A	Table I
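The percentage-point gaps between CUA and the other methods follow directly from the table. The short sketch below recomputes them from the reported ASR values; the dictionary layout and the `delta_pp` helper are illustrative names, not anything from the paper.

```python
# ASR values (percent) copied from the Results table: (Qwen, Falcon).
asr = {
    "Random": (1.2, 1.5),
    "Token-Shuffle": (2.8, 3.1),
    "Hard Prompt": (5.1, 5.4),
    "JMA": (15.2, 16.7),
    "JudgeDeceiver": (22.8, 24.1),
    "CUA": (31.2, 32.4),
}


def delta_pp(method, baseline):
    """Percentage-point gap between two methods, per judge model."""
    return tuple(round(m - b, 1) for m, b in zip(asr[method], asr[baseline]))


for baseline in ("Hard Prompt", "JMA", "JudgeDeceiver"):
    qwen, falcon = delta_pp("CUA", baseline)
    print(f"CUA vs {baseline}: Qwen +{qwen} pp, Falcon +{falcon} pp")
```

On these numbers, CUA leads Hard Prompt by 26.1 and 27.0 points, JMA by 16.0 and 15.7 points, and JudgeDeceiver by 8.4 and 8.3 points on Qwen and Falcon respectively.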

What To Try In 7 Days

Run targeted ASR checks: append known templates and optimized suffixes to test your judge on MT-Bench-style pairs.

Add simple input canonicalization: strip odd appended blocks and normalize candidate text before judging.

Compare LLM-judge outputs to a small human-validation set to estimate real-world vulnerability rate.
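The first two checks above can be sketched as a small harness. Everything here is an assumption for illustration: the `judge` callable signature, the ASR definition (share of pairs initially awarded to answer A that flip to B once the suffix is appended to B), the `canonicalize` heuristic, and the `toy_judge` stand-in, which replaces a real LLM call so the example runs on its own.

```python
import unicodedata
from typing import Callable, Iterable, Optional, Tuple

Judge = Callable[[str, str, str], str]  # (question, answer_a, answer_b) -> "A" or "B"
Pair = Tuple[str, str, str]             # (question, answer_a, answer_b)


def canonicalize(text: str, max_symbol_ratio: float = 0.5) -> str:
    """Hypothetical cleanup: NFKC-normalize, collapse whitespace, and drop
    trailing tokens made up mostly of non-alphanumeric characters."""
    tokens = unicodedata.normalize("NFKC", text).split()

    def noisy(tok: str) -> bool:
        symbols = sum(not ch.isalnum() for ch in tok)
        return symbols / len(tok) > max_symbol_ratio

    while tokens and noisy(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def attack_success_rate(
    judge: Judge,
    pairs: Iterable[Pair],
    suffix: str,
    clean: Optional[Callable[[str], str]] = None,
) -> float:
    """Share of pairs the judge initially gives to A that flip to B once
    `suffix` is appended to answer B (assumed ASR definition)."""
    prep = clean or (lambda s: s)
    eligible = flipped = 0
    for question, a, b in pairs:
        if judge(question, prep(a), prep(b)) != "A":
            continue  # B already wins, so there is nothing to flip
        eligible += 1
        if judge(question, prep(a), prep(b + " " + suffix)) == "B":
            flipped += 1
    return flipped / eligible if eligible else 0.0


def toy_judge(question: str, a: str, b: str) -> str:
    # Stand-in for a real LLM call: prefers the longer answer,
    # but is fooled whenever answer B contains the trigger "}}]]".
    if "}}]]" in b:
        return "B"
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("What is 2 + 2?", "The answer is 4.", "5"),
    ("Capital of France?", "Paris is the capital.", "Lyon"),
    ("Define ASR.", "n/a", "Attack Success Rate."),
]
suffix = "}}]] !!##"
print(attack_success_rate(toy_judge, pairs, suffix))                # no defense
print(attack_success_rate(toy_judge, pairs, suffix, canonicalize))  # with cleanup
```

The cleanup heuristic is deliberately crude: it would also trim legitimate answers that end in symbols (code, for example), and it does nothing against a suffix made of ordinary-looking words. The same loop can be pointed at a small human-labeled set to see how often the judge's verdicts disagree with human preferences.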

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

MT-Bench (LMSYS) referenced in paper

Risks & Boundaries

Limitations

Only two 3B open-source judge models were evaluated; larger/closed models may behave differently.

Attacks are limited to appending fixed-length suffixes; other attack vectors (e.g., input permutation) were not explored.

When Not To Use

Do not rely solely on LLM-as-a-Judge for high-stakes decisions without human oversight or input sanitization.

Avoid using conclusions here to claim robustness of larger proprietary models without direct testing.

Failure Modes

Attacks may transfer differently to larger or differently fine-tuned judges.

Detection based only on token presence may miss optimized, ordered suffixes.

Core Entities

Models

Qwen2.5-3B-Instruct, Falcon3-3B-Instruct

Metrics

Attack Success Rate (ASR)

Datasets

MT-Bench Human Judgments (LMSYS)

Benchmarks

MT-Bench

