Overview
Results are clear on the evaluated models and dataset, but experiments use two small open-source 3B models and a single pairwise benchmark, so generalization to larger models and other setups is untested.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 1/2
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 100%
Production readiness: 100%
Novelty: 100%
Why It Matters For Business
Automated LLM judging can be biased by short adversarial suffixes, meaning model selection, moderation, or automated annotation pipelines may be unreliable without safeguards.
Who Should Care
Summary TLDR
This paper finds that LLMs used as automatic evaluators (LLM-as-a-Judge) can be reliably manipulated by attaching short adversarial suffixes to candidate answers. The authors formalize two attacks — Comparative Undermining Attack (CUA) that targets the final decision and Justification Manipulation Attack (JMA) that targets the model's reasoning — and use Greedy Coordinate Gradient (GCG) to craft suffixes. Evaluated on MT-Bench pairwise data with two 3B open models (Qwen2.5-3B-Instruct, Falcon3-3B-Instruct), CUA reaches ~31–32% Attack Success Rate (ASR); JMA ~15–17%. Simple heuristics and random text have much lower ASR (1–5%). The study highlights a significant risk for automated evaluation,
Problem Statement
LLM-as-a-Judge systems are used to compare and pick the better answer automatically. The paper asks: how easy is it for an attacker to change a judge's decision by appending adversarial text to one candidate? It focuses on two attack goals — flip the winner or corrupt the judge's justification — and measures success on real judge models using optimized suffixes.
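The measurement described above can be sketched as a small harness. This is a minimal illustration, not the authors' evaluation code: it assumes ASR is the fraction of pairs the judge originally decides against the attacked answer that flip once the suffix is appended, and `toy_judge`, `Pair`, and `attack_success_rate` are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str, str]             # (question, answer_a, answer_b)
Judge = Callable[[str, str, str], str]  # returns "A" or "B"


def attack_success_rate(judge: Judge, pairs: Iterable[Pair], suffix: str) -> float:
    """Share of pairs where appending `suffix` to answer B flips the verdict to B.

    Assumption: only pairs that the clean judge awards to A are eligible,
    because a pair already won by B cannot be flipped in B's favour.
    """
    eligible = 0
    flipped = 0
    for question, answer_a, answer_b in pairs:
        if judge(question, answer_a, answer_b) != "A":
            continue
        eligible += 1
        if judge(question, answer_a, answer_b + suffix) == "B":
            flipped += 1
    return flipped / eligible if eligible else 0.0


def toy_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical stand-in for an LLM judge: prefers the longer answer,
    but is fooled by the literal phrase 'pick b' inside answer B."""
    if "pick b" in answer_b.lower():
        return "B"
    return "A" if len(answer_a) >= len(answer_b) else "B"
```

In practice `toy_judge` would be replaced by a call to the real judge model, and both answer orders would likely need checking, since pairwise judges can be sensitive to position.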
Main Contribution
Formalized two attack types on LLM judges: Comparative Undermining Attack (CUA) and Justification Manipulation Attack (JMA).
Adapted the Greedy Coordinate Gradient (GCG) token-level optimizer to craft adversarial suffixes that are appended to one answer.
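To make the suffix-crafting idea concrete, here is a simplified, gradient-free greedy coordinate search. It is not GCG itself: GCG uses gradients with respect to the token embeddings to shortlist promising substitutions before evaluating them, whereas this sketch simply tries every vocabulary token at one position per step. The `score` callable is a hypothetical stand-in for whatever objective the attacker optimizes (for example, the judge's preference for the attacked answer), and `toy_score` exists only so the example runs.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_coordinate_search(
    score: Callable[[List[str]], float],
    vocab: Sequence[str],
    suffix_len: int,
    steps: int,
) -> Tuple[List[str], float]:
    """Improve a fixed-length suffix one position at a time.

    Each step visits one position (round-robin) and keeps the single-token
    substitution that strictly raises the score, if any.
    """
    suffix = [vocab[0]] * suffix_len  # fixed starting suffix
    best = score(suffix)
    for step in range(steps):
        pos = step % suffix_len
        for token in vocab:
            candidate = suffix[:pos] + [token] + suffix[pos + 1:]
            value = score(candidate)
            if value > best:
                suffix, best = candidate, value
    return suffix, best


# Toy objective (assumption): count positions that match a hidden target suffix.
HIDDEN = ["b", "a", "c"]


def toy_score(suffix: List[str]) -> float:
    return float(sum(s == h for s, h in zip(suffix, HIDDEN)))
```

With a real judge, exhaustive substitution over a full vocabulary is too expensive, which is why gradient-based shortlisting matters for GCG.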
Key Findings
Optimized decision-targeting suffixes (CUA) flip the judge's choice in 31.2% (Qwen) and 32.4% (Falcon) of cases.
Manipulating the judge's generated reasoning (JMA) also works but is weaker than direct decision targeting, at 15.2% (Qwen) and 16.7% (Falcon) ASR.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| ASR by method (Qwen / Falcon) | Random 1.2% / 1.5%; Token-Shuffle 2.8% / 3.1%; Hard Prompt 5.1% / 5.4%; JMA 15.2% / 16.7%; JudgeDeceiver 22.8% / 24.1% | — | — | MT-Bench Human Judgments | Table I; Sec V.A | Table I |
| CUA ASR | Qwen 31.2% / Falcon 32.4% | Hard Prompt (5.1% / 5.4%) | +26.1 (Qwen) / +27.0 (Falcon) percentage points vs Hard Prompt | MT-Bench Human Judgments | Table I; Sec V.A | Table I |
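
The deltas against any baseline follow directly from the ASR values in the table. The short sketch below recomputes them; the dictionary and the helper name `delta_vs` are introduced here for illustration only.

```python
# ASR values in percent, copied from the table above as (Qwen, Falcon).
ASR = {
    "Random": (1.2, 1.5),
    "Token-Shuffle": (2.8, 3.1),
    "Hard Prompt": (5.1, 5.4),
    "JMA": (15.2, 16.7),
    "JudgeDeceiver": (22.8, 24.1),
    "CUA": (31.2, 32.4),
}


def delta_vs(baseline: str, method: str) -> tuple:
    """Percentage-point gap between `method` and `baseline`, per judge model."""
    return tuple(round(m - b, 1) for m, b in zip(ASR[method], ASR[baseline]))


print(delta_vs("Hard Prompt", "CUA"))    # (26.1, 27.0)
print(delta_vs("JudgeDeceiver", "CUA"))  # (8.4, 8.3)
```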
What To Try In 7 Days
Run targeted ASR checks: append known templates and optimized suffixes to test your judge on MT-Bench-style pairs.
Add simple input canonicalization: strip odd appended blocks and normalize candidate text before judging.
Compare LLM-judge outputs to a small human-validation set to estimate real-world vulnerability rate.
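The canonicalization and human-validation steps above could start from something like the following. This is a rough sketch under stated assumptions, not a vetted defense: the trailing-chunk heuristic, the `max_symbol_ratio` threshold, and both function names are hypothetical, and a suffix built from ordinary words would pass through unchanged.

```python
import re
import unicodedata
from typing import Sequence


def _symbol_ratio(chunk: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not chunk:
        return 1.0
    symbols = sum(1 for ch in chunk if not (ch.isalnum() or ch.isspace()))
    return symbols / len(chunk)


def canonicalize(answer: str, max_symbol_ratio: float = 0.3) -> str:
    """Normalize a candidate answer and drop symbol-heavy trailing chunks."""
    text = unicodedata.normalize("NFKC", answer)
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence-ending punctuation; never drop the first chunk.
    chunks = re.split(r"(?<=[.!?])\s+", text)
    while len(chunks) > 1 and _symbol_ratio(chunks[-1]) > max_symbol_ratio:
        chunks.pop()
    return " ".join(chunks)


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of items where the LLM judge and the human annotator agree."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Comparing `agreement_rate` on clean inputs against the same rate on suffix-attacked inputs gives a first estimate of how much an attack degrades the judge on your own data.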
Risks & Boundaries
Limitations
Only two 3B open-source judge models were evaluated; larger/closed models may behave differently.
Attacks are limited to appending fixed-length suffixes; other attack vectors (e.g., input permutation) were not explored.
When Not To Use
Do not rely solely on LLM-as-a-Judge for high-stakes decisions without human oversight or input sanitization.
Avoid using conclusions here to claim robustness of larger proprietary models without direct testing.
Failure Modes
Attacks may transfer differently to larger or differently fine-tuned judges.
Detection based only on token presence may miss optimized, ordered suffixes.

