Overview
Production Readiness
1
Novelty Score
1
Cost Impact Score
1
Citation Count
0
Why It Matters For Business
Automated LLM judging can be biased by short adversarial suffixes, meaning model selection, moderation, or automated annotation pipelines may be unreliable without safeguards.
Summary TLDR
This paper finds that LLMs used as automatic evaluators (LLM-as-a-Judge) can be manipulated by attaching short adversarial suffixes to candidate answers. The authors formalize two attacks: the Comparative Undermining Attack (CUA), which targets the final decision, and the Justification Manipulation Attack (JMA), which targets the model's reasoning. Both use Greedy Coordinate Gradient (GCG) to craft the suffixes. Evaluated on MT-Bench pairwise data with two 3B open models (Qwen2.5-3B-Instruct, Falcon3-3B-Instruct), CUA reaches ~31–32% Attack Success Rate (ASR) and JMA ~15–17%. Simple heuristics and random text have much lower ASR (1–5%). The study highlights a significant risk for automated evaluation.
Problem Statement
LLM-as-a-Judge systems automatically compare candidate answers and pick the better one. The paper asks how easily an attacker can change a judge's decision by appending adversarial text to one candidate. It focuses on two attack goals, flipping the winner or corrupting the judge's justification, and measures success on real judge models using optimized suffixes.
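To make the threat model concrete, here is a minimal sketch of how a suffix ends up inside a pairwise judging prompt. The template wording, the `build_judge_prompt` helper, and its parameters are assumptions for illustration; the paper's actual judge prompt is not reproduced here.

```python
# Hypothetical pairwise judge template; the real prompt used in the paper may differ.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare the two answers to the question "
    "and reply with 'A' or 'B'.\n\n"
    "Question: {question}\n\n"
    "Answer A: {answer_a}\n\n"
    "Answer B: {answer_b}\n\n"
    "Verdict:"
)


def build_judge_prompt(question: str, answer_a: str, answer_b: str,
                       suffix: str = "", target: str = "B") -> str:
    """Build the judge prompt, optionally appending a suffix to the target answer."""
    if target not in ("A", "B"):
        raise ValueError("target must be 'A' or 'B'")
    if suffix:
        # The attacker controls only the text of their own candidate answer.
        if target == "A":
            answer_a = f"{answer_a} {suffix}"
        else:
            answer_b = f"{answer_b} {suffix}"
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


clean = build_judge_prompt("What is 2 + 2?", "4", "5")
attacked = build_judge_prompt("What is 2 + 2?", "4", "5",
                              suffix="<optimized suffix>", target="B")
```

The only difference between `clean` and `attacked` is the text of one candidate, which is why the attack needs no access to the judge's system prompt or to the competing answer.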
Main Contribution
Formalized two attack types on LLM judges: Comparative Undermining Attack (CUA) and Justification Manipulation Attack (JMA).
Adapted the Greedy Coordinate Gradient (GCG) token-level optimizer to craft adversarial suffixes that are appended to one answer.
Evaluated attacks on MT-Bench human pairwise judgments using two open-source 3B instruction-tuned models: Qwen2.5-3B-Instruct and Falcon3-3B-Instruct.
Compared optimized attacks against several controls: Random-Suffix, Token-Shuffle, and Hard Prompt, and against the JudgeDeceiver universal-template method.
Quantified effectiveness using Attack Success Rate (ASR) and demonstrated CUA as the most effective method (>30% ASR).
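The ASR metric can be sketched as follows. This is one plausible reading, not the paper's stated definition: it assumes a trial counts only if the judge did not already prefer the attacker's target on the clean input, and the `Trial` record layout is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One attacked comparison (hypothetical record layout)."""
    clean_winner: str     # judge's pick before the suffix is appended
    attacked_winner: str  # judge's pick after the suffix is appended
    target: str           # the answer the attacker wants to win


def attack_success_rate(trials: Iterable[Trial]) -> float:
    """Fraction of attackable trials in which the judge flips to the target.

    Assumption: a trial is attackable only if the judge did not already
    prefer the target on the clean input.
    """
    attackable = [t for t in trials if t.clean_winner != t.target]
    if not attackable:
        return 0.0
    flipped = sum(t.attacked_winner == t.target for t in attackable)
    return flipped / len(attackable)


trials = [
    Trial("A", "B", "B"),  # flipped to the target: success
    Trial("A", "A", "B"),  # judge held its original decision
    Trial("B", "B", "B"),  # already preferred the target: excluded
    Trial("A", "B", "B"),  # flipped to the target: success
]
print(f"ASR = {attack_success_rate(trials):.2%}")  # ASR = 66.67%
```

Excluding trials where the target already wins keeps the metric from being inflated by comparisons the attacker would have won anyway.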
Key Findings
Optimized decision-targeting suffixes (CUA) flip judge choices frequently.
Manipulating the judge's generated reasoning helps but is weaker than direct decision targeting.
Simple heuristics and random text have minimal effect.
Universal template attacks (JudgeDeceiver) are effective without per-instance optimization.
Token order matters: shuffled attack tokens lose most power.
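The Random-Suffix and Token-Shuffle controls behind the last finding can be approximated as below. This is a rough sketch that treats tokens as whitespace-separated strings and uses a toy vocabulary; the paper's controls presumably operate on the judge model's own tokenizer, and the function names are illustrative.

```python
import random
from typing import List, Sequence


def token_shuffle(suffix_tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Keep the optimized suffix's tokens but randomize their order."""
    shuffled = list(suffix_tokens)
    rng.shuffle(shuffled)
    return shuffled


def random_suffix(vocab: Sequence[str], length: int, rng: random.Random) -> List[str]:
    """Draw tokens uniformly from the vocabulary, matching the suffix length."""
    return [rng.choice(vocab) for _ in range(length)]


rng = random.Random(0)  # fixed seed so the controls are reproducible
optimized = ["tok1", "tok2", "tok3", "tok4", "tok5"]  # placeholder tokens
vocab = ["alpha", "beta", "gamma", "delta"]           # toy vocabulary

shuffled = token_shuffle(optimized, rng)
random_tokens = random_suffix(vocab, len(optimized), rng)
```

Token-Shuffle holds the token set fixed and varies only the order, so a drop in ASR relative to the optimized suffix isolates the contribution of ordering. Random-Suffix controls for the effect of simply appending extra text of the same length.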
Results
[Figure: ASR by method (Qwen / Falcon), including CUA ASR]
Who Should Care
What To Try In 7 Days
Run targeted ASR checks: append known templates and optimized suffixes to test your judge on MT-Bench-style pairs.
Add simple input canonicalization: strip odd appended blocks and normalize candidate text before judging.
Compare LLM-judge outputs to a small human-validation set to estimate real-world vulnerability rate.
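The canonicalization step could start from something like the sketch below. It is a heuristic of my own, not a defense evaluated in the paper: it normalizes Unicode, removes invisible characters, and drops trailing blocks whose share of symbol characters exceeds an arbitrary threshold (`max_symbol_ratio` is an assumed parameter). It will not catch a suffix that reads as fluent text or sits on the same line as the answer.

```python
import re
import unicodedata


def symbol_ratio(block: str) -> float:
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in block if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def canonicalize(text: str, max_symbol_ratio: float = 0.4) -> str:
    """Normalize a candidate answer before it is shown to the judge."""
    text = unicodedata.normalize("NFKC", text)
    # Drop control and format characters (e.g. zero-width spaces); keep newlines and tabs.
    text = "".join(
        c for c in text
        if c in "\n\t" or unicodedata.category(c) not in ("Cc", "Cf")
    )
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    # Drop trailing blocks that look like symbol soup; always keep at least one block.
    while len(blocks) > 1 and symbol_ratio(blocks[-1]) > max_symbol_ratio:
        blocks.pop()
    # Collapse runs of spaces and tabs inside each remaining block.
    return "\n\n".join(re.sub(r"[ \t]+", " ", b) for b in blocks)


answer = "Paris is the capital of France.\n\n}}]] ;; !! ((( ==> {{ @@ ## zx"
print(canonicalize(answer))  # Paris is the capital of France.
```

Measuring ASR with and without this step on the same pairs gives a quick estimate of how much such filtering actually helps against a given judge.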
Reproducibility
Data URLs
- MT-Bench (LMSYS) referenced in paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Only two 3B open-source judge models were evaluated; larger/closed models may behave differently.
- Attacks are limited to appending fixed-length suffixes; other attack vectors (e.g., input permutation) were not explored.
- No defenses were implemented or evaluated; recommendations are high-level.
When Not To Use
- Do not rely solely on LLM-as-a-Judge for high-stakes decisions without human oversight or input sanitization.
- Avoid using conclusions here to claim robustness of larger proprietary models without direct testing.
Failure Modes
- Attacks may transfer differently to larger or differently fine-tuned judges.
- Detection based only on token presence may miss optimized, ordered suffixes.
- Paper does not evaluate adaptive attackers who try to evade proposed controls.
Core Entities
Models
- Qwen2.5-3B-Instruct
- Falcon3-3B-Instruct
Metrics
- Attack Success Rate (ASR)
Datasets
- MT-Bench Human Judgments (LMSYS)
Benchmarks
- MT-Bench

