Overview
Production Readiness
0.2
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
If your product uses LLMs to rank or judge content, attackers can automatically generate short token suffixes that make the judge pick malicious or low-quality content. This can poison leaderboards, search results, automated labels for training, or tool selection.
Summary TLDR
The paper introduces JudgeDeceiver, an automatic, gradient-guided method that appends a short injected token sequence to an attacker-controlled candidate. On open-source judge models and two evaluation sets, the attack forces the judge to pick the attacker’s response with very high success (often >90%) and remains robust to response-order changes. Common defenses (known-answer checks, perplexity filters) miss many attacks. The authors release code and evaluate transferability, ablations, and three real scenarios: LLM-powered search, RLAIF, and tool selection.
Problem Statement
LLM-as-a-Judge systems pick the best answer from multiple candidates. If an attacker can add text to one candidate, can they reliably bias the judge to choose that candidate? Existing prompt-injection and jailbreak tricks are manual and brittle. The paper asks whether an optimization-based injected sequence can consistently manipulate judge outputs across unknown candidate sets and positions.
Main Contribution
JudgeDeceiver: the first optimization-based attack that automatically generates injected sequences to bias LLM-as-a-Judge.
A loss formulation combining target-aligned generation, positional (target-enhancement), and adversarial perplexity terms, solved with discrete gradient-guided search.
Extensive evaluation: multiple open-source LLM judges, two benchmarks (MT-Bench, LLMBar), transfer tests, and three real-world case studies (search, RLAIF, tool selection).
Demonstration that common detection defenses (known-answer, PPL, PPL-windowed) still miss a large fraction of attacks; code released.
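To make the shape of the loss formulation concrete, here is a minimal sketch of a weighted combination of three per-candidate-set terms. It is not the authors' code: the weights alpha and beta, their default values, the function names, and the choice to average over shadow candidate sets are all assumptions made for illustration. The three inputs stand in for the target-aligned generation, target-enhancement, and adversarial perplexity terms, which would be computed from the judge model in a real attack.

```python
from statistics import mean


def combined_loss(aligned, enhancement, perplexity, alpha=1.0, beta=0.1):
    """Weighted sum of the three loss terms for one shadow candidate set.

    alpha and beta are hypothetical weights, not values from the paper.
    """
    return aligned + alpha * enhancement + beta * perplexity


def objective(per_set_terms, alpha=1.0, beta=0.1):
    """Average the combined loss over several shadow candidate sets.

    per_set_terms: list of (aligned, enhancement, perplexity) tuples.
    Averaging (rather than summing) is an assumption of this sketch.
    """
    if not per_set_terms:
        raise ValueError("need at least one shadow candidate set")
    return mean(
        combined_loss(a, e, p, alpha, beta) for a, e, p in per_set_terms
    )
```

A discrete search would then look for the injected token sequence that minimizes this objective. Raising beta in such a setup would favor more natural-looking sequences at the possible expense of attack success.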
Key Findings
JudgeDeceiver yields high attack success rates against open-source judges.
The attack keeps working when response order changes.
JudgeDeceiver strongly outperforms manual prompt-injection baselines.
Common detection defenses still miss many attacks.
Attack transferability varies with model scale and source judge.
Attack effectiveness depends on shadow set size vs real candidate count.
Results
ASR (attack success rate)
PAC (positional attack consistency)
Comparison vs manual prompt attacks (best baseline ASR)
Known-answer detection failure
PPL-W detection miss rate
Transferability
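The two headline metrics can be computed from logged judge decisions. The sketch below assumes ASR is the fraction of trials in which the judge picks the attacker's candidate, and PAC is the fraction of trials in which the attack succeeds both before and after the target response's position is changed. Those readings, the function names, and the "target" label are assumptions of this example rather than definitions stated in this summary.

```python
def attack_success_rate(picks, target="target"):
    """Fraction of trials where the judge selected the attacker's candidate.

    picks: list of labels, one per trial, naming the candidate the judge chose.
    """
    if not picks:
        raise ValueError("no trials recorded")
    return sum(p == target for p in picks) / len(picks)


def positional_attack_consistency(paired_picks, target="target"):
    """Fraction of trials where the attack succeeds in both orderings.

    paired_picks: list of (pick_original_order, pick_swapped_order) tuples
    for the same query and candidate set.
    """
    if not paired_picks:
        raise ValueError("no paired trials recorded")
    both = sum(a == target and b == target for a, b in paired_picks)
    return both / len(paired_picks)
```

Under these definitions PAC can never exceed the ASR measured on the same trials, so a large gap between the two would indicate position-sensitive behavior.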
Who Should Care
What To Try In 7 Days
Audit recent judge decisions for suspicious clustering of a single submitter across queries.
Add human spot-checks for leaderboard entries and search filters, prioritizing high-impact queries.
Limit or sanitize untrusted candidate content before passing it to the judge (e.g., strip suspicious trailing tokens). Note that this is imperfect but reduces risk quickly.
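The audit step can start as a simple count over the judge's decision log. This sketch assumes the log can be reduced to (query_id, winning_submitter_id) pairs. The function name and both thresholds are hypothetical and would need tuning for a real system. A flagged submitter is a prompt for human review, not proof of an attack.

```python
from collections import defaultdict


def flag_dominant_submitters(decisions, min_queries=3, min_share=0.3):
    """Return submitters who win an unusually large share of distinct queries.

    decisions: iterable of (query_id, winning_submitter_id) pairs.
    min_queries and min_share are illustrative thresholds, not recommendations.
    """
    queries_won = defaultdict(set)
    all_queries = set()
    for query_id, submitter in decisions:
        queries_won[submitter].add(query_id)
        all_queries.add(query_id)

    flagged = {}
    for submitter, won in queries_won.items():
        share = len(won) / len(all_queries)
        if len(won) >= min_queries and share >= min_share:
            flagged[submitter] = share
    return flagged
```

Counting distinct queries rather than raw wins keeps repeated evaluations of the same query from inflating a submitter's share.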
Optimization Features
Token Efficiency
- 20-token suffix optimization (compact suffixes shown effective)
Reproducibility
Data Urls
- MT-Bench (public benchmark)
- LLMBar (public benchmark)
- HH-RLHF (public dataset)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Assumes attacker can submit or modify one candidate response and knows the target question-response pair.
- Evaluations focus on open-source judges; proprietary API-based judges may behave differently.
- Attack quality depends on shadow dataset size; larger real candidate pools can reduce effectiveness unless attacker invests more compute.
- Perplexity loss trades stealth for effectiveness; optimizing stealth may reduce ASR.
When Not To Use
- When all candidate responses are fully curated and not editable by external users.
- When the judge model is a closed proprietary LLM with unknown behavior and no public prompt template.
- When you cannot submit multiple iterative trials to learn the judge’s output template.
Failure Modes
- Human review or manual audits detect and override malicious selections.
- Aggressive input sanitization or truncation removes or neutralizes the injected suffix.
- Finetuning the judge on injection-aware data or using ensemble judges reduces single-vector attack success.
- Perplexity detectors tuned with representative adversarial data may detect some attacks.
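For reference, a windowed perplexity check can be sketched from per-token log-probabilities alone. This example assumes some language model has already supplied natural-log probabilities for each token of a candidate response. The window size, threshold, and function names are hypothetical. It illustrates the general idea of a windowed perplexity detector, not the exact detector evaluated in the paper.

```python
import math


def windowed_perplexities(token_logprobs, window=10):
    """Perplexity of every sliding window over a token log-prob sequence.

    Perplexity of a span is exp(-mean(log p)). Sequences shorter than the
    window are scored as a single span.
    """
    n = len(token_logprobs)
    if n == 0:
        return []
    if n <= window:
        spans = [token_logprobs]
    else:
        spans = [token_logprobs[i:i + window] for i in range(n - window + 1)]
    return [math.exp(-sum(span) / len(span)) for span in spans]


def is_flagged(token_logprobs, threshold, window=10):
    """Flag a response if any window's perplexity exceeds the threshold."""
    return any(p > threshold for p in windowed_perplexities(token_logprobs, window))
```

A windowed check can catch a short unnatural suffix that whole-sequence perplexity would average away. An attack that includes a perplexity term in its objective is built to keep these window scores low, which is consistent with the detection misses noted above.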
Core Entities
Models
- Mistral-7B-Instruct
- Llama-2-7B-chat
- Llama-3-8B-Instruct
- Openchat-3.5
- Vicuna-7B
- Vicuna-13B
- GPT-3.5-turbo
- GPT-4
Metrics
- ASR
- PAC
- ACC
- ASR-B
- FNR
- FPR
Datasets
- MT-Bench
- LLMBar
- HH-RLHF
- MetaTool (tool selection benchmark)
Benchmarks
- MT-Bench
- LLMBar
- MetaTool

