JudgeDeceiver: automatically crafted prompts that reliably trick LLM-as-a-Judge into picking an attacker’s response

March 26, 2024 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

2

Authors

Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong

Why It Matters For Business

If your product uses LLMs to rank or judge content, attackers can mass-produce short token suffixes that make the judge pick malicious or low-quality content. This can poison leaderboards, search results, automated labels for training, or tool selection.

Summary TLDR

The paper introduces JudgeDeceiver, an automatic, gradient-guided method that appends a short injected token sequence to an attacker-controlled candidate. On open-source judge models and two evaluation sets, the attack forces the judge to pick the attacker’s response with very high success (often >90%) and remains robust to response-order changes. Common defenses (known-answer checks, perplexity filters) miss many attacks. The authors release code and evaluate transferability, ablations, and three real scenarios: LLM-powered search, RLAIF, and tool selection.

Problem Statement

LLM-as-a-Judge systems pick the best answer from multiple candidates. If an attacker can add text to one candidate, can they reliably bias the judge to choose that candidate? Existing prompt-injection and jailbreak tricks are manual and brittle. The paper asks whether an optimization-based injected sequence can consistently manipulate judge outputs across unknown candidate sets and positions.

Main Contribution

JudgeDeceiver: the first optimization-based attack (per the authors) that automatically generates injected sequences to bias LLM-as-a-Judge.

A loss formulation combining target-aligned generation, positional (target-enhancement), and adversarial perplexity terms, solved with discrete gradient-guided search.

Extensive evaluation: multiple open-source LLM judges, two benchmarks (MT-Bench, LLMBar), transfer tests, and three real-world case studies (search, RLAIF, tool selection).

Demonstration that common detection defenses (known-answer, PPL, PPL-windowed) still miss a large fraction of attacks; code released.
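As a rough schematic of the loss combination listed in the contributions above, the three terms can be read as a weighted sum minimized over the injected sequence. The symbols below ($\delta$ for the injected sequence, $\alpha$ and $\beta$ for the weights) are notation assumed here for illustration; the exact form, and how the terms are accumulated over shadow candidate sets, should be taken from the paper itself.

```latex
\mathcal{L}_{\text{total}}(\delta)
  = \mathcal{L}_{\text{aligned}}(\delta)
  + \alpha \, \mathcal{L}_{\text{enhancement}}(\delta)
  + \beta \, \mathcal{L}_{\text{perplexity}}(\delta),
\qquad
\delta^{*} = \arg\min_{\delta} \; \mathcal{L}_{\text{total}}(\delta)
```

Under this reading, a larger $\beta$ pushes the suffix toward more natural-looking text, which matches the stealth versus effectiveness trade-off noted under Limitations.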

Key Findings

JudgeDeceiver yields high attack success rates against open-source judges.

Numbers: ASR = 90.8% (Mistral-7B, MT-Bench average)

The attack keeps working when response order changes.

Numbers: PAC = 83.4% (Mistral-7B, MT-Bench average)

JudgeDeceiver strongly outperforms manual prompt-injection baselines.

Numbers: Baseline max ASR ≤ 40.7%; JudgeDeceiver ASR up to 98.9%

Common detection defenses still miss many attacks.

Numbers: PPL-W misses 70% of attacks on Llama-3-8B while FPR < 1%

Attack transferability varies with model scale and source judge.

Numbers: Injected sequences from Llama-3-8B transfer with ASR = 70% to GPT-3.5 and ~99% to similar-scale Llama models

Attack effectiveness depends on shadow set size vs real candidate count.

Numbers: ASR drops when the evaluation candidate count n exceeds the shadow count m; maintaining m ≥ n keeps ASR high
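The windowed perplexity defense (PPL-W) mentioned in the findings above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities have already been obtained from some reference language model, and the function names, window size, and threshold are hypothetical.

```python
import math
from typing import List


def windowed_perplexities(token_logprobs: List[float], window: int) -> List[float]:
    """Perplexity of every contiguous window of `window` tokens.

    `token_logprobs` are natural-log probabilities from a reference LM
    (assumed to be computed elsewhere).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(token_logprobs) <= window:
        spans = [token_logprobs]  # short input: treat it as one window
    else:
        spans = [
            token_logprobs[i:i + window]
            for i in range(len(token_logprobs) - window + 1)
        ]
    # Perplexity = exp(-mean log-probability); skip empty spans.
    return [math.exp(-sum(s) / len(s)) for s in spans if s]


def flag_by_windowed_ppl(token_logprobs: List[float], window: int, threshold: float) -> bool:
    """Flag a candidate if any window's perplexity exceeds `threshold`."""
    ppls = windowed_perplexities(token_logprobs, window)
    return bool(ppls) and max(ppls) > threshold


# Hypothetical example: fluent text followed by a low-probability suffix.
clean = [-1.0] * 30
suspicious = [-1.0] * 20 + [-6.0] * 10
print(flag_by_windowed_ppl(clean, window=10, threshold=50.0))
print(flag_by_windowed_ppl(suspicious, window=10, threshold=50.0))
```

In practice the threshold would be calibrated on clean responses to hit a target false-positive rate. As the reported numbers indicate, a detector of this kind can still miss many optimized suffixes, so it is best treated as one signal among several.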

Results

ASR (attack success rate)

Value: 90.8% average (Mistral-7B, MT-Bench)

PAC (positional attack consistency)

Value: 83.4% average (Mistral-7B, MT-Bench)

Comparison vs manual prompt attacks (best baseline ASR)

Value: ≤ 40.7% (manual best)

Known-answer detection failure

Value: FNR = 100% on LLMBar (cannot detect optimized injected sequences)

PPL-W detection miss rate

Value: FNR = 70% (Llama-3-8B, PPL-W, FPR < 1%)

Transferability

Value: ASR = 70% to GPT-3.5 (injected sequence optimized on Llama-3-8B)
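To make the metrics in the results above concrete, here is a small sketch of how they could be computed from logged trials. It assumes ASR is the fraction of trials where the judge picks the attacker's response, and PAC is the fraction where it does so both before and after the response positions are swapped; the `Trial` record and function names are hypothetical, and the paper's exact definitions take precedence.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    picked_target_original: bool  # judge chose the attacker's response in the original order
    picked_target_swapped: bool   # judge chose it again after positions were swapped


def _rate(hits: int, total: int) -> float:
    if total == 0:
        raise ValueError("need at least one sample")
    return hits / total


def asr(trials: Sequence[Trial]) -> float:
    """Attack success rate: share of trials where the target response wins."""
    return _rate(sum(t.picked_target_original for t in trials), len(trials))


def pac(trials: Sequence[Trial]) -> float:
    """Positional attack consistency: target wins in both orderings."""
    return _rate(
        sum(t.picked_target_original and t.picked_target_swapped for t in trials),
        len(trials),
    )


def fnr(flagged_attacks: Sequence[bool]) -> float:
    """False negative rate: share of attacked inputs the detector did not flag."""
    return _rate(sum(not f for f in flagged_attacks), len(flagged_attacks))


def fpr(flagged_clean: Sequence[bool]) -> float:
    """False positive rate: share of clean inputs the detector flagged."""
    return _rate(sum(bool(f) for f in flagged_clean), len(flagged_clean))
```

By construction PAC can never exceed ASR under these definitions, which is consistent with the reported 83.4% versus 90.8%.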

Who Should Care

What To Try In 7 Days

Audit recent judge decisions for suspicious clustering of a single submitter across queries.

Add human spot-checks for leaderboard entries and search filters, prioritizing high-impact queries.

Limit or sanitize untrusted candidate content before passing it to the judge (e.g., strip suspicious trailing tokens). This is imperfect but reduces risk quickly.
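The first audit step above can be approximated with a few lines of log analysis. The sketch below assumes judge decisions are available as `(query_id, winning_submitter_id)` pairs; that log format, the function name, and the thresholds are all hypothetical and would need adapting to your own system.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def flag_dominant_submitters(
    decisions: Iterable[Tuple[Hashable, Hashable]],
    min_queries: int = 5,
    max_share: float = 0.5,
) -> Dict[Hashable, float]:
    """Return submitters who win on an unusually large share of distinct queries.

    `decisions` holds (query_id, winning_submitter_id) pairs from judge logs.
    A submitter is flagged when they won at least `min_queries` distinct
    queries and their share of all distinct queries exceeds `max_share`.
    """
    queries_won: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    all_queries: Set[Hashable] = set()
    for query_id, submitter_id in decisions:
        queries_won[submitter_id].add(query_id)
        all_queries.add(query_id)

    if not all_queries:
        return {}

    flagged: Dict[Hashable, float] = {}
    for submitter_id, won in queries_won.items():
        share = len(won) / len(all_queries)
        if len(won) >= min_queries and share > max_share:
            flagged[submitter_id] = share
    return flagged


# Hypothetical log: one submitter wins 8 of 10 distinct queries.
log = [(f"q{i}", "mallory") for i in range(8)] + [("q8", "alice"), ("q9", "bob")]
print(flag_dominant_submitters(log))
```

A flag here is only a prompt for human review: a legitimately strong submitter can also dominate, so the thresholds should be tuned against historical data before acting on them.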

Optimization Features

Token Efficiency

  • 20-token suffix optimization (compact suffixes shown effective)

Reproducibility

Data Urls

  • MT-Bench (public benchmark)
  • LLMBar (public benchmark)
  • HH-RLHF (public dataset)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Assumes attacker can submit or modify one candidate response and knows the target question-response pair.
  • Evaluations focus on open-source judges; proprietary API-based judges may behave differently.
  • Attack quality depends on shadow dataset size; larger real candidate pools can reduce effectiveness unless attacker invests more compute.
  • Perplexity loss trades stealth for effectiveness; optimizing stealth may reduce ASR.

When Not To Use

  • When all candidate responses are fully curated and not editable by external users.
  • When the judge model is a closed proprietary LLM with unknown behavior and no public prompt template.
  • When you cannot submit multiple iterative trials to learn the judge’s output template.

Failure Modes

  • Human review or manual audits detect and override malicious selections.
  • Aggressive input sanitization or truncation removes or neutralizes the injected suffix.
  • Finetuning the judge on injection-aware data or using ensemble judges reduces single-vector attack success.
  • Perplexity detectors tuned with representative adversarial data may detect some attacks.

Core Entities

Models

  • Mistral-7B-Instruct
  • Llama-2-7B-chat
  • Llama-3-8B-Instruct
  • Openchat-3.5
  • Vicuna-7B
  • Vicuna-13B
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • ASR
  • PAC
  • ACC
  • ASR-B
  • FNR
  • FPR

Datasets

  • MT-Bench
  • LLMBar
  • HH-RLHF
  • MetaTool (tool selection benchmark)

Benchmarks

  • MT-Bench
  • LLMBar
  • MetaTool