Prompt injections can flip automated LLM judges—attacks succeed up to ~74% and committees fix much of it

April 25, 20257 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Narek Maloyan, Dmitry Namiot

Links

Abstract / PDF

Why It Matters For Business

Automated LLM judges can be manipulated by prompt injections, risking wrong evaluations; use committees and layered defenses for high-stakes scoring.

Summary TLDR

The paper evaluates prompt-injection attacks against LLMs used as automated judges. Using four attack styles (including a new Adaptive Search-Based Attack, ASA) across five judge models and four tasks, the authors show attacks can change scores often (ASR up to 73.8%). Smaller open-source judges were more vulnerable than frontier models. Attacks transfer well between similar models. No single defense stops all attacks, but layered defenses and mixed-model committees (5–7 diverse models) cut success rates to ~10–27%. Code and datasets are released.

Problem Statement

LLM-based automated judges are convenient but may be manipulated by malicious inputs (prompt injections). The paper asks: how effective are different injection attacks, how well do defenses work, how transferable are attacks across models, and what practical defenses reduce risk?

Main Contribution

A structured attack framework separating content-author vs system-prompt attacks.

Analysis of top Kaggle competition solutions and distilled practical attack patterns.

A new Adaptive Search-Based Attack (ASA) using genetic search over prompt components.

Large-scale experiments: 5 judge models, 4 tasks, 50 trials per condition, and bootstrap CIs.

Systematic defense evaluation and quantitative evidence that mixed-model committees improve robustness.

Key Findings

Adaptive Search-Based Attack (ASA) is the most effective attack across models.

NumbersASR 42.9–73.8% (Table I); avg 56.2% (Table VIII)

Simple direct instruction injections still work frequently.

NumbersBasic Injection average ASR 46.3% (Table VIII); per-model up to 66.7% (Table I)

Open-source judge models are much more vulnerable than frontier models.

NumbersOpen-source ASR 50.6–68.1% vs frontier 27.0–44.3% (Table VIII)

Attacks transfer strongly between similar open-source models.

NumbersOpen→Open transfer TSR 50.5–62.6% (Table IV)

System-prompt attacks beat content-author attacks by a large margin.

NumbersSystem-prompt ASR higher by ~15 percentage points (Table V)

Mixed-model committees and layered defenses substantially reduce attack success.

Numbers7-model mixed committee ASR reduced to 10.2–19.3% (Table VII); combined defenses evasion 18.5–42.1% (Table VI)

Results

Highest per-model ASR (ASA)

Value73.8% (Gemma-3-4B-Instruct, ASA)

Contextual Misdirection ASR (best case)

Value67.7% (Gemma-3-27B-Instruct, CM)

Open→Open transfer success

Value62.6% (BI) / 61.0% (CM)

Frontier model ASR range

Value28.6–45.7% across attacks (GPT-4, Claude-3-Opus)

7-model mixed committee ASR

Value10.2–19.3% (depends on attack; ASA worst at 19.3%)

Combined defenses evasion (all defenses active)

Value18.5–42.1% evasion rates across attacks

Who Should Care

What To Try In 7 Days

Run simple Basic Injection and ASA attacks on your judge pipeline to measure vulnerability.

Audit and lock down system prompts and templates; rotate access and log changes.

Add instruction-filter regexes and a perplexity-based sanitizer to the input pipeline and test bypass rates.

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations limited to text-only tasks and the selected models; multimodal judges not tested.
  • Threat models assume single-step injections; multi-stage real-world attacks may behave differently.
  • Defense implementations cover common approaches but not all novel or proprietary mitigations.

When Not To Use

  • If your evaluation pipeline is already fully offline and air-gapped from external inputs.
  • For models or tasks not represented in the experiments (e.g., multimodal, vision+text judges).

Failure Modes

  • Adaptive attackers can evolve to bypass regex filters and perplexity checks.
  • Committee defense fails if committee members share the same vulnerability family.
  • System-prompt compromise yields much higher success than content-only attacks.

Core Entities

Models

  • Gemma-3-27B-Instruct
  • Gemma-3-4B-Instruct
  • Llama-3.2-3B-Instruct
  • GPT-4
  • Claude-3-Opus

Metrics

  • Attack Success Rate (ASR)
  • Manipulation Magnitude (MM)
  • Transfer Success Rate (TSR)
  • Detection Resistance (DR)

Datasets

  • ppe human preference (Anthropic HH-RLHF)
  • search arena v1 7k
  • mt bench
  • code review (custom, 500 problems)

Benchmarks

  • LLMs: You Can't Please Them All (Kaggle)