Overview
The experiments are broad and statistically rigorous, showing clear attack and defense trends; applicability depends on model set and real-world pipeline differences.
Citations0
Evidence Strength0.90
Confidence0.85
Risk Signals8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 50%
Novelty: 60%
Why It Matters For Business
Automated LLM judges can be manipulated by prompt injections, risking wrong evaluations; use committees and layered defenses for high-stakes scoring.
Who Should Care
Summary TLDR
The paper evaluates prompt-injection attacks against LLMs used as automated judges. Using four attack styles (including a new Adaptive Search-Based Attack, ASA) across five judge models and four tasks, the authors show attacks can change scores often (ASR up to 73.8%). Smaller open-source judges were more vulnerable than frontier models. Attacks transfer well between similar models. No single defense stops all attacks, but layered defenses and mixed-model committees (5–7 diverse models) cut success rates to ~10–27%. Code and datasets are released.
Problem Statement
LLM-based automated judges are convenient but may be manipulated by malicious inputs (prompt injections). The paper asks: how effective are different injection attacks, how well do defenses work, how transferable are attacks across models, and what practical defenses reduce risk?
Main Contribution
A structured attack framework separating content-author vs system-prompt attacks.
Analysis of top Kaggle competition solutions and distilled practical attack patterns.
Key Findings
Adaptive Search-Based Attack (ASA) is the most effective attack across models.
Simple direct instruction injections still work frequently.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Highest per-model ASR (ASA) | 73.8% (Gemma-3-4B-Instruct, ASA) | — | — | Table I (averaged across tasks) | ASA reached 73.8% ASR on Gemma-3-4B-Instruct | Table I |
| Contextual Misdirection ASR (best case) | 67.7% (Gemma-3-27B-Instruct, CM) | — | — | Table I (averaged across tasks) | CM achieved 67.7% ASR versus Gemma-3-27B-Instruct | Table I |
What To Try In 7 Days
Run simple Basic Injection and ASA attacks on your judge pipeline to measure vulnerability.
Audit and lock down system prompts and templates; rotate access and log changes.
Add instruction-filter regexes and a perplexity-based sanitizer to the input pipeline and test bypass rates.
Reproducibility
Risks & Boundaries
Limitations
Evaluations limited to text-only tasks and the selected models; multimodal judges not tested.
Threat models assume single-step injections; multi-stage real-world attacks may behave differently.
When Not To Use
If your evaluation pipeline is already fully offline and air-gapped from external inputs.
For models or tasks not represented in the experiments (e.g., multimodal, vision+text judges).
Failure Modes
Adaptive attackers can evolve to bypass regex filters and perplexity checks.
Committee defense fails if committee members share the same vulnerability family.

