Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Automated LLM judges can be manipulated by prompt injections, risking wrong evaluations; use committees and layered defenses for high-stakes scoring.
Summary TLDR
The paper evaluates prompt-injection attacks against LLMs used as automated judges. Using four attack styles (including a new Adaptive Search-Based Attack, ASA) across five judge models and four tasks, the authors show attacks can change scores often (ASR up to 73.8%). Smaller open-source judges were more vulnerable than frontier models. Attacks transfer well between similar models. No single defense stops all attacks, but layered defenses and mixed-model committees (5–7 diverse models) cut success rates to ~10–27%. Code and datasets are released.
Problem Statement
LLM-based automated judges are convenient but may be manipulated by malicious inputs (prompt injections). The paper asks: how effective are different injection attacks, how well do defenses work, how transferable are attacks across models, and what practical defenses reduce risk?
Main Contribution
A structured attack framework separating content-author vs system-prompt attacks.
Analysis of top Kaggle competition solutions and distilled practical attack patterns.
A new Adaptive Search-Based Attack (ASA) using genetic search over prompt components.
Large-scale experiments: 5 judge models, 4 tasks, 50 trials per condition, and bootstrap CIs.
Systematic defense evaluation and quantitative evidence that mixed-model committees improve robustness.
Key Findings
Adaptive Search-Based Attack (ASA) is the most effective attack across models.
Simple direct instruction injections still work frequently.
Open-source judge models are much more vulnerable than frontier models.
Attacks transfer strongly between similar open-source models.
System-prompt attacks beat content-author attacks by a large margin.
Mixed-model committees and layered defenses substantially reduce attack success.
Results
Highest per-model ASR (ASA)
Contextual Misdirection ASR (best case)
Open→Open transfer success
Frontier model ASR range
7-model mixed committee ASR
Combined defenses evasion (all defenses active)
Who Should Care
What To Try In 7 Days
Run simple Basic Injection and ASA attacks on your judge pipeline to measure vulnerability.
Audit and lock down system prompts and templates; rotate access and log changes.
Add instruction-filter regexes and a perplexity-based sanitizer to the input pipeline and test bypass rates.
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluations limited to text-only tasks and the selected models; multimodal judges not tested.
- Threat models assume single-step injections; multi-stage real-world attacks may behave differently.
- Defense implementations cover common approaches but not all novel or proprietary mitigations.
When Not To Use
- If your evaluation pipeline is already fully offline and air-gapped from external inputs.
- For models or tasks not represented in the experiments (e.g., multimodal, vision+text judges).
Failure Modes
- Adaptive attackers can evolve to bypass regex filters and perplexity checks.
- Committee defense fails if committee members share the same vulnerability family.
- System-prompt compromise yields much higher success than content-only attacks.
Core Entities
Models
- Gemma-3-27B-Instruct
- Gemma-3-4B-Instruct
- Llama-3.2-3B-Instruct
- GPT-4
- Claude-3-Opus
Metrics
- Attack Success Rate (ASR)
- Manipulation Magnitude (MM)
- Transfer Success Rate (TSR)
- Detection Resistance (DR)
Datasets
- ppe human preference (Anthropic HH-RLHF)
- search arena v1 7k
- mt bench
- code review (custom, 500 problems)
Benchmarks
- LLMs: You Can't Please Them All (Kaggle)

