Prompt injections can flip automated LLM judges—attacks succeed up to ~74% and committees fix much of it

Overview

Decision SnapshotReady For Pilot

The experiments are broad and statistically rigorous, showing clear attack and defense trends; applicability depends on model set and real-world pipeline differences.

Citations0

Evidence Strength0.90

Confidence0.85

Risk Signals8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Narek Maloyan, Dmitry Namiot

Links

Abstract / PDF

Why It Matters For Business

Automated LLM judges can be manipulated by prompt injections, risking wrong evaluations; use committees and layered defenses for high-stakes scoring.

Who Should Care

CTO Product Manager ML Engineer Engineering Lead Founder

Summary TLDR

The paper evaluates prompt-injection attacks against LLMs used as automated judges. Using four attack styles (including a new Adaptive Search-Based Attack, ASA) across five judge models and four tasks, the authors show attacks can change scores often (ASR up to 73.8%). Smaller open-source judges were more vulnerable than frontier models. Attacks transfer well between similar models. No single defense stops all attacks, but layered defenses and mixed-model committees (5–7 diverse models) cut success rates to ~10–27%. Code and datasets are released.

Problem Statement

LLM-based automated judges are convenient but may be manipulated by malicious inputs (prompt injections). The paper asks: how effective are different injection attacks, how well do defenses work, how transferable are attacks across models, and what practical defenses reduce risk?

Main Contribution

A structured attack framework separating content-author vs system-prompt attacks.

Analysis of top Kaggle competition solutions and distilled practical attack patterns.

Key Findings

Adaptive Search-Based Attack (ASA) is the most effective attack across models.

NumbersASR 42.9–73.8% (Table I); avg 56.2% (Table VIII)

Practical UseExpect optimized, iterative prompt-search attacks to be the hardest to block; include adversarial testing with adaptive attacks before deployment.

Evidence RefTable I, Table VIII, Abstract

Simple direct instruction injections still work frequently.

NumbersBasic Injection average ASR 46.3% (Table VIII); per-model up to 66.7% (Table I)

Practical UseDo not assume simple filters suffice: add layered detection and explicit instruction sanitization for user-submitted content.

Evidence RefTable I, Table VIII, Discussion VI.A.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Highest per-model ASR (ASA)	73.8% (Gemma-3-4B-Instruct, ASA)	—	—	Table I (averaged across tasks)	ASA reached 73.8% ASR on Gemma-3-4B-Instruct	Table I
Contextual Misdirection ASR (best case)	67.7% (Gemma-3-27B-Instruct, CM)	—	—	Table I (averaged across tasks)	CM achieved 67.7% ASR versus Gemma-3-27B-Instruct	Table I

What To Try In 7 Days

Run simple Basic Injection and ASA attacks on your judge pipeline to measure vulnerability.

Audit and lock down system prompts and templates; rotate access and log changes.

Add instruction-filter regexes and a perplexity-based sanitizer to the input pipeline and test bypass rates.

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Risks & Boundaries

Limitations

Evaluations limited to text-only tasks and the selected models; multimodal judges not tested.

Threat models assume single-step injections; multi-stage real-world attacks may behave differently.

When Not To Use

If your evaluation pipeline is already fully offline and air-gapped from external inputs.

For models or tasks not represented in the experiments (e.g., multimodal, vision+text judges).

Failure Modes

Adaptive attackers can evolve to bypass regex filters and perplexity checks.

Committee defense fails if committee members share the same vulnerability family.

Core Entities

Models

Gemma-3-27B-InstructGemma-3-4B-InstructLlama-3.2-3B-InstructGPT-4Claude-3-Opus

Metrics

Attack Success Rate (ASR)Manipulation Magnitude (MM)Transfer Success Rate (TSR)Detection Resistance (DR)

Datasets

ppe human preference (Anthropic HH-RLHF)search arena v1 7kmt benchcode review (custom, 500 problems)

Benchmarks

LLMs: You Can't Please Them All (Kaggle)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Adaptive Search-Based Attack (ASA) is the most effective attack across models.

Simple direct instruction injections still work frequently.

Results

What To Try In 7 Days

Reproducibility

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Make LLM judges reliable and cheaper: EIF and tuned PPI give tighter, valid evaluation with few labels

Key finding

Model rankings from LLM judges become less biased and get valid confidence intervals when you down-weight unreliable judges

Key finding

Large Llama models beat smaller ones on iCliniq medical QA; LLM judges help measure clinical quality

Key finding

A judge-free, n-gram benchmark that approximates GPT-4 judging for Japanese Q&A

Key finding

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Key finding