Overview
The framework is practical and was tested on six models and a public jailbreak corpus. It needs a reliable judge (the authors used GPT-4) and would benefit from human checks and broader dataset releases.
Citations: 3
Evidence Strength: 0.60
Confidence: 0.90
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.
Who Should Care
Summary TLDR
This paper introduces AttackEval: a practical 0–1 scoring framework plus a curated ground-truth dataset for measuring how effectively jailbreak prompts bypass LLM safety. It offers a coarse-grained, weighted system-level score and a fine-grained per-model score (with and without ground truth). Evaluated on six popular models and a 666-prompt jailbreak collection, AttackEval aligns with binary baselines but surfaces prompts that binary metrics miss. Political-lobbying prompts are consistently the most effective attack category. The method works best with a reliable judge (the authors use GPT-4) and a curated ground truth.
Problem Statement
Current jailbreak evaluations are mostly binary (success/fail) and often rely on a single LLM judge. That masks partial or borderline jailbreaks and may misclassify harmful prompts. We need a nuanced, repeatable way to score prompt effectiveness and a ground-truth set to benchmark evaluations.
Main Contribution
A two-level evaluation framework: coarse-grained system score and fine-grained per-model score (with/without ground truth).
A ground-truth dataset built from a 666-prompt jailbreak collection and curated model answers used as reference solutions.
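The coarse-grained, weighted system-level score can be pictured as a weighted average of per-model outcomes. The sketch below is only an illustration under stated assumptions: this summary does not give the paper's actual weighting scheme, so the function name `coarse_grained_score`, the model names, and the weights are all hypothetical.

```python
from typing import Mapping


def coarse_grained_score(
    per_model_score: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of per-model scores, kept in the 0-1 range.

    Assumption: each per-model score is already in [0, 1] and each
    model has a non-negative weight. The real weighting used by the
    paper is not stated in this summary.
    """
    if set(per_model_score) != set(weights):
        raise ValueError("scores and weights must cover the same models")
    if any(not 0.0 <= s <= 1.0 for s in per_model_score.values()):
        raise ValueError("per-model scores must lie in [0, 1]")
    if any(w < 0.0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight total keeps the result in [0, 1].
    return sum(weights[m] * per_model_score[m] for m in weights) / total


# Hypothetical example: a prompt that fully bypasses model_a but not model_b.
example = coarse_grained_score(
    {"model_a": 1.0, "model_b": 0.0},
    {"model_a": 3.0, "model_b": 1.0},
)
```

Normalizing by the weight total means the weights can be given on any scale (for example, relative robustness or deployment share) without pushing the score outside 0–1.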
Key Findings
AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.
Political lobbying prompts are the most effective attack category across evaluations.
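To see why continuous scores can agree with a binary baseline on average while still carrying extra information, consider the minimal sketch below. The scores are made-up illustration values, and the 0.5 threshold and helper names (`binary_asr`, `intermediate_scores`) are assumptions, not part of AttackEval.

```python
from statistics import mean
from typing import Sequence


def binary_asr(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Attack success rate after collapsing each score to success/fail."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


def intermediate_scores(scores: Sequence[float]) -> list[float]:
    """Scores strictly between 0 and 1, which a binary label cannot express."""
    return [s for s in scores if 0.0 < s < 1.0]


# Made-up scores for five prompts (illustration only).
scores = [0.0, 0.25, 0.5, 0.75, 1.0]

asr = binary_asr(scores)             # one aggregate success rate
avg = mean(scores)                   # continuous average, close to the ASR
partial = intermediate_scores(scores)  # the prompts a binary metric flattens
```

In this toy case the two aggregates are close, yet three of the five prompts sit at intermediate values that a success/fail label would hide.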
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Coarse-grained score, top scenario (Political Lobbying) | 0.65 | 0.66 (baseline ASR) | −0.01 | All prompts | Coarse-grained averages across 13 scenarios | Table 2 |
| Fine-grained score with ground truth (GPT-3.5, Financial Advice) | 0.57 ± 0.04 | ASR baseline comparison in Table 3 | — | Per model, per scenario | Fine-grained scores with ground truth | Table 3 |
What To Try In 7 Days
Run AttackEval on your top 100 user prompts to find borderline jailbreaks.
Add a small ground-truth set (3 answers per harmful question) and compare scores with/without it.
Prioritize hardening and monitoring for political-lobbying prompts first.
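The steps above can be sketched as a small triage script, assuming you already have per-prompt scores from some scoring run. Everything here is hypothetical scaffolding: the record layout, the `min_gap` threshold, and the function names are illustration choices, not an AttackEval API.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Mapping


def rank_categories(records: Iterable[tuple[str, float]]) -> list[tuple[str, float]]:
    """Group (category, score) pairs and rank categories by mean score, highest first."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, score in records:
        by_category[category].append(score)
    ranked = ((c, mean(v)) for c, v in by_category.items())
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def ground_truth_gaps(
    with_gt: Mapping[str, float],
    without_gt: Mapping[str, float],
    min_gap: float = 0.25,  # assumed threshold; tune for your data
) -> list[str]:
    """Prompt ids whose score shifts by at least min_gap once ground truth is used."""
    shared = with_gt.keys() & without_gt.keys()
    return sorted(p for p in shared if abs(with_gt[p] - without_gt[p]) >= min_gap)
```

Categories at the top of the ranking are candidates for hardening first, and prompts with large with/without-ground-truth gaps are worth a manual look, since the two scoring modes disagree about them.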
Risks & Boundaries
Limitations
Relies heavily on GPT-4 as an automated judge; judge bias can affect scores.
Ground-truth answers are selected using GPT-4 and curation, which can introduce reference bias.
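One way to gauge judge bias is to spot-check a sample of automated scores against human ratings on the same 0–1 scale. The sketch below is a generic agreement check, not something described in the paper; the function name, the `tolerance` value, and the assumption that human ratings exist are all hypothetical.

```python
from typing import Mapping


def judge_human_disagreement(
    judge: Mapping[str, float],
    human: Mapping[str, float],
    tolerance: float = 0.25,  # assumed cutoff for "worth reviewing"
) -> tuple[float, list[str]]:
    """Return (mean absolute difference, prompt ids differing by more than tolerance).

    Only prompts rated by both the automated judge and a human are compared.
    """
    shared = sorted(judge.keys() & human.keys())
    if not shared:
        raise ValueError("no prompts were rated by both judge and human")
    diffs = {p: abs(judge[p] - human[p]) for p in shared}
    mean_abs_diff = sum(diffs.values()) / len(diffs)
    flagged = [p for p in shared if diffs[p] > tolerance]
    return mean_abs_diff, flagged
```

A rising mean difference, or flagged prompts clustering in one scenario, would suggest the judge is systematically off there and that scores for that scenario deserve less trust.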
When Not To Use
When you need real-time, low-latency defenses (the framework is for evaluation, not runtime mitigation).
When no reliable judge (like GPT-4) is available for automated scoring.
Failure Modes
Judge misclassification for novel or subtle jailbreak patterns outside the curated references.
Overfitting defenses to the ground-truth set while missing new attack phrasing.

