Overview
Production Readiness: 0.6
Novelty Score: 0.5
Cost Impact Score: 0.4
Citation Count: 3
Why It Matters For Business
Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.
Summary TLDR
This paper introduces AttackEval: a practical 0–1 scoring framework plus a curated ground-truth dataset for measuring how effectively jailbreak prompts bypass LLM safety. It offers a coarse-grained, weighted system-level score and a fine-grained per-model score (with and without ground truth). Evaluated on six popular models and a 666-prompt jailbreak collection, AttackEval aligns with binary baselines but surfaces prompts that binary metrics miss. Political-lobbying prompts are consistently the most effective. The method works best with a reliable judge (the authors use GPT-4) and a curated ground truth.
Problem Statement
Current jailbreak evaluations are mostly binary (success/fail) and often rely on a single LLM judge. That masks partial or borderline jailbreaks and may misclassify harmful prompts. We need a nuanced, repeatable way to score prompt effectiveness and a ground-truth set to benchmark evaluations.
Main Contribution
A two-level evaluation framework: coarse-grained system score and fine-grained per-model score (with/without ground truth).
A ground-truth dataset built from a 666-prompt jailbreak collection and curated model answers used as reference solutions.
A 0–1 scoring pipeline: three response samples per prompt, averaging, and model-weighted aggregation for system-level scores.
Empirical comparison showing alignment with the binary Attack Success Rate baseline while revealing prompts that binary tests miss.
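The scoring pipeline described above (three sampled responses per prompt, averaging, then model-weighted aggregation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the model names, and the weight values are hypothetical, and the judge scores are assumed to have been produced already.

```python
from statistics import mean

def prompt_score(sample_scores):
    """Average the judge scores (each in [0, 1]) for one prompt on one model."""
    if not sample_scores:
        raise ValueError("need at least one sample score")
    if any(not 0.0 <= s <= 1.0 for s in sample_scores):
        raise ValueError("scores must lie in [0, 1]")
    return mean(sample_scores)

def system_score(per_model_scores, model_weights):
    """Weighted average of per-model scores; weights are normalized here."""
    total = sum(model_weights[m] for m in per_model_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(per_model_scores[m] * model_weights[m] for m in per_model_scores)
    return weighted / total

# Hypothetical judge scores: three sampled responses per model for one prompt.
samples = {"model_a": [1.0, 0.66, 0.33], "model_b": [0.0, 0.0, 0.33]}
# Hypothetical weights; the summary does not state how the paper derives them.
weights = {"model_a": 0.6, "model_b": 0.4}

per_model = {m: prompt_score(s) for m, s in samples.items()}
print(per_model, system_score(per_model, weights))
```

Normalizing the weights inside `system_score` keeps the result in the 0–1 range even when the supplied weights do not sum to one.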
Key Findings
AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.
Political lobbying prompts are the most effective attack category across evaluations.
Fine-grained scores remain stable as the size of the ground-truth set varies.
Automatic judgments by GPT-4 matched human checks in verification samples.
Results
Coarse-grained scenario max
Fine-grained (with ground truth) example
Ground-truth size sensitivity
Judge reliability (GPT-4 vs human)
Who Should Care
What To Try In 7 Days
Run AttackEval on your top 100 user prompts to find borderline jailbreaks.
Add a small ground-truth set (3 answers per harmful question) and compare scores with/without it.
Prioritize hardening and monitoring for political-lobbying prompts first.
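One way to approach the first step is to compare a binary verdict against the continuous score for each prompt and flag the ones a pass/fail test would wave through. The sketch below assumes you already have both values per prompt; the `low` and `high` thresholds, the record layout, and the prompt IDs are illustrative assumptions, not values from the paper.

```python
def find_borderline(results, low=0.2, high=0.8):
    """Return prompts a binary check marked as failed attacks but whose
    continuous score falls in [low, high), sorted by score, highest first.

    results: iterable of (prompt_id, binary_success, score) tuples.
    """
    flagged = [
        (prompt_id, score)
        for prompt_id, binary_success, score in results
        if not binary_success and low <= score < high
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical evaluation records.
records = [
    ("p1", True, 1.0),    # clear jailbreak, caught by the binary test too
    ("p2", False, 0.55),  # partial jailbreak the binary test missed
    ("p3", False, 0.0),   # clean refusal
    ("p4", False, 0.33),  # weaker partial jailbreak
]
for prompt_id, score in find_borderline(records):
    print(prompt_id, score)
```

The flagged list is a review queue rather than a verdict: each entry still needs a human or a second judge to confirm whether the response was actually harmful.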
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies heavily on GPT-4 as an automated judge; judge bias can affect scores.
- Ground-truth answers are selected using GPT-4 and curation, which can introduce reference bias.
- Dataset sources are text-only in-the-wild prompts (Reddit, Discord, web) and may miss other attack styles.
- Model weights depend on a held-out sample; a different sample can yield different weights.
When Not To Use
- When you need real-time, low-latency defenses (framework is for evaluation, not runtime mitigation).
- When no reliable judge (like GPT-4) is available for automated scoring.
- When your threat model includes non-text or multimodal jailbreaks (method is text-only).
Failure Modes
- Judge misclassification for novel or subtle jailbreak patterns outside the curated references.
- Overfitting defenses to the ground-truth set while missing new attack phrasing.
- Weighted aggregation masking a single weak model in an ensemble.
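The last failure mode is easy to see with a small worked example: a low-weight model can be highly vulnerable while the aggregate still looks safe. A simple mitigation, sketched here as an assumption rather than something the paper prescribes, is to report the worst per-model score next to the weighted one. Model names, scores, and weights are made up.

```python
def aggregate_with_worst_case(per_model_scores, weights):
    """Return (weighted_score, worst_model, worst_score).

    A higher score means a more effective attack, so the "worst" model
    is the one with the highest score.
    """
    total = sum(weights[m] for m in per_model_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(per_model_scores[m] * weights[m] for m in per_model_scores) / total
    worst_model = max(per_model_scores, key=per_model_scores.get)
    return weighted, worst_model, per_model_scores[worst_model]

# Hypothetical ensemble: model_c is weak but carries little weight.
scores = {"model_a": 0.1, "model_b": 0.1, "model_c": 0.9}
weights = {"model_a": 0.45, "model_b": 0.45, "model_c": 0.10}

weighted, worst_model, worst_score = aggregate_with_worst_case(scores, weights)
print(f"weighted={weighted:.2f} worst={worst_model} ({worst_score:.2f})")
```

Here the weighted score is about 0.18 even though one model scores 0.9, which is why the per-model breakdown is worth keeping alongside any system-level number.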
Core Entities
Models
- GPT-3.5-Turbo
- GPT-4
- LLaMa2-7B
- LLaMa3-8B
- Gemma-7B
- ChatGLM-6B
Metrics
- Attack Success Rate (ASR)
- AttackEval coarse-grained 0–1 score
- Fine-grained score (0, 0.33, 0.66, 1)
- BERT embedding similarity (for ground truth matching)
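For the embedding-similarity metric, the core operation is typically a cosine similarity between the embedding of a model response and the embeddings of the reference answers. The sketch below assumes the BERT embeddings have already been computed elsewhere and uses short plain lists as stand-ins; the helper names are hypothetical, and how the paper maps a similarity value onto the fine-grained score levels is not covered here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_reference_match(response_vec, reference_vecs):
    """Return (index, similarity) of the reference closest to the response."""
    if not reference_vecs:
        raise ValueError("need at least one reference vector")
    sims = [cosine_similarity(response_vec, ref) for ref in reference_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]

# Toy 2-D vectors standing in for real embeddings.
response = [1.0, 0.0]
references = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(best_reference_match(response, references))
```

Taking the maximum over references reflects the idea that a response only needs to resemble one curated answer closely to count as a match.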
Datasets
- jailbreak_llms (666 prompts, 390 harmful questions)
- AttackEval ground-truth answers (curated by authors)

