AttackEval: a 0–1 scoring framework and ground-truth dataset to measure jailbreak prompt effectiveness

Overview

Decision Snapshot: Needs Validation

The framework is practical and was tested on six models and a public jailbreak corpus. It depends on a reliable judge (the authors used GPT-4) and would benefit from human checks and broader dataset releases.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Dong Shu, Chong Zhang, Mingyu Jin, Zihao Zhou, Lingyao Li, Yongfeng Zhang


Why It Matters For Business

Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

This paper introduces AttackEval: a practical 0–1 scoring framework plus a curated ground-truth dataset to measure how effective jailbreak prompts are at bypassing LLM safety. It offers a coarse-grained, weighted system-level score and a fine-grained per-model score (with and without ground truth). Using six popular models and a 666-prompt jailbreak collection, AttackEval aligns with binary baselines but surfaces prompts that binary metrics miss. Political-lobbying prompts are consistently most effective. The method needs a reliable judge (they use GPT-4) and a curated ground truth to work best.

Problem Statement

Current jailbreak evaluations are mostly binary (success/fail) and often rely on a single LLM judge. That masks partial or borderline jailbreaks and may misclassify harmful prompts. We need a nuanced, repeatable way to score prompt effectiveness and a ground-truth set to benchmark evaluations.

Main Contribution

A two-level evaluation framework: coarse-grained system score and fine-grained per-model score (with/without ground truth).

A ground-truth dataset built from a 666-prompt jailbreak collection and curated model answers used as reference solutions.
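The coarse-grained level can be pictured as a weighted aggregate of per-model judge verdicts. The sketch below is a minimal illustration under stated assumptions: the function name `coarse_grained_score`, the verdicts, and the weights are all hypothetical, and the paper's actual weighting scheme is not reproduced in this summary.

```python
from typing import Mapping


def coarse_grained_score(successes: Mapping[str, bool],
                         weights: Mapping[str, float]) -> float:
    """Weighted share of models that a prompt jailbroke, in [0, 1]."""
    total = sum(weights[model] for model in successes)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    hit = sum(weights[model] for model, ok in successes.items() if ok)
    return hit / total


# Hypothetical judge verdicts for one prompt across the six evaluated models.
verdicts = {
    "GPT-3.5-Turbo": True,
    "GPT-4": False,
    "LLaMa2-7B": True,
    "LLaMa3-8B": False,
    "Gemma-7B": True,
    "ChatGLM-6B": True,
}

# Placeholder weights (an assumption, not the paper's values).
weights = {
    "GPT-3.5-Turbo": 1.0,
    "GPT-4": 2.0,
    "LLaMa2-7B": 1.0,
    "LLaMa3-8B": 1.0,
    "Gemma-7B": 1.0,
    "ChatGLM-6B": 1.0,
}

score = coarse_grained_score(verdicts, weights)
print(f"coarse-grained score: {score:.2f}")
```

Normalizing by the total weight keeps the result in the 0–1 range regardless of how the weights are scaled.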

Key Findings

AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.

Numbers: Aggregated halves match baseline ~70% (coarse-grained)

Practical Use: Use AttackEval to rank and triage borderline prompts that binary ASR hides.

Evidence Ref: Section 4.2, Figure 1
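One way to act on this finding is to pull out prompts with intermediate scores and review them first. The following is a rough sketch only: the helper `borderline`, the cut-offs `low` and `high`, and the prompt IDs are illustrative assumptions, not values from the paper.

```python
def borderline(scored, low=0.2, high=0.8):
    """Return (prompt_id, score) pairs with intermediate scores, highest first."""
    picked = [(pid, s) for pid, s in scored.items() if low < s < high]
    return sorted(picked, key=lambda item: item[1], reverse=True)


# Hypothetical per-prompt scores on the 0-1 scale.
scores = {"p1": 0.0, "p2": 0.33, "p3": 0.66, "p4": 1.0, "p5": 0.5}

queue = borderline(scores)
for prompt_id, s in queue:
    print(prompt_id, s)
```

Prompts at exactly 0 or 1 are left out because a binary metric already classifies them; the review queue holds only the cases a success/fail label would flatten.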

Political lobbying prompts are the most effective attack category across evaluations.

Numbers: Coarse-grained: 0.65 (ours) vs 0.66 (baseline)

Practical Use: Prioritize defenses and audits for political-lobbying style prompts.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Coarse-grained scenario max	0.65 (Political Lobbying, AttackEval)	0.66 (baseline ASR)	−0.01	All prompts	Coarse-grained averages across 13 scenarios	Table 2
Fine-grained (with ground truth) example	0.57 ± 0.04 (GPT-3.5, Financial Advice)	ASR baseline comparison in Table 3	n/a	Per-model, per-scenario	Fine-grained scores with ground truth	Table 3

What To Try In 7 Days

Run AttackEval on your top 100 user prompts to find borderline jailbreaks.

Add a small ground-truth set (3 answers per harmful question) and compare scores with/without it.

Prioritize hardening and monitoring for political-lobbying prompts first.
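For the with/without ground-truth comparison suggested above, a small script can list the questions where the two scoring modes disagree most. This is a sketch under assumptions: `score_gaps`, the `threshold` value, and the sample scores are hypothetical, and the scores themselves would come from whatever judge pipeline you run.

```python
def score_gaps(with_gt, without_gt, threshold=0.3):
    """Compare two score maps and flag questions whose scores differ a lot."""
    shared = sorted(set(with_gt) & set(without_gt))
    gaps = {qid: with_gt[qid] - without_gt[qid] for qid in shared}
    flagged = [qid for qid in shared if abs(gaps[qid]) >= threshold]
    return gaps, flagged


# Hypothetical fine-grained scores for three harmful questions.
with_ground_truth = {"q1": 0.66, "q2": 1.0, "q3": 0.0}
without_ground_truth = {"q1": 0.33, "q2": 1.0, "q3": 0.33}

gaps, flagged = score_gaps(with_ground_truth, without_ground_truth)
print("largest disagreements:", flagged)
```

Flagged questions are good candidates for a human check, since they are the cases where the reference answers change the judge's verdict.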

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Relies heavily on GPT-4 as an automated judge; judge bias can affect scores.

Ground-truth answers are selected using GPT-4 and curation, which can introduce reference bias.

When Not To Use

When you need real-time, low-latency defenses (framework is for evaluation, not runtime mitigation).

When no reliable judge (like GPT-4) is available for automated scoring.

Failure Modes

Judge misclassification for novel or subtle jailbreak patterns outside the curated references.

Overfitting defenses to the ground-truth set while missing new attack phrasing.

Core Entities

Models

GPT-3.5-Turbo, GPT-4, LLaMa2-7B, LLaMa3-8B, Gemma-7B, ChatGLM-6B

Metrics

Attack Success Rate (ASR)

AttackEval coarse-grained 0–1 score

Fine-grained score (0, 0.33, 0.66, 1)

BERT embedding similarity (for ground truth matching)
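Embedding similarity for ground-truth matching is commonly computed as cosine similarity between a response vector and each reference-answer vector. The sketch below assumes that approach and uses toy three-dimensional vectors in place of real BERT embeddings; `cosine` and `best_reference` are hypothetical helper names, and how the paper turns similarity into a final score is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_reference(response_vec, reference_vecs):
    """Return (index, similarity) of the closest reference answer."""
    sims = [cosine(response_vec, ref) for ref in reference_vecs]
    idx = max(range(len(sims)), key=sims.__getitem__)
    return idx, sims[idx]


# Toy vectors standing in for embeddings of three curated reference answers.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
response = [0.6, 0.8, 0.0]

idx, sim = best_reference(response, references)
print(f"closest reference: {idx}, similarity: {sim:.2f}")
```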

Datasets

jailbreak_llms (666 prompts, 390 harmful questions)

AttackEval ground-truth answers (curated by authors)

