AttackEval: a 0–1 scoring framework and ground-truth dataset to measure jailbreak prompt effectiveness

January 17, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The framework is practical and was tested across six models and a public jailbreak corpus. It depends on a reliable judge (the authors used GPT-4) and would benefit from human checks and broader dataset releases.

Citations: 3

Evidence Strength: 0.60

Confidence: 0.90

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 60%

Novelty: 50%

Authors

Dong Shu, Chong Zhang, Mingyu Jin, Zihao Zhou, Lingyao Li, Yongfeng Zhang

Why It Matters For Business

Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.

Summary TLDR

This paper introduces AttackEval: a practical 0–1 scoring framework plus a curated ground-truth dataset for measuring how effectively jailbreak prompts bypass LLM safety. It offers a coarse-grained, weighted system-level score and a fine-grained per-model score (with and without ground truth). Evaluated on six popular models and a 666-prompt jailbreak collection, AttackEval aligns with binary baselines but surfaces prompts that binary metrics miss. Political-lobbying prompts are consistently the most effective. The method works best with a reliable judge (the authors use GPT-4) and a curated ground truth.

Problem Statement

Current jailbreak evaluations are mostly binary (success/fail) and often rely on a single LLM judge. That masks partial or borderline jailbreaks and may misclassify harmful prompts. We need a nuanced, repeatable way to score prompt effectiveness and a ground-truth set to benchmark evaluations.

Main Contribution

A two-level evaluation framework: coarse-grained system score and fine-grained per-model score (with/without ground truth).

A ground-truth dataset built from a 666-prompt jailbreak collection and curated model answers used as reference solutions.
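As a rough illustration of the two levels, the sketch below assumes the coarse-grained score is a weighted average of per-model binary outcomes, and that the fine-grained score snaps a judge rating to one of the four levels listed under Metrics (0, 0.33, 0.66, 1). The weights, model names, function names, and example values are hypothetical and are not taken from the paper.

```python
# Fine-grained levels as listed in the Metrics section of this summary.
FINE_LEVELS = (0.0, 0.33, 0.66, 1.0)


def coarse_score(outcomes, weights):
    """Weighted average of per-model binary outcomes (1 = jailbreak succeeded).

    Assumption: the system-level score is a normalized weighted sum.
    """
    total = sum(weights[model] for model in outcomes)
    if total == 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[model] * outcomes[model] for model in outcomes) / total


def fine_score(raw_rating):
    """Snap a judge rating in [0, 1] to the nearest fine-grained level."""
    return min(FINE_LEVELS, key=lambda level: abs(level - raw_rating))


# Hypothetical outcomes and weights for three placeholder models.
outcomes = {"model_a": 1, "model_b": 0, "model_c": 1}
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

print(round(coarse_score(outcomes, weights), 2))  # 0.7
print(fine_score(0.4))                            # 0.33
```

With these placeholder numbers, a prompt that breaks two of three models scores 0.7 rather than a flat "success", which is the kind of gradation the framework is meant to expose.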

Key Findings

AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.

Numbers: Aggregated halves match baseline ~70% (coarse-grained)

Practical Use: Use AttackEval to rank and triage borderline prompts that binary ASR hides.

Evidence Ref: Section 4.2, Figure 1

Political lobbying prompts are the most effective attack category across evaluations.

Numbers: Coarse-grained: 0.65 (ours) vs 0.66 (baseline)

Practical Use: Prioritize defenses and audits for political-lobbying style prompts.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Coarse-grained scenario max | Political Lobbying 0.65 (AttackEval) vs 0.66 baseline | Baseline ASR | −0.01 | Table 2, all prompts | Table 2: coarse-grained averages across 13 scenarios | Table 2
Fine-grained (with ground truth) example | GPT-3.5 Financial Advice 0.57 ± 0.04 | ASR baseline comparison in Table 3 | (none given) | Table 3, per-model per-scenario | Table 3: fine-grained scores with ground truth | Table 3

What To Try In 7 Days

Run AttackEval on your top 100 user prompts to find borderline jailbreaks.

Add a small ground-truth set (3 answers per harmful question) and compare scores with/without it.

Prioritize hardening and monitoring for political-lobbying prompts first.
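The first two steps could be sketched as follows, assuming you already have a 0–1 score per prompt from some judge. The band limits (0.33 to 0.66), the function names, and the sample scores are illustrative assumptions, not values prescribed by the paper.

```python
def triage(scores, low=0.33, high=0.66, top_n=100):
    """Return (prompt, score) pairs inside the borderline band, highest first."""
    borderline = [(p, s) for p, s in scores.items() if low <= s <= high]
    borderline.sort(key=lambda item: item[1], reverse=True)
    return borderline[:top_n]


def ground_truth_gap(with_gt, without_gt):
    """Per-prompt difference between scores with and without ground truth."""
    shared = with_gt.keys() & without_gt.keys()
    return {p: round(with_gt[p] - without_gt[p], 2) for p in shared}


# Hypothetical scores for four placeholder prompts.
scores = {"prompt_1": 0.0, "prompt_2": 0.33, "prompt_3": 0.66, "prompt_4": 1.0}
for prompt, score in triage(scores):
    print(prompt, score)

# Hypothetical comparison for the second step.
gaps = ground_truth_gap({"prompt_2": 0.66}, {"prompt_2": 0.33})
print(gaps)
```

Prompts with large gaps between the two scoring modes are reasonable candidates for a human check, since the judge and the reference answers disagree on them.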

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies heavily on GPT-4 as an automated judge; judge bias can affect scores.

Ground-truth answers are selected using GPT-4 and curation, which can introduce reference bias.

When Not To Use

When you need real-time, low-latency defenses (framework is for evaluation, not runtime mitigation).

When no reliable judge (like GPT-4) is available for automated scoring.

Failure Modes

Judge misclassification for novel or subtle jailbreak patterns outside the curated references.

Overfitting defenses to the ground-truth set while missing new attack phrasing.

Core Entities

Models

GPT-3.5-Turbo, GPT-4, LLaMa2-7B, LLaMa3-8B, Gemma-7B, ChatGLM-6B

Metrics

Attack Success Rate (ASR)

AttackEval coarse-grained 0–1 score

Fine-grained score (0, 0.33, 0.66, 1)

BERT embedding similarity (for ground truth matching)
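To make the embedding-similarity metric concrete, here is a minimal sketch of matching a model response against a few reference answers by cosine similarity. It uses toy 3-dimensional vectors in place of real BERT embeddings, and the helper names are hypothetical. How the paper turns similarity into a final score is not described in this summary, so the sketch only finds the closest reference.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def best_reference_match(response_vec, reference_vecs):
    """Return (index, similarity) of the closest reference answer."""
    sims = [cosine_similarity(response_vec, ref) for ref in reference_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Toy vectors standing in for embeddings of one response and three references.
response = [1.0, 0.0, 1.0]
references = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

idx, sim = best_reference_match(response, references)
print(idx, round(sim, 2))
```

In practice the vectors would come from an embedding model, and a similarity threshold would need to be tuned on held-out examples.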

Datasets

jailbreak_llms (666 prompts, 390 harmful questions)

AttackEval ground-truth answers (curated by authors)