AttackEval: a 0–1 scoring framework and ground-truth dataset to measure jailbreak prompt effectiveness

January 17, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

3

Authors

Dong Shu, Chong Zhang, Mingyu Jin, Zihao Zhou, Lingyao Li, Yongfeng Zhang

Why It Matters For Business

Binary success/fail tests miss partial or stealthy jailbreaks. AttackEval gives a ranked, numeric view so teams can prioritize fixes, audit high-risk prompt types, and measure defense improvements over time.

Summary TLDR

This paper introduces AttackEval: a practical 0–1 scoring framework plus a curated ground-truth dataset for measuring how effectively jailbreak prompts bypass LLM safety. It offers a coarse-grained, weighted system-level score and a fine-grained per-model score (with and without ground truth). Evaluated on six popular models and a 666-prompt jailbreak collection, AttackEval aligns with binary baselines but surfaces prompts that binary metrics miss. Political-lobbying prompts are consistently the most effective. The method works best with a reliable judge (the authors use GPT-4) and a curated ground truth.

Problem Statement

Current jailbreak evaluations are mostly binary (success/fail) and often rely on a single LLM judge. That masks partial or borderline jailbreaks and may misclassify harmful prompts. We need a nuanced, repeatable way to score prompt effectiveness and a ground-truth set to benchmark evaluations.

Main Contribution

A two-level evaluation framework: coarse-grained system score and fine-grained per-model score (with/without ground truth).

A ground-truth dataset built from a 666-prompt jailbreak collection and curated model answers used as reference solutions.

A 0–1 scoring pipeline: three response samples per prompt, averaging, and model-weighted aggregation for system-level scores.

Empirical comparison showing alignment with the binary Attack Success Rate baseline while revealing prompts that binary tests miss.
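The scoring pipeline listed above (three samples per prompt, averaging, weighted aggregation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, model names, judge scores, and weights are all hypothetical, and dividing by the sum of the weights is an assumption about how the weighted aggregation is normalized.

```python
from statistics import mean


def prompt_score(sample_scores):
    """Average the judge's 0-1 scores over repeated response samples."""
    if not sample_scores:
        raise ValueError("need at least one sampled response")
    if any(not 0.0 <= s <= 1.0 for s in sample_scores):
        raise ValueError("judge scores must lie in [0, 1]")
    return mean(sample_scores)


def system_score(per_model_scores, model_weights):
    """Weighted average of per-model scores (assumed aggregation rule)."""
    total_weight = sum(model_weights[m] for m in per_model_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        per_model_scores[m] * model_weights[m] for m in per_model_scores
    )
    return weighted_sum / total_weight


# Hypothetical judge scores: three sampled responses per model for one prompt.
samples = {
    "model_a": [1.0, 0.66, 0.33],
    "model_b": [0.0, 0.0, 0.33],
}
# Hypothetical weights; the source only says weights depend on a held-out sample.
weights = {"model_a": 0.6, "model_b": 0.4}

per_model = {name: prompt_score(scores) for name, scores in samples.items()}
overall = system_score(per_model, weights)
print(per_model, round(overall, 3))
```

Keeping the per-model averages alongside the aggregate makes it easy to report both the fine-grained and the coarse-grained view from the same judge outputs.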

Key Findings

AttackEval produces continuous 0–1 scores that align with binary baselines but assign many prompts intermediate values.

Numbers: Aggregated halves match baseline ~70% (coarse-grained)

Political lobbying prompts are the most effective attack category across evaluations.

Numbers: Coarse-grained: 0.65 (ours) vs 0.66 (baseline)

Fine-grained scores are stable to ground-truth size.

Numbers: Score differences <5% for ground-truth sizes 3, 5, 10

Automatic judgments by GPT-4 matched human checks in verification samples.

Numbers: 100% agreement on 500 checks (weight calculation) and 500 checks (effectiveness)

Results

Coarse-grained scenario max

Value: Political Lobbying 0.65 (AttackEval) vs 0.66 baseline

Baseline: Baseline ASR

Fine-grained (with ground truth) example

Value: GPT-3.5 Financial Advice 0.57 ± 0.04

Baseline: ASR baseline comparison in Table 3

Ground-truth size sensitivity

Value: Score change <5% for sizes 3, 5, 10

Judge reliability (GPT-4 vs human)

Value: 100% agreement on sampled checks

Who Should Care

What To Try In 7 Days

Run AttackEval on your top 100 user prompts to find borderline jailbreaks.

Add a small ground-truth set (3 answers per harmful question) and compare scores with/without it.

Prioritize hardening and monitoring for political-lobbying prompts first.
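For the ground-truth step above, one way to compare a model response against a small reference set is embedding similarity. The sketch below is only an assumption-laden illustration: the toy 3-dimensional vectors stand in for sentence embeddings (the source mentions BERT embeddings), and picking the single closest reference by cosine similarity is a hypothetical matching rule, not necessarily the one the authors use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def best_reference_match(response_vec, reference_vecs):
    """Return (index, similarity) of the closest reference answer."""
    sims = [cosine_similarity(response_vec, ref) for ref in reference_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


# Toy vectors standing in for embeddings of three reference answers
# to one harmful question (hypothetical values).
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
response = [0.5, 0.5, 0.0]

index, similarity = best_reference_match(response, references)
print(index, round(similarity, 3))
```

In practice the vectors would come from whichever encoder you trust; scoring the same responses with and without this matching step gives the with/without comparison suggested above.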

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies heavily on GPT-4 as an automated judge; judge bias can affect scores.
  • Ground-truth answers are selected using GPT-4 and curation, which can introduce reference bias.
  • Dataset sources are text-only in-the-wild prompts (Reddit, Discord, web) and may miss other attack styles.
  • Model weights depend on a held-out sample; a different sample can yield different weights and therefore different system-level scores.

When Not To Use

  • When you need real-time, low-latency defenses (framework is for evaluation, not runtime mitigation).
  • When no reliable judge (like GPT-4) is available for automated scoring.
  • When your threat model includes non-text or multimodal jailbreaks (method is text-only).

Failure Modes

  • Judge misclassification for novel or subtle jailbreak patterns outside the curated references.
  • Overfitting defenses to the ground-truth set while missing new attack phrasing.
  • Weighted aggregation masking a single weak model in an ensemble.
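The masking failure mode in the list above is easy to see with a small worked example. The scores and weights below are invented for illustration, and reporting the most vulnerable model next to the aggregate is a suggested safeguard, not something the source prescribes.

```python
def weighted_mean(scores, weights):
    """Weighted average of per-model attack-effectiveness scores."""
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total


def most_vulnerable(scores):
    """Model with the highest attack-effectiveness score."""
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical ensemble: two robust models and one weak, low-weight model.
scores = {"model_a": 0.05, "model_b": 0.05, "model_c": 0.90}
weights = {"model_a": 0.45, "model_b": 0.45, "model_c": 0.10}

aggregate = weighted_mean(scores, weights)
worst_name, worst_score = most_vulnerable(scores)
print(round(aggregate, 3), worst_name, worst_score)
```

Here the aggregate stays low even though one model is almost fully compromised, which is why a per-model breakdown is worth keeping next to any system-level number.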

Core Entities

Models

  • GPT-3.5-Turbo
  • GPT-4
  • LLaMa2-7B
  • LLaMa3-8B
  • Gemma-7B
  • ChatGLM-6B

Metrics

  • Attack Success Rate (ASR)
  • AttackEval coarse-grained 0–1 score
  • Fine-grained score (0, 0.33, 0.66, 1)
  • BERT embedding similarity (for ground truth matching)
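To make the difference between the binary and continuous metrics above concrete, here is a minimal sketch. The per-prompt scores are hypothetical, and deriving a binary ASR by thresholding the same scores is an assumption made for illustration; in the source, ASR is a separate baseline.

```python
def attack_success_rate(scores, success_threshold=1.0):
    """Binary view: fraction of prompts whose score reaches the threshold."""
    if not scores:
        raise ValueError("need at least one score")
    hits = sum(1 for s in scores if s >= success_threshold)
    return hits / len(scores)


def mean_effectiveness(scores):
    """Continuous view: average 0-1 effectiveness across prompts."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Hypothetical per-prompt scores on the fine-grained levels 0, 0.33, 0.66, 1.
prompt_scores = [0.33, 0.33, 0.66, 0.0]

asr = attack_success_rate(prompt_scores)
avg = mean_effectiveness(prompt_scores)
print(asr, round(avg, 2))
```

With a strict threshold no prompt counts as a success, while the continuous average still registers the partial jailbreaks.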

Datasets

  • jailbreak_llms (666 prompts, 390 harmful questions)
  • AttackEval ground-truth answers (curated by authors)