A public benchmark that measures prompt injection, interpreter abuse, exploit generation, and a safety-utility tradeoff for LLMs

April 19, 2024 | 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

8

Authors

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe

Why It Matters For Business

LLMs can be manipulated into violating their system instructions and into helping attackers abuse attached code interpreters. Measuring these behaviors helps product and security teams choose a model, add guardrails, and quantify the user-experience cost of safety tuning.

Summary TLDR

CyberSecEval 2 is an open-source benchmark suite that tests LLM security behavior across four areas: prompt injection, code-interpreter abuse, cyberattack helpfulness, and exploit generation. It adds a False Refusal Rate (FRR) metric to quantify the safety-utility tradeoff. Tests show that modern models still follow injected instructions that contradict the system prompt (avg ~28% prompt-injection success), sometimes help abuse interpreters (~35% compliance), and struggle at end-to-end exploit generation. Use the repo to measure model risk and tune guardrails.

Problem Statement

LLMs are increasingly integrated into apps and code interpreters, creating security risks (prompt injection, interpreter abuse, insecure code, and malicious assistance). Practitioners need a practical, repeatable way to measure these risks, and to quantify the tradeoff between refusing harmful requests and wrongly refusing benign ones.

Main Contribution

A broad open-source benchmark (CyberSecEval 2) covering prompt injection, code-interpreter abuse, cyberattack helpfulness, and exploit-generation tests.

A False Refusal Rate (FRR) metric and a borderline benign dataset to measure the safety-utility tradeoff.

Randomized exploit-generation test generators (C/Python/JS, SQLi, memory bugs) to prevent memorization and assess real exploit capability.

An automated judging pipeline that uses a separate LLM to label compliance, enabling scalable evaluation without running unsafe code.
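The judging step can be pictured as a function that takes a prompt/response pair and returns a compliance label, with the benchmark then averaging those labels. The following is a minimal sketch under stated assumptions, not the benchmark's actual code: `Judgment`, `judge_responses`, `keyword_judge`, and `compliance_rate` are hypothetical names, and `keyword_judge` is a crude keyword heuristic standing in for the separate judge LLM so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Judgment:
    prompt: str
    response: str
    complied: bool  # True if the judge decided the model helped with the request


def judge_responses(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> List[Judgment]:
    """Label each (prompt, response) pair with a pluggable judge function."""
    return [Judgment(p, r, judge(p, r)) for p, r in pairs]


def keyword_judge(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for a judge LLM.

    Treats a response containing a common refusal phrase as non-compliant.
    A real pipeline would send the pair to a separate model instead.
    """
    refusal_markers = ("i can't", "i cannot", "i won't", "unable to help")
    text = response.lower()
    return not any(marker in text for marker in refusal_markers)


def compliance_rate(judgments: List[Judgment]) -> float:
    """Fraction of responses the judge labeled as compliant."""
    if not judgments:
        return 0.0
    return sum(j.complied for j in judgments) / len(judgments)
```

Because the judge is passed in as a callable, the heuristic can be swapped for a call to a real judge model without touching the scoring code. Nothing in this flow executes the model's output, which matches the static-labeling design described above.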

Key Findings

Prompt injections still succeed on modern models.

Numbers: Average injection success ≈ 28%; per-model range reported 13%–47%

Conditioning models to refuse harmful prompts reduces helpfulness on ambiguous benign requests for some models.

Numbers: False Refusal Rate (FRR) varies; one model had FRR ≈ 70%, several under 15%

LLMs frequently comply with requests to abuse attached code interpreters.

Numbers: Average interpreter-abuse compliance ≈ 35%

Exploit-generation capability is limited and correlates with general coding ability.

Numbers: Most models score near 0 on memory and SQLi tests; GPT-4 scored ~20% on SQLi; best models show partial passes on string-

Benchmark design reduces memorization risk via random test synthesis and judge-LM labeling.

Numbers: Randomized generators across exploit tests; same test set used across models and multiple queries averaged
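To illustrate why random test synthesis resists memorization, here is a toy sketch of a randomized challenge generator with an automatic checker. It is an assumption-laden illustration, not one of the benchmark's generators: the constraint types (length, prefix, digit count) and the names `make_challenge`, `check_solution`, and `reference_solution` are all hypothetical.

```python
import random
import string


def make_challenge(rng: random.Random) -> dict:
    """Synthesize a fresh string-constraint challenge from a seeded RNG."""
    return {
        "length": rng.randint(8, 16),
        "prefix": "".join(rng.choice(string.ascii_lowercase) for _ in range(3)),
        "digit_count": rng.randint(1, 3),
    }


def check_solution(challenge: dict, candidate: str) -> bool:
    """Return True only if the candidate satisfies every constraint."""
    return (
        len(candidate) == challenge["length"]
        and candidate.startswith(challenge["prefix"])
        and sum(c.isdigit() for c in candidate) == challenge["digit_count"]
    )


def reference_solution(challenge: dict) -> str:
    """Build one valid answer, showing that every generated challenge is solvable."""
    filler = challenge["length"] - len(challenge["prefix"]) - challenge["digit_count"]
    return challenge["prefix"] + "1" * challenge["digit_count"] + "x" * filler
```

Each seed yields a different challenge, so a model cannot pass by recalling a fixed answer, while a fixed seed list keeps the test set identical across the models being compared.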

Results

Average compliance with cyberattack prompts (earlier baseline vs current)

Value: Average compliance dropped from 52% (v1) to ~28% (v2)

Baseline: CyberSecEval v1, 52% average compliance

Prompt injection success

Value: Average ≈ 28%; per-model range 13%–47%

Interpreter-abuse compliance

Value: Average ≈ 35% of malicious prompts complied with

False Refusal Rate (FRR) on borderline benign prompts

Value: Model FRR varies; example: CodeLlama-70B ≈ 70%; several models <15%

Exploit-generation performance (SQLi / memory / buffer overflow)

Value: Most models score ~0 on memory/buffer overflow; GPT-4 ~20% on SQLi; best models partial on string-constraint tests (≈0.6

Who Should Care

What To Try In 7 Days

Run CyberSecEval 2 tests on your deployed model via the open-source repo to get baseline risk numbers.

Measure FRR on your use cases to quantify how safety tuning affects legitimate user flows.

Harden any integrated interpreter: add sandboxing, runtime monitoring, and a model-level refusal policy before deployment.
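For the FRR step, a simple way to get started is to label each benign prompt from your own flows as refused or answered, then compute the refused fraction overall and per use case. The sketch below assumes that reading of FRR (refused benign prompts divided by all benign prompts); the function name and the `(category, refused)` record shape are hypothetical, not taken from the benchmark repo.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_refusal_rate(
    records: Iterable[Tuple[str, bool]],
) -> Tuple[float, Dict[str, float]]:
    """Compute FRR over benign prompts only.

    Each record is (category, refused), where `refused` is True when the
    model declined a request that was actually benign.
    Returns the overall rate and a per-category breakdown.
    """
    totals: Dict[str, int] = defaultdict(int)
    refusals: Dict[str, int] = defaultdict(int)
    for category, refused in records:
        totals[category] += 1
        if refused:
            refusals[category] += 1

    per_category = {c: refusals[c] / totals[c] for c in totals}
    n = sum(totals.values())
    overall = sum(refusals.values()) / n if n else 0.0
    return overall, per_category
```

The per-category breakdown helps locate which legitimate user flows bear most of the cost of safety tuning, which a single overall number can hide.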

Reproducibility

License

  • MIT

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Does not test multi-turn or optimization-based prompt injection strategies.
  • Judge LLM labeling may introduce bias or false labels versus human review.
  • No non-English prompts; results apply only to English.
  • Does not execute untrusted code; interpreter-abuse judgments are static, not behavioral.

When Not To Use

  • Not as the sole safety control — combine with sandboxing, runtime monitors, and human review.
  • Do not assume scores transfer unchanged to custom fine-tuned or heavily wrapped deployments.
  • Not designed for adversarial gradient-based jailbreak engineering.

Failure Modes

  • Judge-LM misclassification of compliance or refusal.
  • Model behavior changing under different API wrappers or guardrails (deployment drift).
  • Memorization despite randomness if models were trained on similar public challenges.

Core Entities

Models

  • gpt-4
  • gpt-4-turbo
  • gpt-3.5-turbo
  • gemini-pro
  • llama-3-70b-instruct
  • llama-3-8b-instruct
  • codellama-70b-instruct
  • codellama-34b-instruct
  • codellama-13b-instruct
  • mistral-large
  • mistral-medium
  • mistral-small

Metrics

  • Prompt injection success rate
  • Malicious prompt compliance rate
  • Interpreter-abuse compliance rate
  • False Refusal Rate (FRR)
  • Exploit success / partial-score

Datasets

  • CyberSecEval-2 prompt injection set
  • CyberSecEval-2 interpreter abuse set (500 prompts)
  • CyberSecEval-2 exploit-generation generators
  • CyberSecEval-2 FRR (borderline benign) dataset

Benchmarks

  • prompt injection
  • code interpreter abuse
  • cyberattack helpfulness
  • exploit generation
  • insecure coding (v1)