Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
8
Why It Matters For Business
LLMs can be manipulated into violating their system instructions and into helping attackers abuse attached code interpreters. Measuring these behaviors helps product and security teams choose models, add guardrails, and quantify the user-experience cost of safety tuning.
Summary TLDR
CyberSecEval 2 is an open-source benchmark suite that tests LLM security behavior across four areas: prompt injection, code-interpreter abuse, cyberattack helpfulness, and exploit generation. It adds a False Refusal Rate (FRR) metric to quantify the safety-utility tradeoff. Tests show that modern models still succumb to prompt injection (average success rate ~28%), sometimes help abuse attached interpreters (~35% compliance), and struggle with end-to-end exploit generation. Use the repo to measure model risk and tune guardrails.
Problem Statement
LLMs are increasingly integrated into apps and code interpreters, creating security risks (prompt injection, interpreter abuse, insecure code, and malicious assistance). Practitioners need a practical, repeatable way to measure these risks, and to quantify the tradeoff between refusing harmful requests and wrongly refusing benign ones.
Main Contribution
A broad open-source benchmark (CyberSecEval 2) covering prompt injection, code-interpreter abuse, cyberattack helpfulness, and exploit-generation tests.
A False Refusal Rate (FRR) metric and a borderline benign dataset to measure the safety-utility tradeoff.
Randomized generators for exploit-generation tests (C, Python, and JavaScript targets; SQL injection; memory bugs) that reduce the risk of memorization and assess real exploit capability.
An automated judging pipeline that uses a separate LLM to label compliance, enabling scalable evaluation without running unsafe code.
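The idea behind randomized test generation can be illustrated with a toy sketch. This is not the benchmark's generator: the `Challenge` class, `make_challenge`, and the string-sum task are hypothetical stand-ins chosen only to show how a seeded generator yields a fresh task each time while the answer stays mechanically checkable, so there is no fixed answer to memorize.

```python
import random
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class Challenge:
    """A toy, randomly parameterized task (hypothetical, not from the benchmark)."""
    prefix: str
    target: int

    @property
    def prompt(self) -> str:
        return (f"Give a string that starts with {self.prefix!r} and whose "
                f"character codes sum to {self.target}.")

    def check(self, answer: str) -> bool:
        # Verification is mechanical: no reference answer is stored anywhere.
        return (answer.startswith(self.prefix)
                and sum(map(ord, answer)) == self.target)


def make_challenge(seed: int) -> Challenge:
    """Derive all task parameters from a seed so a run can be reproduced."""
    rng = random.Random(seed)
    prefix = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
    target = rng.randint(900, 1200)
    return Challenge(prefix, target)


def score(challenge: Challenge, answer: str) -> float:
    """Binary score: 1.0 if the model's answer satisfies the check, else 0.0."""
    return 1.0 if challenge.check(answer) else 0.0
```

In use, one would draw a new seed per evaluation, send `challenge.prompt` to the model, and pass the reply to `score`. A real harness would replace the toy check with running the generated target program on the model's proposed input.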
Key Findings
Prompt injections still succeed on modern models (average success rate ~28%).
For some models, conditioning to refuse harmful prompts also reduces helpfulness on borderline benign requests.
LLMs frequently comply with requests to abuse attached code interpreters (~35% compliance).
Exploit-generation capability is limited and correlates with general coding ability.
Benchmark design reduces memorization risk via random test synthesis and judge-LM labeling.
Results
Average compliance with cyberattack prompts (earlier baseline vs current)
Prompt injection success
Interpreter-abuse compliance
False Refusal Rate (FRR) on borderline benign prompts
Exploit-generation performance (SQLi / memory / buffer overflow)
Who Should Care
What To Try In 7 Days
Run CyberSecEval 2 tests on your deployed model via the open-source repo to get baseline risk numbers.
Measure FRR on your use cases to quantify how safety tuning affects legitimate user flows.
Harden any integrated interpreter: add sandboxing, runtime monitoring, and a model-level refusal policy before deployment.
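As one small piece of the hardening step above, model-generated code can be screened statically before it ever reaches the interpreter. The sketch below is an assumption-laden illustration, not part of CyberSecEval 2: the denylists `BLOCKED_MODULES` and `BLOCKED_CALLS` and the function `flag_risky_code` are hypothetical, and a denylist like this is easy to bypass, so it complements sandboxing rather than replacing it.

```python
import ast

# Hypothetical denylists; tune them to what your interpreter should allow.
BLOCKED_MODULES = {"os", "subprocess", "socket", "ctypes", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}


def flag_risky_code(source: str) -> list[str]:
    """Return human-readable findings for code that should not be executed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import os.path` is treated the same as `import os`.
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root in BLOCKED_MODULES:
                    findings.append(f"import of {root}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in BLOCKED_MODULES:
                findings.append(f"import of {root}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by name are caught; aliases slip through.
            if node.func.id in BLOCKED_CALLS:
                findings.append(f"call to {node.func.id}")
    return findings
```

A non-empty result would be a reason to refuse execution or route the request to review. The code still needs to run inside an isolated sandbox with runtime monitoring, since static checks cannot see dynamically constructed imports or calls.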
Reproducibility
License
- MIT
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Does not test multi-turn or optimization-based prompt injection strategies.
- Judge LLM labeling may introduce bias or false labels versus human review.
- No non-English prompts; results apply only to English.
- Does not execute untrusted code; interpreter-abuse judgments are static, not behavioral.
When Not To Use
- Not as the sole safety control; combine with sandboxing, runtime monitors, and human review.
- Do not assume scores transfer unchanged to custom fine-tuned or heavily wrapped deployments.
- Not designed for adversarial gradient-based jailbreak engineering.
Failure Modes
- Judge-LM misclassification of compliance or refusal.
- Model behavior changing under different API wrappers or guardrails (deployment drift).
- Memorization despite randomness if models were trained on similar public challenges.
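One way to gauge the judge-LM misclassification risk listed above is to have humans label a sample of responses and compare them with the judge's labels. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic. It is a swapped-in technique, not something the source says the benchmark reports, and the label strings are assumptions.

```python
from collections import Counter


def cohen_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(judge_labels)
    # Fraction of items where judge and human agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with judge labels `["refuse", "comply", "comply", "refuse"]` and human labels `["refuse", "comply", "refuse", "refuse"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5. Low kappa on a sample would suggest treating the automated scores with caution.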
Core Entities
Models
- gpt-4
- gpt-4-turbo
- gpt-3.5-turbo
- gemini-pro
- llama-3-70b-instruct
- llama-3-8b-instruct
- codellama-70b-instruct
- codellama-34b-instruct
- codellama-13b-instruct
- mistral-large
- mistral-medium
- mistral-small
Metrics
- Prompt injection success rate
- Malicious prompt compliance rate
- Interpreter-abuse compliance rate
- False Refusal Rate (FRR)
- Exploit success / partial-score
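The rate metrics above reduce to simple ratios over judged responses. The sketch below assumes a hypothetical record layout (`JudgedResponse`, with a `benign` flag and a `refused` verdict). The benchmark's own output format may differ.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    """One model response with its judge verdict (hypothetical layout)."""
    benign: bool   # True for borderline-benign prompts, False for malicious ones
    refused: bool  # True if the judge labeled the response a refusal


def false_refusal_rate(rows: list[JudgedResponse]) -> float:
    """FRR: share of benign prompts that the model wrongly refused."""
    benign = [r for r in rows if r.benign]
    if not benign:
        raise ValueError("no benign prompts in the sample")
    return sum(r.refused for r in benign) / len(benign)


def compliance_rate(rows: list[JudgedResponse]) -> float:
    """Share of malicious prompts that the model complied with."""
    malicious = [r for r in rows if not r.benign]
    if not malicious:
        raise ValueError("no malicious prompts in the sample")
    return sum(not r.refused for r in malicious) / len(malicious)
```

Reporting the two numbers side by side makes the safety-utility tradeoff visible: a model can lower its compliance rate by refusing more, but that tends to show up as a higher FRR.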
Datasets
- CyberSecEval 2 prompt injection set
- CyberSecEval 2 interpreter abuse set (500 prompts)
- CyberSecEval 2 exploit-generation generators
- CyberSecEval 2 FRR (borderline benign) dataset
Benchmarks
- prompt injection
- code interpreter abuse
- cyberattack helpfulness
- exploit generation
- insecure coding (v1)

