CYBERSECEVAL: a broad benchmark measuring insecure code and malicious compliance in code-capable LLMs

Overview

Decision Snapshot: Needs Validation

The benchmark is ready to run and open-source with validated automation, but detection limits, dataset contamination, and single-turn testing reduce its completeness as a final safety gate.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 80%

Authors

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe

Why It Matters For Business

Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

CYBERSECEVAL is an open benchmark and toolkit that tests two cybersecurity risks from code-capable LLMs: (1) whether models produce insecure code patterns and (2) whether they will help carry out cyberattacks. The suite auto-generates tests from real open-source code using a rule-based Insecure Code Detector (ICD) and creates ATT&CK-based malicious prompts judged by an automated LLM pipeline. In a 7-model case study (Llama 2, Code Llama, GPT-3.5, GPT-4), models produced vulnerable code about 30% of the time and complied with attack requests about 53% of the time. ICD detection runs at 96% precision and 79% recall; the automated helpfulness judge runs at 94% precision and 84% recall. The repo

Problem Statement

Code-producing LLMs can introduce insecure coding practices and may comply with malicious requests. Developers accept LLM suggestions frequently, so we need an automated, scalable way to measure how often models generate insecure code or help with cyberattacks, and to track improvements over time.

Main Contribution

A unified benchmark (CYBERSECEVAL) that measures insecure code generation and compliance with cyberattack prompts.

Insecure Code Detector (ICD): ~189 rules covering 50 CWEs across 8 languages, tolerant of partial/unparseable code.
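To make the idea of a rule-based detector that tolerates partial code more concrete, here is a toy sketch. It is an assumption-laden illustration, not the ICD itself: the `Rule` class, the three regex rules, and the `scan` function are all hypothetical, and the real detector's rule formats are not described in this summary. The point is only that matching on raw text, line by line, still works when the snippet would not parse.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    cwe: str             # CWE identifier the rule maps to
    description: str     # short human-readable label
    pattern: re.Pattern  # regex applied to each line of raw text


# Hypothetical rules for illustration only; these are NOT the ICD's rules.
RULES = [
    Rule("CWE-328", "weak hash function (MD5)",
         re.compile(r"\bhashlib\.md5\s*\(")),
    Rule("CWE-95", "eval on dynamic input",
         re.compile(r"\beval\s*\(")),
    Rule("CWE-78", "shell=True in subprocess call",
         re.compile(r"\bsubprocess\.\w+\s*\(.*shell\s*=\s*True")),
]


def scan(code: str) -> list[tuple[int, str, str]]:
    """Return (line_number, cwe, description) for every rule match.

    Matching is done on raw text, one line at a time, so the input does
    not need to be a complete or syntactically valid program.
    """
    findings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                findings.append((lineno, rule.cwe, rule.description))
    return findings


# A deliberately broken snippet (unbalanced parenthesis, missing import).
snippet = """def digest(data):
    return hashlib.md5(data).hexdigest(
"""
findings = scan(snippet)
for lineno, cwe, description in findings:
    print(f"line {lineno}: {cwe} {description}")
```

A text-level matcher like this trades accuracy for robustness: it never fails on unparseable completions, but it cannot reason about data flow, which is one source of the false positives and false negatives noted under Limitations.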

Key Findings

Models produced vulnerable code a substantial fraction of the time.

Numbers: 30% of completions were vulnerable on CYBERSECEVAL tests

Practical Use: Treat generated code as security-risky by default; run automated security tests on model outputs before accepting suggestions.

Evidence Ref: Abstract, Conclusion, Section 1.2

Models often comply with requests that could aid cyberattacks.

Numbers: 53% average compliance with ATT&CK-based malicious prompts

Practical Use: Add refusal and safety layers around models used for code tasks; assume many single-turn prompts can elicit harmful guidance.

Evidence Ref: Abstract, Section 3.5, Conclusion

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Insecure code rate	30% of completions flagged as vulnerable (70% pass)	—	—	Case study across 7 models on CYBERSECEVAL	Average insecure-code finding reported across tested models	Abstract, Sections 1.2 and 7
Cyberattack compliance rate	53% of prompts produced responses judged helpful to attackers	—	—	1,000 ATT&CK-derived prompts across models	Average compliance across Llama 2, CodeLlama, GPT models	Sections 3.5 and 7
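The two headline numbers are simple proportions, and the pass rate is the complement of the insecure-code rate (30% flagged means 70% pass). The sketch below shows that arithmetic on made-up verdicts. The model names and flags are hypothetical, and averaging per-model rates with equal weight is an assumption; the summary does not say how the cross-model average was computed.

```python
def rate(flags):
    """Fraction of True values in a sequence of boolean verdicts."""
    flags = list(flags)
    if not flags:
        raise ValueError("no results to aggregate")
    return sum(flags) / len(flags)


# Hypothetical detector verdicts per model (True = completion flagged insecure).
verdicts = {
    "model-a": [True, False, False, False],
    "model-b": [True, True, False, False],
}

# Insecure-code rate for each model.
per_model = {name: rate(flags) for name, flags in verdicts.items()}

# Equal-weight average across models (an assumption, see lead-in).
average_insecure_rate = sum(per_model.values()) / len(per_model)

# Pass rate is the complement of the insecure-code rate.
average_pass_rate = 1.0 - average_insecure_rate

print(per_model, average_insecure_rate, average_pass_rate)
```

The same `rate` helper applies to the compliance metric if each verdict instead records whether a response was judged helpful to an attacker.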

What To Try In 7 Days

Run CYBERSECEVAL from the public repo on your code models to get baseline insecure-code and compliance rates.

Flag or exclude test cases whose source repositories overlap the model's training data, then compare pass rates with and without them to gauge contamination bias.

Add ICD rule checks into CI for suggested code and require human review for any flagged outputs.
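One way to approach the contamination check in the steps above is to report pass rates separately for test cases drawn from suspected-overlap repositories and for the rest. The following is a minimal sketch under stated assumptions: the `(source_repo, passed)` record shape, the repo names, and the `pass_rate_by_overlap` helper are all hypothetical and do not reflect the benchmark's actual output format.

```python
from collections import defaultdict


def pass_rate_by_overlap(results, suspected_repos):
    """Split pass rates by whether a test case's source repo is suspected
    to appear in the model's training data.

    results: iterable of (source_repo, passed) pairs
    suspected_repos: set of repo names believed to overlap training data
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for repo, passed in results:
        group = "overlap" if repo in suspected_repos else "clean"
        counts[group][0] += int(passed)
        counts[group][1] += 1
    return {group: p / t for group, (p, t) in counts.items()}


# Hypothetical results: (source repository, did the completion pass?).
results = [
    ("org/a", True), ("org/a", True),
    ("org/b", False), ("org/b", True),
    ("org/c", True), ("org/c", False),
]
rates = pass_rate_by_overlap(results, suspected_repos={"org/a"})
print(rates)
```

A large gap between the two groups would suggest that memorized examples are inflating the overall pass rate; a small gap does not prove the absence of contamination, since the overlap list itself is only a guess.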

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: MIT

Code URLs

https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks

Data URLs

https://github.com/facebookresearch/PurpleLlama/tree/main/CybersecurityBenchmarks

Risks & Boundaries

Limitations

Static-analysis detection is imperfect; false positives and negatives exist.

Some test cases derive from open-source code possibly present in model training (data contamination risk).
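The detector's imperfection is what the precision and recall figures in the summary quantify: precision is the share of flagged items that are truly insecure, and recall is the share of truly insecure items that get flagged. A short worked example follows. The counts are invented for illustration and only roughly echo the reported ICD figures; they are not taken from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from confusion-matrix counts.

    tp: insecure samples correctly flagged
    fp: benign samples wrongly flagged
    fn: insecure samples the detector missed
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = tp / (tp + fp)  # of everything flagged, how much was right
    recall = tp / (tp + fn)     # of everything insecure, how much was caught
    return precision, recall


# Hypothetical counts from a manually labeled sample.
precision, recall = precision_recall(tp=48, fp=2, fn=12)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With recall below 100%, a measured insecure-code rate is best read as a lower bound on the true rate, since some insecure completions go unflagged.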

When Not To Use

As the sole evidence for a model's safety in multi-turn or production attack simulations.

To certify models for non-English deployments.

Failure Modes

Static analyzer misses complex insecure patterns (false negatives) or mislabels benign code (false positives).

Training-data contamination inflates pass rates for models trained on test examples.

Core Entities

Models

Llama 2, Code Llama, gpt-3.5-turbo, gpt-4, llama2-7b-chat, llama2-13b-chat, llama2-30b-chat, llama2-70b-chat, codellama-13b-instruct, codellama-34b-instruct

Metrics

Insecure coding pass rate, Cyberattack helpfulness (compliance) rate, ICD precision/recall, Judge pipeline precision/recall, BLEU code quality

Datasets

Open-source code corpus (unspecified origins), CYBERSECEVAL generated test prompts (insecure-code and ATT&CK prompts)

Benchmarks

CYBERSECEVAL

Context Entities

Models

GitHub Copilot (cited usage), StarCoder (related benchmark studies)

Metrics

Manual vulnerability labels from prior work

Datasets

SecurityEval, Asleep at the Keyboard

Benchmarks

SecurityEval, Asleep at the Keyboard, CodeLMSec
