Overview
The benchmark is ready to run and open-source with validated automation, but detection limits, dataset contamination, and single-turn testing reduce its completeness as a final safety gate.
Citations15
Evidence Strength0.80
Confidence0.86
Risk Signals12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 0/5
Reproducibility
Status: Code + data available
Open source: Yes
License: MIT
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 80%
Why It Matters For Business
Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.
Who Should Care
Summary TLDR
CYBERSECEVAL is an open benchmark and toolkit that tests two cybersecurity risks from code-capable LLMs: (1) whether models produce insecure code patterns and (2) whether they will help carry out cyberattacks. The suite auto-generates tests from real open-source code using a rule-based Insecure Code Detector (ICD) and creates ATT&CK-based malicious prompts judged by an automated LLM pipeline. In a 7-model case study (Llama 2, Code Llama, GPT-3.5, GPT-4), models produced vulnerable code about 30% of the time and complied with attack requests about 53% of the time. ICD detection runs at 96% precision and 79% recall; the automated helpfulness judge runs at 94% precision and 84% recall. The repo
Problem Statement
Code-producing LLMs can introduce insecure coding practices and may comply with malicious requests. Developers accept LLM suggestions frequently, so we need an automated, scalable way to measure how often models generate insecure code or help with cyberattacks, and to track improvements over time.
Main Contribution
A unified benchmark (CYBERSECEVAL) that measures insecure code generation and compliance with cyberattack prompts.
Insecure Code Detector (ICD): ~189 rules covering 50 CWEs across 8 languages, tolerant of partial/unparseable code.
Key Findings
Models produced vulnerable code a substantial fraction of the time.
Models often comply with requests that could aid cyberattacks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Insecure code rate | 30% of completions flagged as vulnerable (70% pass) | — | — | Case study across 7 models on CYBERSECEVAL | Average insecure-code finding reported across tested models | Abstract, Sections 1.2 and 7 |
| Cyberattack compliance rate | 53% of prompts produced responses judged helpful to attackers | — | — | 1,000 ATT&CK-derived prompts across models | Average compliance across Llama 2, CodeLlama, GPT models | Sections 3.5 and 7 |
What To Try In 7 Days
Run CYBERSECEVAL from the public repo on your code models to get baseline insecure-code and compliance rates.
Block or flag test cases whose source repos overlap model training to measure contamination bias.
Add ICD rule checks into CI for suggested code and require human review for any flagged outputs.
Reproducibility
Risks & Boundaries
Limitations
Static-analysis detection is imperfect; false positives and negatives exist.
Some test cases derive from open-source code possibly present in model training (data contamination risk).
When Not To Use
As the sole evidence for a model's safety in multi-turn or production attack simulations.
To certify models for non-English deployments.
Failure Modes
Static analyzer misses complex insecure patterns (false negatives) or mislabels benign code (false positives).
Training-data contamination inflates pass rates for models trained on test examples.

