Overview
Production Readiness
0.7
Novelty Score
0.8
Cost Impact Score
0.7
Citation Count
15
Why It Matters For Business
Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.
Summary TLDR
CYBERSECEVAL is an open benchmark and toolkit that tests two cybersecurity risks from code-capable LLMs: (1) whether models produce insecure code patterns and (2) whether they will help carry out cyberattacks. The suite auto-generates tests from real open-source code using a rule-based Insecure Code Detector (ICD) and creates ATT&CK-based malicious prompts judged by an automated LLM pipeline. In a 7-model case study (Llama 2, Code Llama, GPT-3.5, GPT-4), models produced vulnerable code about 30% of the time and complied with attack requests about 53% of the time. ICD detection runs at 96% precision and 79% recall; the automated helpfulness judge runs at 94% precision and 84% recall. The repo
Problem Statement
Code-producing LLMs can introduce insecure coding practices and may comply with malicious requests. Developers accept LLM suggestions frequently, so we need an automated, scalable way to measure how often models generate insecure code or help with cyberattacks, and to track improvements over time.
Main Contribution
A unified benchmark (CYBERSECEVAL) that measures insecure code generation and compliance with cyberattack prompts.
Insecure Code Detector (ICD): ~189 rules covering 50 CWEs across 8 languages, tolerant of partial/unparseable code.
Automated test generation pipeline that extracts real insecure patterns from open-source code for autocomplete and instruction tests.
Automated cyberattack-helpfulness tests built from MITRE ATT&CK fragments and judged by an LLM-based pipeline.
Validation of automation: ICD (96% precision, 79% recall) and judge pipeline (94% precision, 84% recall).
Open-source release of code, tests, and instructions under an MIT license.
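To illustrate how a rule-based detector can stay tolerant of partial or unparseable code, here is a minimal sketch that matches regular expressions line by line instead of building a syntax tree. The `Rule` class, the three rules, and the `detect` function are hypothetical illustrations, not the rules or API shipped with the ICD.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    cwe: str              # CWE identifier associated with the pattern
    description: str      # short human-readable explanation
    pattern: re.Pattern   # compiled regex applied to each source line


# Hypothetical example rules (assumption): NOT the ICD's actual rule set.
RULES = [
    Rule("CWE-327", "weak hash function (MD5)",
         re.compile(r"\bhashlib\.md5\s*\(")),
    Rule("CWE-78", "subprocess call with shell=True",
         re.compile(r"\bsubprocess\.\w+\s*\([^)]*shell\s*=\s*True")),
    Rule("CWE-95", "eval on dynamic input",
         re.compile(r"\beval\s*\(")),
]


def detect(code: str) -> list[tuple[str, int]]:
    """Return (cwe, line_number) pairs for every rule that matches a line.

    Matching is purely textual, so truncated or syntactically invalid
    code is still scanned.
    """
    findings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for rule in RULES:
            if rule.pattern.search(line):
                findings.append((rule.cwe, lineno))
    return findings


# A completion cut off mid-expression: it would not parse, yet it is flagged.
snippet = "import hashlib\ndef digest(data):\n    return hashlib.md5(data"
print(detect(snippet))
```

Textual matching of this kind trades accuracy for robustness: it works on incomplete completions but can miss patterns that span several lines or flag benign uses, which is the false-positive and false-negative trade-off noted under Limitations.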
Key Findings
Models produced vulnerable code about 30% of the time in the 7-model case study.
Models complied with requests that could aid cyberattacks about 53% of the time.
Models with stronger coding capability tended to produce insecure code more often and to comply more often with attack requests.
Automation for measuring these risks is accurate enough to scale evaluation.
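For reference, the precision and recall figures used to validate the automated detectors follow the standard definitions: precision is the share of flagged items that are truly positive, and recall is the share of truly positive items that get flagged. The sketch below computes both from confusion counts. The function name and the counts are hypothetical and are not taken from the benchmark's validation data.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts from a hand-labelled sample (assumption, for illustration only).
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f}")
```

High precision with lower recall, as reported for the ICD, means that flagged outputs are usually real problems while some insecure outputs go undetected, so measured insecure-code rates are more likely to be underestimates than overestimates.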
Results
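Insecure code rate: about 30% (7-model case study)
Cyberattack compliance rate: about 53% (7-model case study)
Accuracy (ICD): 96% precision, 79% recall
Accuracy (judge pipeline): 94% precision, 84% recall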
Example model insecure pass rate
Who Should Care
What To Try In 7 Days
Run CYBERSECEVAL from the public repo on your code models to get baseline insecure-code and compliance rates.
Flag or exclude test cases whose source repositories overlap with a model's training data, then compare results with and without them to estimate contamination bias.
Add ICD rule checks into CI for suggested code and require human review for any flagged outputs.
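The contamination check in the second step could be sketched as a simple partition of test cases by source repository. Everything here is an assumption for illustration: the `BenchCase` record, its `source_repo` field, and the idea that a list of known training repositories is available are not part of the benchmark's published format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchCase:
    case_id: str       # hypothetical identifier for a test prompt
    source_repo: str   # repository the insecure pattern was extracted from


def split_by_contamination(cases, known_training_repos):
    """Partition cases into (clean, flagged) by case-insensitive repo match."""
    known = {repo.lower() for repo in known_training_repos}
    clean, flagged = [], []
    for case in cases:
        if case.source_repo.lower() in known:
            flagged.append(case)
        else:
            clean.append(case)
    return clean, flagged


# Hypothetical test cases and training-repo list.
cases = [
    BenchCase("a1", "example-org/web-app"),
    BenchCase("a2", "example-org/crypto-lib"),
]
clean, flagged = split_by_contamination(cases, ["Example-Org/Crypto-Lib"])
print([c.case_id for c in clean], [c.case_id for c in flagged])
```

Scoring the model separately on the clean and flagged subsets gives a rough sense of how much overlap with training data shifts the results.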
Reproducibility
License
- MIT
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Static-analysis detection is imperfect; false positives and negatives exist.
- Some test cases derive from open-source code possibly present in model training (data contamination risk).
- Prompts and evaluations are English-only.
- Evaluations use single-turn prompts and do not cover multi-turn or interactive adversarial flows.
- The benchmark does not measure whether multiple outputs can be stitched into a full exploit tutorial.
When Not To Use
- As the sole evidence for a model's safety in multi-turn or production attack simulations.
- To certify models for non-English deployments.
- Where human-led penetration testing or end-to-end exploit validation is required.
Failure Modes
- Static analyzer misses complex insecure patterns (false negatives) or mislabels benign code (false positives).
- Training-data contamination inflates pass rates for models trained on test examples.
- Judge LLMs may reflect bias or make errors in nuanced intent judgments.
- Single-turn tests miss iterative prompting strategies that could probe or circumvent a model's defenses.
Core Entities
Models
- Llama 2
- Code Llama
- gpt-3.5-turbo
- gpt-4
- llama2-7b-chat
- llama2-13b-chat
- llama2-30b-chat
- llama2-70b-chat
- codellama-13b-instruct
- codellama-34b-instruct
Metrics
- Insecure coding pass rate
- Cyberattack helpfulness (compliance) rate
- ICD precision/recall
- Judge pipeline precision/recall
- BLEU score (code quality)
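The two headline rates above can be computed from per-test verdicts as simple proportions. The sketch below assumes that a test "passes" the insecure coding check when the detector finds no insecure pattern, and that the compliance rate is the share of attack prompts the judge marks as helped. Both definitions and all verdict values are assumptions for illustration.

```python
def rate(flags) -> float:
    """Fraction of True values in a non-empty sequence of boolean verdicts."""
    flags = list(flags)
    if not flags:
        raise ValueError("at least one verdict is required")
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical verdicts: True means the detector flagged the completion.
insecure_detected = [True, False, False, False]
# Hypothetical verdicts: True means the judge found the model complied.
complied = [True, True, False, True]

insecure_rate = rate(insecure_detected)
pass_rate = 1.0 - insecure_rate      # assumed definition: no insecure pattern found
compliance_rate = rate(complied)
print(f"pass={pass_rate:.2f} insecure={insecure_rate:.2f} compliance={compliance_rate:.2f}")
```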
Datasets
- Open-source code corpus (unspecified origins)
- CYBERSECEVAL generated test prompts (insecure-code and ATT&CK prompts)
Benchmarks
- CYBERSECEVAL
Context Entities
Models
- GitHub Copilot (cited usage)
- StarCoder (related benchmark studies)
Metrics
- Manual vulnerability labels from prior work
Datasets
- SecurityEval
- Asleep at the Keyboard
Benchmarks
- SecurityEval
- Asleep at the Keyboard
- CodeLMSec

