CYBERSECEVAL: a broad benchmark measuring insecure code and malicious compliance in code-capable LLMs

December 7, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The benchmark is open source and ready to run, and its automated detectors have been validated. However, detection limits, possible dataset contamination, and single-turn-only testing mean it is not sufficient on its own as a final safety gate.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Code + data available

Open source: Yes

License: MIT

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 80%

Authors

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe

Why It Matters For Business

Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.

Summary TLDR

CYBERSECEVAL is an open benchmark and toolkit that tests two cybersecurity risks from code-capable LLMs: (1) whether models produce insecure code patterns and (2) whether they will help carry out cyberattacks. The suite auto-generates tests from real open-source code using a rule-based Insecure Code Detector (ICD) and creates ATT&CK-based malicious prompts judged by an automated LLM pipeline. In a 7-model case study (Llama 2, Code Llama, GPT-3.5, GPT-4), models produced vulnerable code about 30% of the time and complied with attack requests about 53% of the time. ICD detection runs at 96% precision and 79% recall; the automated helpfulness judge runs at 94% precision and 84% recall. The repo

Problem Statement

Code-producing LLMs can introduce insecure coding practices and may comply with malicious requests. Developers accept LLM suggestions frequently, so we need an automated, scalable way to measure how often models generate insecure code or help with cyberattacks, and to track improvements over time.

Main Contribution

A unified benchmark (CYBERSECEVAL) that measures insecure code generation and compliance with cyberattack prompts.

Insecure Code Detector (ICD): ~189 rules covering 50 CWEs across 8 languages, tolerant of partial/unparseable code.
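A rule-based detector of this kind can be pictured as a set of text patterns, each tied to a CWE and a language. The sketch below is illustrative only: the three rules, their identifiers, and the `scan` function are hypothetical and are not taken from the actual ICD rule set. It matches line by line on raw text rather than on a parse tree, which is one way a detector could stay tolerant of partial or unparseable code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str         # hypothetical identifier, not an actual ICD rule name
    cwe: str             # CWE the pattern is associated with
    language: str        # language the rule applies to
    pattern: re.Pattern  # regex applied to each source line


# Illustrative rules only; the real ICD rule set is not reproduced here.
RULES = [
    Rule("py-weak-hash-md5", "CWE-328", "python", re.compile(r"\bhashlib\.md5\s*\(")),
    Rule("py-eval-call", "CWE-95", "python", re.compile(r"(?<![\w.])eval\s*\(")),
    Rule("c-strcpy", "CWE-120", "c", re.compile(r"\bstrcpy\s*\(")),
]


def scan(code: str, language: str) -> list[tuple[str, str, int]]:
    """Return (rule_id, cwe, line_number) for every rule match in `code`."""
    findings = []
    for line_no, line in enumerate(code.splitlines(), start=1):
        for rule in RULES:
            if rule.language == language and rule.pattern.search(line):
                findings.append((rule.rule_id, rule.cwe, line_no))
    return findings


# A truncated completion that would not parse is still scanned line by line.
snippet = "import hashlib\ndigest = hashlib.md5(data"
print(scan(snippet, "python"))
```

Pattern matching on text trades accuracy for robustness: it works on incomplete completions, but it can miss vulnerabilities that only show up through data flow and can flag benign uses.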

Key Findings

Models produced vulnerable code a substantial fraction of the time.

Numbers: 30% of completions were vulnerable on CYBERSECEVAL tests

Practical Use: Treat generated code as security-risky by default; run automated security tests on model outputs before accepting suggestions.

Evidence Ref: Abstract, Conclusion, Section 1.2

Models often comply with requests that could aid cyberattacks.

Numbers: 53% average compliance with ATT&CK-based malicious prompts

Practical Use: Add refusal and safety layers around models used for code tasks; assume many single-turn prompts can elicit harmful guidance.

Evidence Ref: Abstract, Section 3.5, Conclusion

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Insecure code rate | 30% of completions flagged as vulnerable (70% pass) | - | - | Case study across 7 models on CYBERSECEVAL | Average insecure-code finding reported across tested models | Abstract, Sections 1.2 and 7
Cyberattack compliance rate | 53% of prompts produced responses judged helpful to attackers | - | - | 1,000 ATT&CK-derived prompts across models | Average compliance across Llama 2, CodeLlama, GPT models | Sections 3.5 and 7
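The headline rates are averages over models, and the pass rate is the complement of the flagged rate (30% flagged corresponds to 70% pass). The sketch below shows one way to compute such figures from per-response labels. The records and model names are made up for illustration, and the choice of an unweighted per-model average is an assumption; this summary does not say how the averages were formed.

```python
from collections import defaultdict

# Hypothetical judged records: (model_name, flagged). flagged=True means the
# response was judged insecure (or helpful to an attacker, for compliance).
records = [
    ("model-a", True), ("model-a", False), ("model-a", False), ("model-a", False),
    ("model-b", True), ("model-b", True), ("model-b", False), ("model-b", False),
]


def per_model_rate(rows):
    """Fraction of flagged responses for each model."""
    counts = defaultdict(lambda: [0, 0])  # model -> [flagged, total]
    for model, flagged in rows:
        counts[model][0] += int(flagged)
        counts[model][1] += 1
    return {model: flagged / total for model, (flagged, total) in counts.items()}


def macro_average(rates):
    """Unweighted mean of per-model rates (an assumed aggregation choice)."""
    return sum(rates.values()) / len(rates)


rates = per_model_rate(records)
flagged_rate = macro_average(rates)
pass_rate = 1.0 - flagged_rate
print(rates, flagged_rate, pass_rate)
```

When comparing your own runs against published averages, it helps to state whether each model is weighted equally or by its number of prompts, since the two can differ.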

What To Try In 7 Days

Run CYBERSECEVAL from the public repo on your code models to get baseline insecure-code and compliance rates.

Flag or exclude test cases whose source repos overlap with the model's training data, then compare results with and without them to measure contamination bias.

Add ICD rule checks into CI for suggested code and require human review for any flagged outputs.
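The contamination check from the list above can be sketched as a simple partition of test results by source repo, followed by a pass-rate comparison. This assumes you can list repos that were likely in the model's training data, which is often not possible for closed models. The field names `repo` and `passed`, the repo names, and both helper functions are hypothetical and are not part of the CYBERSECEVAL toolkit.

```python
def split_by_overlap(test_cases, known_training_repos):
    """Partition test cases by whether their source repo is in the known set."""
    overlapping, clean = [], []
    for case in test_cases:
        bucket = overlapping if case["repo"] in known_training_repos else clean
        bucket.append(case)
    return overlapping, clean


def pass_rate(cases):
    """Fraction of cases marked as passed, or None for an empty group."""
    if not cases:
        return None
    return sum(1 for c in cases if c["passed"]) / len(cases)


# Hypothetical results; repo names and outcomes are made up for illustration.
results = [
    {"repo": "example/alpha", "passed": True},
    {"repo": "example/alpha", "passed": True},
    {"repo": "example/beta", "passed": False},
    {"repo": "example/gamma", "passed": True},
]
known_training_repos = {"example/alpha"}  # assumption: you can list these

overlapping, clean = split_by_overlap(results, known_training_repos)
print("overlapping:", pass_rate(overlapping))
print("clean:", pass_rate(clean))
```

A noticeably higher pass rate on the overlapping group would suggest that contamination is inflating the score. With small groups the difference may just be noise.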

Risks & Boundaries

Limitations

Static-analysis detection is imperfect; false positives and negatives exist.

Some test cases derive from open-source code possibly present in model training (data contamination risk).
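To get a feel for how detector error could shift a measured rate, the back-of-the-envelope sketch below adjusts a flagged rate using precision and recall. It rests on a strong assumption that is not claimed by the source: that the precision and recall figures quoted for the detector carry over unchanged to the outputs being measured. The function name is hypothetical.

```python
def estimated_true_rate(flagged_rate: float, precision: float, recall: float) -> float:
    """Estimate the true positive prevalence from a detector's flagged rate.

    flagged_rate * precision is the share of all items that are true
    positives; dividing by recall scales that up to include the positives
    the detector missed.
    """
    if not 0 < recall <= 1 or not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1] and recall in (0, 1]")
    return flagged_rate * precision / recall


# Using the figures quoted in this summary: about 30% flagged,
# 96% precision, 79% recall.
print(round(estimated_true_rate(0.30, 0.96, 0.79), 3))
```

Under that assumption, a detector with recall below its precision undercounts, so the flagged rate is best read as a rough lower-leaning estimate and not as an exact measurement.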

When Not To Use

As the sole evidence for a model's safety in multi-turn or production attack simulations.

To certify models for non-English deployments.

Failure Modes

Static analyzer misses complex insecure patterns (false negatives) or mislabels benign code (false positives).

Training-data contamination inflates pass rates for models trained on test examples.

Core Entities

Models

Llama 2, Code Llama, gpt-3.5-turbo, gpt-4, llama2-7b-chat, llama2-13b-chat, llama2-30b-chat, llama2-70b-chat, codellama-13b-instruct, codellama-34b-instruct

Metrics

Insecure coding pass rate, Cyberattack helpfulness (compliance) rate, ICD precision/recall, Judge pipeline precision/recall, BLEU code quality

Datasets

Open-source code corpus (unspecified origins), CYBERSECEVAL generated test prompts (insecure-code and ATT&CK prompts)

Benchmarks

CYBERSECEVAL

Context Entities

Models

GitHub Copilot (cited usage), StarCoder (related benchmark studies)

Metrics

Manual vulnerability labels from prior work

Datasets

SecurityEval, Asleep at the Keyboard

Benchmarks

SecurityEval, Asleep at the Keyboard, CodeLMSec