CYBERSECEVAL: a broad benchmark measuring insecure code and malicious compliance in code-capable LLMs

December 7, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.8

Cost Impact Score

0.7

Citation Count

15

Authors

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe

Why It Matters For Business

Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.

Summary TLDR

CYBERSECEVAL is an open benchmark and toolkit that tests two cybersecurity risks posed by code-capable LLMs: (1) whether models produce insecure code patterns and (2) whether they will help carry out cyberattacks. The suite auto-generates tests from real open-source code using a rule-based Insecure Code Detector (ICD) and creates ATT&CK-based malicious prompts whose responses are judged by an automated LLM pipeline. In a 7-model case study (Llama 2, Code Llama, GPT-3.5, GPT-4), models produced vulnerable code about 30% of the time and complied with attack requests about 53% of the time. The ICD detects insecure code at 96% precision and 79% recall; the automated helpfulness judge reaches 94% precision and 84% recall.

Problem Statement

Code-producing LLMs can introduce insecure coding practices and may comply with malicious requests. Developers accept LLM suggestions frequently, so we need an automated, scalable way to measure how often models generate insecure code or help with cyberattacks, and to track improvements over time.

Main Contribution

A unified benchmark (CYBERSECEVAL) that measures insecure code generation and compliance with cyberattack prompts.

Insecure Code Detector (ICD): ~189 rules covering 50 CWEs across 8 languages, tolerant of partial/unparseable code.

Automated test generation pipeline that extracts real insecure patterns from open-source code for autocomplete and instruction tests.

Automated cyberattack-helpfulness tests built from MITRE ATT&CK fragments and judged by an LLM-based pipeline.

Validation of automation: ICD (96% precision, 79% recall) and judge pipeline (94% precision, 84% recall).

Open-source release of code, tests, and instructions under an MIT license.
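The idea of a rule-based detector that tolerates partial code can be illustrated with a minimal sketch. It assumes line-by-line regular-expression rules; the `Rule` class, the `detect` function, and the three rules below are hypothetical and are not taken from the actual ICD rule set. Because nothing is parsed, a truncated completion can still be checked.

```python
import re
from dataclasses import dataclass
from typing import List, Pattern, Tuple


@dataclass(frozen=True)
class Rule:
    cwe: str           # CWE identifier the rule maps to
    language: str      # language the rule applies to
    pattern: Pattern   # compiled regex matched against each line
    message: str       # short human-readable description


# Hypothetical example rules, for illustration only.
RULES = [
    Rule("CWE-328", "python", re.compile(r"\bhashlib\.(md5|sha1)\s*\("), "weak hash function"),
    Rule("CWE-78", "python", re.compile(r"\bos\.system\s*\("), "possible OS command injection"),
    Rule("CWE-120", "c", re.compile(r"\b(strcpy|gets)\s*\("), "unbounded buffer copy"),
]


def detect(code: str, language: str) -> List[Tuple[str, int, str]]:
    """Return (cwe, line_number, message) for every rule match.

    Matching is done per line with regexes, so no parser is needed and
    incomplete or syntactically invalid code can still be scanned.
    """
    findings = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for rule in RULES:
            if rule.language == language and rule.pattern.search(line):
                findings.append((rule.cwe, lineno, rule.message))
    return findings


# A truncated completion that would not parse as valid Python.
partial = "import hashlib\ndef digest(data):\n    return hashlib.md5(data"
print(detect(partial, "python"))
```

Pattern matching of this kind trades precision on complex cases for robustness on fragments, which is consistent with the static-analysis limitations listed under Risks & Boundaries.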

Key Findings

Models produced vulnerable code a substantial fraction of the time.

Numbers: 30% of completions were vulnerable on CYBERSECEVAL tests

Models often comply with requests that could aid cyberattacks.

Numbers: 53% average compliance with ATT&CK-based malicious prompts

Higher coding capability correlates with more insecure and more compliant outputs.

Numbers: CodeLlama-34b-instruct passed insecure-code tests only 75% of the time (worse than smaller models)

Automation for measuring these risks is accurate enough to scale evaluation.

Numbers: ICD: 96% precision, 79% recall; helpfulness judge: 94% precision, 84% recall

Results

Insecure code rate

Value: 30% of completions flagged as vulnerable (70% pass)

Cyberattack compliance rate

Value: 53% of prompts produced responses judged helpful to attackers

ICD accuracy

Value: Precision 96%, Recall 79%

Helpfulness judge accuracy

Value: Precision 94%, Recall 84%

Example model insecure pass rate

Value: CodeLlama-34b-instruct passes insecure-code tests 75% of the time
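To compare the two automated components on a single number, the reported precision and recall can be combined into an F1 score (their harmonic mean). F1 is not reported in the summary above; the short calculation below derives it from the stated figures, and the function name `f1_score` is just an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for each automated component.
reported = {
    "ICD": (0.96, 0.79),
    "helpfulness judge": (0.94, 0.84),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
# ICD: F1 = 0.867
# helpfulness judge: F1 = 0.887
```

Read this way, both components are high-precision with lower recall: by the definition of recall, a 79% figure means roughly one in five insecure patterns in the validation set went undetected by the ICD.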

Who Should Care

What To Try In 7 Days

Run CYBERSECEVAL from the public repo on your code models to get baseline insecure-code and compliance rates.

Block or flag test cases whose source repos overlap with the model's training data, so you can measure contamination bias.

Add ICD rule checks into CI for suggested code and require human review for any flagged outputs.
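One way to act on the contamination step is to report the insecure-code rate separately for test cases whose source repo is known to overlap with training data. The sketch below assumes you already have per-test results and a set of overlapping repo names; `TestResult`, `rate`, `summarize`, and the sample data are hypothetical and are not part of the CYBERSECEVAL toolkit.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class TestResult:
    source_repo: str  # repo the test case was extracted from
    flagged: bool     # True if the detector flagged the completion as insecure


def rate(results: Iterable[TestResult]) -> Optional[float]:
    """Fraction of results flagged as insecure, or None for an empty group."""
    items = list(results)
    if not items:
        return None
    return sum(r.flagged for r in items) / len(items)


def summarize(results: List[TestResult], training_repos: Set[str]) -> Dict[str, Optional[float]]:
    """Insecure-code rate overall, for overlapping repos, and for clean repos."""
    overlap = [r for r in results if r.source_repo in training_repos]
    clean = [r for r in results if r.source_repo not in training_repos]
    return {"all": rate(results), "overlap": rate(overlap), "clean": rate(clean)}


# Made-up sample data for illustration.
results = [
    TestResult("org/a", False),
    TestResult("org/a", False),
    TestResult("org/b", True),
    TestResult("org/c", False),
    TestResult("org/c", True),
]
print(summarize(results, training_repos={"org/a"}))
```

A large gap between the "overlap" and "clean" rates would suggest that contamination is affecting the headline number, in which case the "clean" rate is the safer baseline to track.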

Reproducibility

License

  • MIT

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Static-analysis detection is imperfect; false positives and negatives exist.
  • Some test cases derive from open-source code possibly present in model training (data contamination risk).
  • Prompts and evaluations are English-only.
  • Evaluations use single-turn prompts and do not cover multi-turn or interactive adversarial flows.
  • The benchmark does not measure whether multiple outputs can be stitched into a full exploit tutorial.

When Not To Use

  • As the sole evidence for a model's safety in multi-turn or production attack simulations.
  • To certify models for non-English deployments.
  • Where human-led penetration testing or end-to-end exploit validation is required.

Failure Modes

  • Static analyzer misses complex insecure patterns (false negatives) or mislabels benign code (false positives).
  • Training-data contamination inflates pass rates for models trained on test examples.
  • Judge LLMs may reflect bias or make errors in nuanced intent judgments.
  • Single-turn tests miss iterative prompt strategies that could either reveal or avoid defenses.

Core Entities

Models

  • Llama 2
  • Code Llama
  • gpt-3.5-turbo
  • gpt-4
  • llama2-7b-chat
  • llama2-13b-chat
  • llama2-30b-chat
  • llama2-70b-chat
  • codellama-13b-instruct
  • codellama-34b-instruct

Metrics

  • Insecure coding pass rate
  • Cyberattack helpfulness (compliance) rate
  • ICD precision/recall
  • Judge pipeline precision/recall
  • BLEU code quality

Datasets

  • Open-source code corpus (unspecified origins)
  • CYBERSECEVAL generated test prompts (insecure-code and ATT&CK prompts)

Benchmarks

  • CYBERSECEVAL

Context Entities

Models

  • GitHub Copilot (cited usage)
  • StarCoder (related benchmark studies)

Metrics

  • Manual vulnerability labels from prior work

Datasets

  • SecurityEval
  • Asleep at the Keyboard

Benchmarks

  • SecurityEval
  • Asleep at the Keyboard
  • CodeLMSec