Sallm: an automated benchmark, dataset, and metrics for measuring code-security of LLMs

November 1, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.45

Citation Count

6

Authors

Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, Anna Muller

Why It Matters For Business

Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.

Summary TLDR

Sallm is a framework for benchmarking how well code-generating LLMs produce secure Python code. It ships a curated dataset of 100 Python prompts covering 45 CWE types, runnable tests in Docker, a rule-based repair step that raises the average compilation rate from ~15% to ~75%, static (CodeQL) and dynamic (unit test) assessments, and two new metrics: secure@k (all top-k samples are vulnerability-free) and vulnerable@k (at least one of the top-k samples is vulnerable). Experiments on five models (two CodeGen variants, StarCoder, GPT-3.5, GPT-4) show trade-offs: GPT-4 is best at functional correctness (pass@k up to ~55%) but not the most secure; CodeGen-2.5-7B strikes the best correctness/security balance; StarCoder produces fewer flagged vulnerabilities but has low functional correctness.
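The two metrics can be made concrete with a small sketch. This summary does not reproduce the paper's exact estimator, so the code below assumes a pass@k-style estimate: for each prompt, n samples are generated, and the metric is the probability that k samples drawn without replacement satisfy the condition. The function names are illustrative and not taken from the Sallm code.

```python
from math import comb
from typing import Sequence, Tuple


def _check(n: int, count: int, k: int) -> None:
    """Validate that 0 <= count <= n and 1 <= k <= n."""
    if not (0 <= count <= n and 1 <= k <= n):
        raise ValueError("need 0 <= count <= n and 1 <= k <= n")


def secure_at_k(n: int, s: int, k: int) -> float:
    """Probability that all k samples drawn from n are secure (s secure in total)."""
    _check(n, s, k)
    return comb(s, k) / comb(n, k)


def vulnerable_at_k(n: int, v: int, k: int) -> float:
    """Probability that at least one of k drawn samples is vulnerable (v in total)."""
    _check(n, v, k)
    return 1.0 - comb(n - v, k) / comb(n, k)


def mean_secure_at_k(per_prompt: Sequence[Tuple[int, int]], k: int) -> float:
    """Average secure@k over prompts; each item is (n samples, s secure samples)."""
    return sum(secure_at_k(n, s, k) for n, s in per_prompt) / len(per_prompt)
```

Under this assumption, and if every sample is classified as either secure or vulnerable, the two metrics for a single prompt sum to 1; they diverge only when some samples cannot be assessed at all.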

Problem Statement

Current code-generation benchmarks focus on functional correctness and use prompts that do not reflect security-sensitive engineering tasks. Evaluation metrics ignore security, so models may appear good while producing exploitable code when integrated into real projects.

Main Contribution

A framework (Sallm) that automates security-focused benchmarking for code LLMs.

A curated dataset of 100 Python prompts labeled with CWE IDs and runnable insecure examples.

Automated assessment combining static analysis (CodeQL) and dynamic testing in Docker.

Two new metrics, secure@k and vulnerable@k, plus a repair step that improves compilation rates.

Key Findings

The repair component greatly increases the executability of model outputs.

Numbers: average compilation rate rose from 15% to 75%; GPT-4 rose from <1% to 89%
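To illustrate what a rule-based repair step can look like, here is a minimal sketch. The two rules it applies (extract the body of a markdown code fence, then drop trailing lines until the code parses) are assumptions chosen for illustration; this summary does not list the rules Sallm actually uses.

```python
import ast
import re
from typing import Optional

FENCE = "`" * 3  # markdown code fence delimiter
FENCE_RE = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def compiles(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def repair(output: str) -> Optional[str]:
    """Hypothetical repair: unwrap a fenced block, then trim trailing lines."""
    match = FENCE_RE.search(output)
    code = match.group(1) if match else output
    lines = code.splitlines()
    # Drop lines from the end until the remaining prefix parses.
    while lines:
        candidate = "\n".join(lines)
        if candidate.strip() and compiles(candidate):
            return candidate
        lines.pop()
    return None  # nothing parseable could be recovered
```

A repair like this only guarantees that the result parses. It says nothing about whether the recovered code is correct or secure, which is why the tests and static analysis still run afterwards.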

Functional correctness (pass@k) varies widely across models and temperatures.

Numbers: pass@k ranged 5.5%–54.8% across models/temps

Security scores and correctness do not align; the best model on one may not be the best on the other.

Numbers: vulnerable@k ranged ~16%–59%; GPT-4 was best for correctness but weaker on the correctness/security balance

Model trade-offs emerged: CodeGen-2.5-7B balanced security and correctness best on these prompts.

Numbers: CodeGen-2.5-7B had the top harmonic mean of pass@k and secure@k in the experiments
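The harmonic mean rewards models that do well on both metrics and penalizes a model that is strong on only one. A minimal sketch, with made-up input values rather than figures from the paper:

```python
def harmonic_mean(pass_at_k: float, secure_at_k: float) -> float:
    """Harmonic mean of two scores in [0, 1]; 0 if either score is 0."""
    if pass_at_k <= 0 or secure_at_k <= 0:
        return 0.0
    return 2 * pass_at_k * secure_at_k / (pass_at_k + secure_at_k)


# Illustrative values only: a balanced model versus a lopsided one.
balanced = harmonic_mean(0.5, 0.5)   # equal scores give back the same value
lopsided = harmonic_mean(0.9, 0.1)   # pulled toward the weaker score
```

Both hypothetical models have the same arithmetic mean (0.5), but the lopsided one scores far lower on the harmonic mean.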

StarCoder produced fewer flagged vulnerabilities but had low functional correctness.

Numbers: StarCoder pass@k averaged ~15.5%; lower vulnerable@k than many models

Results

Dataset size

Value: 100 Python prompts

Baseline: SecurityEval 121 prompts; LLMSecEval 150 (83 Python)

CWE coverage

Value: 45 CWE types

Baseline: LLMSecEval 18 CWEs; SecurityEval 69 CWEs

Compilation / executability

Value: avg compilation 75% after repair

Baseline: 15% before repair

Functional correctness (pass@k)

Value: 5.5%–54.8% (varies by model & temp)

Vulnerability rates (vulnerable@k)

Value: 16%–59% across static/test assessments

Best correctness-security balance

Value: CodeGen-2.5-7B

Baseline: GPT-4 best correctness

Who Should Care

What To Try In 7 Days

Run Sallm (or similar) on your top models to measure secure@k and pass@k on representative prompts

Add a simple repair step to post-process model outputs and increase runnable rates before tests

Combine static analysis (CodeQL) and unit tests in sandboxed Docker to catch common vulnerabilities automatically
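For the sandboxing step, one possible setup is to run each generated sample's unit tests in a throwaway container with no network access. The sketch below is an assumption about how that could be wired up, not Sallm's own harness: the image name, resource limits, and directory layout are placeholders, and it requires a local Docker installation to actually run.

```python
import subprocess
from pathlib import Path
from typing import List


def build_sandbox_command(workdir: Path, image: str = "python:3.11-slim") -> List[str]:
    """Build a `docker run` command that executes unittest discovery in isolation."""
    return [
        "docker", "run", "--rm",            # remove the container when it exits
        "--network", "none",                # no network access for untrusted code
        "--memory", "512m",                 # placeholder resource limits
        "--cpus", "1",
        "-e", "PYTHONDONTWRITEBYTECODE=1",  # the mount below is read-only
        "-v", f"{workdir.resolve()}:/work:ro",
        "-w", "/work",
        image,
        "python", "-m", "unittest", "discover", "-s", "/work",
    ]


def run_tests(workdir: Path, timeout_s: int = 60) -> bool:
    """Return True if the tests pass inside the container within the timeout."""
    try:
        result = subprocess.run(
            build_sandbox_command(workdir),
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Calling `run_tests(Path("samples/prompt_001"))` (a hypothetical directory holding one generated sample and its tests) would report whether that sample passes. Static analysis results can then be combined with this outcome when counting secure and vulnerable samples.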

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset covers only Python prompts, so results may not generalize to other languages
  • Static analysis (CodeQL) can report false positives/negatives; dynamic tests depend on test coverage
  • Prompts were manually crafted by the authors; selection and mapping to CWEs may reflect author choices

When Not To Use

  • When you need multi-language security benchmarking beyond Python
  • For large integrated codebases where single-file unit tests cannot model system interactions
  • If you require formal proofs of absence of vulnerabilities

Failure Modes

  • Rule-based repair can mask generation issues or create syntactically valid but logically incorrect code
  • The static analyzer may miss novel vulnerability patterns or flag benign constructs
  • Unit tests may not cover all attack vectors, giving a false sense of security

Core Entities

Models

  • CodeGen-2B-mono
  • CodeGen-2.5-7B-mono
  • StarCoder
  • GPT-3.5-Turbo
  • GPT-4

Metrics

  • pass@k
  • secure@k
  • vulnerable@k

Datasets

  • Sallm dataset (100 Python prompts)
  • SecurityEval
  • LLMSecEval
  • HumanEval

Benchmarks

  • Sallm