Sallm: an automated benchmark, dataset, and metrics for measuring code-security of LLMs

Overview

Decision Snapshot: Needs Validation

Sallm provides a practical dataset, runnable tests, and new metrics, but it covers only Python and relies on CodeQL (static analysis) plus unit tests; findings are well supported within those bounds.

Citations: 6

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 45%

Production readiness: 60%

Novelty: 65%

Authors

Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, Anne Muller

Why It Matters For Business

Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager, Data Scientist

Summary TLDR

Sallm is a framework for benchmarking how well code-generating LLMs produce secure Python code. It ships a curated dataset of 100 Python prompts (45 CWE types), runnable tests in Docker, a rule-based repair step that raises the average compilation rate from ~15% to ~75%, static (CodeQL) and dynamic (unit test) assessments, and two new metrics: secure@k (all top-k samples are vulnerability-free) and vulnerable@k (at least one of the top-k samples is vulnerable). Experiments on five models (two CodeGen variants, StarCoder, GPT-3.5, GPT-4) show trade-offs: GPT-4 is best at functional correctness (pass@k up to ~55%) but not the most secure; CodeGen-2.5-7B strikes the best balance between correctness and security; StarCoder produces fewer flagged vulnerabilities.
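The two security metrics can be made concrete with a small sketch. It assumes an unbiased combinatorial estimator in the style of pass@k: for each prompt, n samples are generated, s of them are vulnerability-free and v are vulnerable, and k samples are drawn without replacement. The function names and the counts in the usage lines are illustrative assumptions, not taken from the Sallm code base.

```python
from math import comb


def secure_at_k(n: int, s: int, k: int) -> float:
    """Chance that k samples drawn from n are ALL vulnerability-free,
    given that s of the n samples are vulnerability-free."""
    if not (0 <= s <= n and 1 <= k <= n):
        raise ValueError("need 0 <= s <= n and 1 <= k <= n")
    # comb(s, k) is 0 when k > s, so the result is 0.0 in that case.
    return comb(s, k) / comb(n, k)


def vulnerable_at_k(n: int, v: int, k: int) -> float:
    """Chance that k samples drawn from n contain AT LEAST ONE vulnerable
    sample, given that v of the n samples are vulnerable."""
    if not (0 <= v <= n and 1 <= k <= n):
        raise ValueError("need 0 <= v <= n and 1 <= k <= n")
    return 1.0 - comb(n - v, k) / comb(n, k)


def mean_over_prompts(per_prompt_scores: list[float]) -> float:
    """Benchmark-level score: plain average of the per-prompt estimates."""
    return sum(per_prompt_scores) / len(per_prompt_scores)


# Hypothetical counts: 10 samples per prompt, (secure, vulnerable) per prompt.
counts = [(7, 3), (10, 0), (4, 6)]
secure_at_1 = mean_over_prompts([secure_at_k(10, s, 1) for s, _ in counts])
vulnerable_at_1 = mean_over_prompts([vulnerable_at_k(10, v, 1) for _, v in counts])
```

A higher secure@k is better and a lower vulnerable@k is better, so the two should be read alongside pass@k rather than in place of it.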

Problem Statement

Current code-generation benchmarks focus on functional correctness and use prompts that do not reflect security-sensitive engineering tasks. Their evaluation metrics ignore security, so a model can score well while producing code that becomes exploitable once it is integrated into a real project.

Main Contribution

A framework (Sallm) that automates security-focused benchmarking for code LLMs.

A curated dataset of 100 Python prompts labeled with CWE IDs and runnable insecure examples.

Key Findings

Repair component greatly increases executability of model outputs.

Numbers: compilation rate from 15% to 75% (avg); GPT-4 from <1% to 89%

Practical Use: Add simple post-processing rules to fix formatting and missing prompt headers before testing to recover most runnable outputs.

Evidence Ref: Section 5.2.1, Fig. 3
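A rule-based repair step of this kind might look like the sketch below. The two rules shown (unwrap a Markdown code fence, and prepend the prompt when the model returned only a function body) are assumptions chosen for illustration; they are not the actual rule set used by Sallm. The helper names `compiles` and `repair` are hypothetical.

```python
import re

# Matches the first Markdown code fence, with an optional "python"/"py" tag.
FENCE_RE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def compiles(source: str) -> bool:
    """Return True if the text parses as a Python module (nothing is executed)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def repair(prompt: str, completion: str) -> str:
    """Apply two illustrative repair rules and return the first candidate
    that compiles; fall back to the extracted code unchanged."""
    # Rule 1 (assumed): chat models often wrap code in a Markdown fence.
    match = FENCE_RE.search(completion)
    code = match.group(1) if match else completion

    # Rule 2 (assumed): completion models may return only the function body,
    # so try again with the prompt (imports, signature, docstring) prepended.
    with_header = prompt.rstrip("\n") + "\n" + code

    for candidate in (code, with_header):
        if compiles(candidate):
            return candidate
    return code
```

Because a repaired sample that compiles can still be logically wrong, the repaired output should still go through the unit tests and static analysis.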

Functional correctness (pass@k) varies widely across models and temperatures.

Numbers: pass@k ranged 5.5%–54.8% across models/temps

Practical Use: Do not assume a model is high quality by default; measure pass@k on your own prompts and tune temperature for your use case.

Evidence Ref: Section 5.2.2, Table 2
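Measuring pass@k on your own prompts across temperatures can be sketched as follows. The estimator is the widely used unbiased pass@k in its numerically stable product form, where n is the number of samples per prompt and c the number that pass the tests. The sweep layout, the helper names, and the counts are hypothetical and only illustrate the bookkeeping.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one prompt: 1 - C(n-c, k) / C(n, k),
    computed as a running product to avoid large binomials."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every k-subset contains at least one passing sample
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over prompts; counts holds (n, c) per prompt."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


def best_temperature(sweep: dict[float, list[tuple[int, int]]], k: int) -> float:
    """Pick the temperature with the highest mean pass@k."""
    return max(sweep, key=lambda t: mean_pass_at_k(sweep[t], k))


# Hypothetical sweep: temperature -> (samples, passing samples) per prompt.
sweep = {
    0.2: [(10, 3), (10, 0)],
    0.8: [(10, 5), (10, 1)],
}
chosen = best_temperature(sweep, k=1)
```

The temperature that maximizes pass@k is not necessarily the one that maximizes secure@k, so it is worth running the same sweep for both metrics.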

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	100 Python prompts	SecurityEval 121 prompts; LLMSecEval 150 (83 Python)	—	—	Section 3.1; Table 1	Table 1
CWE coverage	45 CWE types	LLMSecEval 18 CWEs; SecurityEval 69 CWEs	2.5× more than LLMSecEval	—	Section 5.1.1; Table 1	Table 1

What To Try In 7 Days

Run Sallm (or similar) on your top models to measure secure@k and pass@k on representative prompts

Add a simple repair step to post-process model outputs and increase runnable rates before tests

Combine static analysis (CodeQL) and unit tests in sandboxed Docker to catch common vulnerabilities automatically
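The last item can be sketched as a small harness that builds the commands for a sandboxed test run and a CodeQL scan, then reads the findings from the SARIF output. This is a sketch under stated assumptions: the image name, directory layout, and helper names are hypothetical, and the Docker and CodeQL flags shown should be checked against the versions you have installed. It does not reproduce Sallm's own pipeline.

```python
import json
from pathlib import Path


def docker_test_cmd(sample_dir: Path, image: str = "python:3.11-slim") -> list[str]:
    """Build (but do not run) a docker command that runs unit tests offline."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                       # no network for generated code
        "-v", f"{sample_dir.resolve()}:/work:ro",  # mount the sample read-only
        "-w", "/work",
        image,
        "python", "-B", "-m", "unittest", "discover", "-s", "/work",
    ]


def codeql_cmds(sample_dir: Path, db_dir: Path, sarif_out: Path) -> list[list[str]]:
    """Build the two CodeQL CLI calls: create a database, then analyze it.
    No query suite is named, so the CLI falls back to its default queries."""
    return [
        ["codeql", "database", "create", str(db_dir),
         "--language=python", f"--source-root={sample_dir}"],
        ["codeql", "database", "analyze", str(db_dir),
         "--format=sarif-latest", f"--output={sarif_out}"],
    ]


def flagged_rules(sarif_text: str) -> list[str]:
    """List the rule ID of every result in a SARIF document."""
    doc = json.loads(sarif_text)
    return [
        result.get("ruleId", "<unknown>")
        for run in doc.get("runs", [])
        for result in run.get("results", [])
    ]


def verdict(tests_passed: bool, rule_ids: list[str]) -> dict:
    """Combine the dynamic and static signals for one generated sample."""
    return {
        "functional": tests_passed,
        "flagged": bool(rule_ids),
        "rules": sorted(set(rule_ids)),
    }
```

Each command list can be passed to `subprocess.run` with a timeout; treating a timeout or non-zero exit code as a failed test keeps runaway generated code from stalling the benchmark.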

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Dataset covers only Python prompts, so results may not generalize to other languages

Static analysis (CodeQL) can report false positives/negatives; dynamic tests depend on test coverage

When Not To Use

When you need multi-language security benchmarking beyond Python

For large integrated codebases where single-file unit tests cannot model system interactions

Failure Modes

Rule-based repair can mask generation issues or create syntactically valid but logically incorrect code

Static analyzer misses novel vulnerability patterns or flags benign constructs

Core Entities

Models

CodeGen-2B-mono, CodeGen-2.5-7B-mono, StarCoder, GPT-3.5-Turbo, GPT-4

Metrics

pass@k, secure@k, vulnerable@k

Datasets

Sallm dataset (100 Python prompts), SecurityEval, LLMSecEval, HumanEval

Benchmarks

Sallm
