Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.45
Citation Count
6
Why It Matters For Business
Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.
Summary TLDR
Sallm is a framework for benchmarking how well code-generating LLMs produce secure Python code. It ships a curated dataset of 100 Python prompts (45 CWE types), runnable tests in Docker, a rule-based repair step that raises compilation from ~15% to ~75%, static (CodeQL) and dynamic (unit tests) assessments, and two new metrics: secure@k (all top-k samples are vulnerability-free) and vulnerable@k (at least one of the top-k samples is vulnerable). Experiments on five models (CodeGen variants, StarCoder, GPT-3.5, GPT-4) show trade-offs: GPT-4 is best at functional correctness (pass@k up to ~55%) but not the most secure; CodeGen-2.5-7B strikes the best correctness/security balance; StarCoder produces fewer flagged vulnerabilities but has low functional correctness.
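The metric definitions above can be made concrete with a small sketch. It assumes a sampling-based estimator modeled on the standard unbiased pass@k estimator: for each prompt, n samples are generated, c of them pass the tests and v of them are flagged as vulnerable. The function names and the per-prompt averaging are illustrative assumptions; the summary gives only the informal definitions, so the exact formulas used by Sallm may differ.

```python
from math import comb
from statistics import mean


def _check(n: int, count: int, k: int) -> None:
    # Guard against inputs for which the combinatorial estimate is undefined.
    if not (0 < k <= n and 0 <= count <= n):
        raise ValueError("require 0 < k <= n and 0 <= count <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples drawn from n is correct (c correct)."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def vulnerable_at_k(n: int, v: int, k: int) -> float:
    """Chance that at least one of k samples drawn from n is vulnerable (v vulnerable)."""
    _check(n, v, k)
    return 1.0 - comb(n - v, k) / comb(n, k)


def secure_at_k(n: int, v: int, k: int) -> float:
    """Chance that all k samples drawn from n are vulnerability-free."""
    _check(n, v, k)
    return comb(n - v, k) / comb(n, k)


def benchmark_score(metric, per_prompt_counts, n: int, k: int) -> float:
    """Average a per-prompt metric over all prompts (assumed aggregation)."""
    return mean(metric(n, count, k) for count in per_prompt_counts)
```

Under these assumed formulas, secure@k and vulnerable@k for the same prompt sum to 1, so reporting both is mainly a matter of emphasis. A call such as `benchmark_score(secure_at_k, [0, 2, 10], n=10, k=3)` averages the per-prompt values across three prompts.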
Problem Statement
Current code-generation benchmarks focus on functional correctness and use prompts that do not reflect security-sensitive engineering tasks. Evaluation metrics ignore security, so models may appear good while producing exploitable code when integrated into real projects.
Main Contribution
A framework (Sallm) that automates security-focused benchmarking for code LLMs.
A curated dataset of 100 Python prompts labeled with CWE IDs and runnable insecure examples.
Automated assessment combining static analysis (CodeQL) and dynamic testing in Docker.
Two new metrics, secure@k and vulnerable@k, plus a rule-based repair step that improves compilation rates.
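As a rough illustration of the dynamic half of the assessment, the sketch below builds a `docker run` command that executes unit tests against a generated file inside a locked-down container. The image name, the `/work` mount point, the memory limit, and the `test_generated` module name are all assumptions chosen for the example; they are not taken from Sallm itself.

```python
import subprocess
from pathlib import Path
from typing import List


def docker_test_command(
    workdir: Path,
    image: str = "python:3.11-slim",      # assumed image, not Sallm's
    test_module: str = "test_generated",  # hypothetical test module name
) -> List[str]:
    """Build a docker command that runs unittest in an isolated container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access for untrusted code
        "--memory", "512m",                     # cap memory use
        "-v", f"{workdir.resolve()}:/work:ro",  # mount the code read-only
        "-w", "/work",
        image,
        "python", "-B", "-m", "unittest", test_module,
    ]


def run_in_sandbox(workdir: Path, timeout_s: int = 60) -> bool:
    """Return True if the tests pass; a timeout counts as a failure."""
    try:
        result = subprocess.run(
            docker_test_command(workdir),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Keeping command construction separate from execution makes the sandbox settings easy to review and to unit test without a Docker daemon. A static analysis pass such as CodeQL would run as a separate step over the same files.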
Key Findings
The repair component greatly increases the executability of model outputs, raising compilation from ~15% to ~75%.
Functional correctness (pass@k) varies widely across models and temperatures.
Security scores and correctness do not align; the model that is best at one may not be best at the other.
Model trade-offs emerged: CodeGen-2.5-7B balanced security and correctness best on these prompts.
StarCoder produced fewer flagged vulnerabilities but had low functional correctness.
Results
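Dataset size
100 Python prompts
CWE coverage
45 CWE types
Compilation / executability
~15% before repair, ~75% after repair
Functional correctness (pass@k)
Up to ~55% (GPT-4)
Vulnerability rates (vulnerable@k)
Best correctness-security balance
CodeGen-2.5-7B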
Who Should Care
What To Try In 7 Days
Run Sallm (or similar) on your top models to measure secure@k and pass@k on representative prompts
Add a simple repair step to post-process model outputs and increase runnable rates before tests
Combine static analysis (CodeQL) and unit tests in sandboxed Docker to catch common vulnerabilities automatically
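A simple repair step of the kind suggested above might look like the following sketch. The two rules shown (keep only the first fenced code block, then drop trailing lines until the source parses) are illustrative assumptions and are not Sallm's actual repair rules; the function names are hypothetical.

```python
import ast
import re
from typing import Optional

# Matches a Markdown code fence (three backticks) with an optional language tag.
FENCE = re.compile(r"`{3}(?:python|py)?\s*\n(.*?)`{3}", re.DOTALL | re.IGNORECASE)


def compiles(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def repair(model_output: str) -> Optional[str]:
    """Try to turn raw model output into parseable Python, or return None."""
    # Rule 1 (assumed): if the output contains a fenced block, keep only the first one.
    match = FENCE.search(model_output)
    code = match.group(1) if match else model_output

    # Rule 2 (assumed): drop trailing lines until the remainder parses,
    # which discards a truncated final statement or trailing chatter.
    lines = code.rstrip().splitlines()
    while lines:
        candidate = "\n".join(lines) + "\n"
        if compiles(candidate):
            return candidate
        lines.pop()
    return None
```

A repaired file only parses; it is not guaranteed to do what the prompt asked, so repaired outputs still need to go through the unit tests and the static analyzer. Logging which rule fired for each sample helps distinguish real model improvements from repair effects.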
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset covers only Python prompts, so results may not generalize to other languages
- Static analysis (CodeQL) can report false positives/negatives; dynamic tests depend on test coverage
- Prompts were manually crafted by the authors; selection and mapping to CWEs may reflect author choices
When Not To Use
- When you need multi-language security benchmarking beyond Python
- For large integrated codebases where single-file unit tests cannot model system interactions
- If you require formal proofs of absence of vulnerabilities
Failure Modes
- Rule-based repair can mask generation issues or create syntactically valid but logically incorrect code
- Static analyzer misses novel vulnerability patterns or flags benign constructs
- Unit tests may not cover all attack vectors, giving a false sense of security
Core Entities
Models
- CodeGen-2B-mono
- CodeGen-2.5-7B-mono
- StarCoder
- GPT-3.5-Turbo
- GPT-4
Metrics
- pass@k
- secure@k
- vulnerable@k
Datasets
- Sallm dataset (100 Python prompts)
- SecurityEval
- LLMSecEval
- HumanEval
Benchmarks
- Sallm

