Overview
Sallm provides a practical dataset, runnable tests, and new metrics, but it covers only Python and relies on CodeQL (static analysis) plus unit tests; findings are well supported within those bounds.
Citations: 6
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 45%
Production readiness: 60%
Novelty: 65%
Why It Matters For Business
Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.
Who Should Care
Summary TLDR
Sallm is a framework to benchmark how well code-generating LLMs produce secure Python code. It ships a curated dataset of 100 Python prompts (45 CWE types), runnable tests in Docker, a rule-based repair step that raises compilation from ~15% to ~75%, static (CodeQL) and dynamic (unit tests) assessments, and two new metrics: secure@k (all top-k are vulnerability-free) and vulnerable@k (at least one vulnerable sample in top-k). Experiments on five models (CodeGen variants, StarCoder, GPT-3.5, GPT-4) show trade-offs: GPT-4 is best at functional correctness (pass@k up to ~55%) but not the most secure; CodeGen-2.5-7B strikes the best correctness/security balance; StarCoder produces fewer flagged v…
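The two metrics can be made concrete with a short sketch. It assumes the estimators take the same combinatorial form as the familiar pass@k estimator: n samples are generated per prompt, k are drawn without replacement, and each sample is labeled either secure or vulnerable. The summary above gives only the verbal definitions, so the function names and the exact formulas here are assumptions, not the paper's code.

```python
from math import comb


def secure_at_k(n: int, s: int, k: int) -> float:
    """Chance that k samples drawn from n are ALL secure, given s secure samples.

    Assumed form: C(s, k) / C(n, k), by analogy with pass@k.
    """
    if not (0 <= s <= n and 1 <= k <= n):
        raise ValueError("require 0 <= s <= n and 1 <= k <= n")
    return comb(s, k) / comb(n, k)


def vulnerable_at_k(n: int, v: int, k: int) -> float:
    """Chance that AT LEAST ONE of k drawn samples is vulnerable, given v vulnerable.

    Assumed form: 1 - C(n - v, k) / C(n, k).
    """
    if not (0 <= v <= n and 1 <= k <= n):
        raise ValueError("require 0 <= v <= n and 1 <= k <= n")
    return 1.0 - comb(n - v, k) / comb(n, k)


def mean_over_prompts(per_prompt_scores: list[float]) -> float:
    """Benchmark-level score: plain average of the per-prompt values."""
    return sum(per_prompt_scores) / len(per_prompt_scores)
```

Under this sketch the two values sum to 1 for a prompt whenever every sample is labeled one way or the other (s = n - v). Samples that cannot be analyzed at all would need a separate convention.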
Problem Statement
Current code-generation benchmarks focus on functional correctness and use prompts that do not reflect security-sensitive engineering tasks. Evaluation metrics ignore security, so models may appear good while producing exploitable code when integrated into real projects.
Main Contribution
A framework (Sallm) that automates security-focused benchmarking for code LLMs.
A curated dataset of 100 Python prompts labeled with CWE IDs and runnable insecure examples.
Key Findings
The rule-based repair step raises the compilation rate of model outputs from ~15% to ~75%.
Functional correctness (pass@k) varies widely across models and temperatures; GPT-4 scores highest, with pass@k up to ~55%.
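To illustrate what a rule-based repair step might look like, here is a minimal sketch. The two rules (unwrap a markdown fence, then drop trailing lines until the code parses) are hypothetical examples of this kind of post-processing; the summary does not list Sallm's actual rules, and `repair` and `compiles` are illustrative names.

```python
import ast
import re
from typing import Optional

# Matches the first markdown-fenced block, with or without a language tag.
FENCE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def compiles(source: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def repair(raw: str) -> Optional[str]:
    """Apply two hypothetical repair rules; return parseable code or None."""
    # Rule 1: if the model wrapped its answer in a markdown fence,
    # keep only the contents of the first fenced block.
    match = FENCE.search(raw)
    code = match.group(1) if match else raw

    # Rule 2: drop trailing lines one at a time until the rest parses.
    # This handles outputs that were cut off mid-statement.
    lines = code.rstrip().splitlines()
    while lines:
        candidate = "\n".join(lines) + "\n"
        if compiles(candidate):
            return candidate
        lines.pop()
    return None
```

Truncation only guarantees that the result parses. It can silently discard logic, so repaired samples still need the functional and security checks.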
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 100 Python prompts | SecurityEval 121 prompts; LLMSecEval 150 (83 Python) | — | — | Section 3.1; Table 1 | Table 1 |
| CWE coverage | 45 CWE types | LLMSecEval 18 CWEs; SecurityEval 69 CWEs | 2.5× LLMSecEval (45 vs. 18); fewer than SecurityEval (45 vs. 69) | — | Section 5.1.1; Table 1 | Table 1 |
What To Try In 7 Days
Run Sallm (or similar) on your top models to measure secure@k and pass@k on representative prompts
Add a simple repair step to post-process model outputs and increase runnable rates before tests
Combine static analysis (CodeQL) and unit tests in sandboxed Docker to catch common vulnerabilities automatically
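The last step can be sketched as follows. The code only builds a `docker run` command for executing unit tests in an isolated container and combines the static and dynamic outcomes into one verdict. The image name, the `tests` directory layout, the memory limit, and the "flag if either check fails" rule are all assumptions for illustration; running CodeQL itself is left out and its alert count is treated as an input.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SampleResult:
    codeql_alerts: int   # alerts reported by static analysis (assumed input)
    tests_passed: bool   # whether the security unit tests passed


def docker_test_command(workdir: Path, image: str = "python:3.11-slim") -> list[str]:
    """Build a docker command that runs unittest discovery in a sandbox.

    Assumes the generated code and a `tests/` folder live under `workdir`.
    """
    return [
        "docker", "run", "--rm",                 # remove the container afterwards
        "--network", "none",                     # no network access for untrusted code
        "--memory", "512m",                      # cap memory use (illustrative limit)
        "-v", f"{workdir.resolve()}:/work:ro",   # mount the sample read-only
        "-w", "/work",
        image,
        "python", "-m", "unittest", "discover", "-s", "tests",
    ]


def is_vulnerable(result: SampleResult) -> bool:
    """Conservative rule (an assumption): flag if either check objects."""
    return result.codeql_alerts > 0 or not result.tests_passed
```

The command list can be passed to `subprocess.run(cmd, timeout=...)`, with a zero exit code recorded as `tests_passed=True`. A stricter or looser combination rule may suit a given risk tolerance better.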
Risks & Boundaries
Limitations
Dataset covers only Python prompts, so results may not generalize to other languages
Static analysis (CodeQL) can report false positives/negatives; dynamic tests depend on test coverage
When Not To Use
When you need multi-language security benchmarking beyond Python
For large integrated codebases where single-file unit tests cannot model system interactions
Failure Modes
Rule-based repair can mask generation issues or create syntactically valid but logically incorrect code
Static analyzer misses novel vulnerability patterns or flags benign constructs

