Execution-driven, real-world benchmark for secure code generation across 5 languages

Overview

Decision Snapshot: Ready For Pilot

The benchmark is engineered for production-like evaluation (Docker, real cases, severity weighting). It is not a formal verification tool and relies on curated PoCs and LLM-based judging for some cases.

Citations: 0

Evidence Strength: 0.75

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/0

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 40%

Authors

Longfei Chen, Ji Zhao, Lanxiao Cui, Tong Su, Xingbo Pan, Ziyang Li, Yongxing Wu, Qijiang Cao, Qiyao Cai, Jing Zhang, Yuandong Ni, Junyao He, Zeyu Zhang, Chao Ge, Xuhuai Lu, Zeyu Gao, Yuxin Cui, Weisen Chen, Yuxuan Peng, Shengping Wang, Qi Li, Yukai Huang, Yukun Liu, Tuo Zhou, Terry Yue Zhuo, Junyang Lin, Chao Zhang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SecCodeBench-V2 gives a realistic, reproducible way to compare LLMs on both usable code and real exploitability. It helps teams pick models, measure regressions, and focus efforts where security failures matter most.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

SecCodeBench-V2 is a public benchmark and evaluation pipeline for measuring whether LLM-powered coding assistants generate or repair secure code. It contains 98 real, de-identified vulnerability cases across Java, C/C++, Python, Go, and JavaScript. Each case provides a full project scaffold, functional tests, and security PoC tests run in Docker sandboxes. For semantics-heavy checks the framework uses an LLM-as-a-judge. Scoring uses Pass@K (default K=1) with severity- and scenario-aware weighting. The dataset and tooling are released on GitHub.
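As a rough illustration of the Pass@K metric mentioned above, the sketch below uses the unbiased estimator that is common in code-generation evaluation: with n samples per case, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether SecCodeBench-V2 computes Pass@K with exactly this estimator is an assumption; the function name pass_at_k is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed.

    Assumption: this is the widely used estimator, not necessarily
    the exact formula implemented by the benchmark.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With the default K=1 the estimate reduces to the pass rate c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

With K=1 and a single sample per case, the value is simply 1.0 for a pass and 0.0 for a failure.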

Problem Statement

Existing secure-code benchmarks are often small, prompt-centric, and rely on static checks or public examples that risk contamination and poor realism. They commonly collapse usability and security into coarse metrics, miss runtime-only exploits, and lack scenario-aware scoring needed for enterprise decisions.

Main Contribution

A public benchmark (SecCodeBench-V2) of 98 function-level secure-code tasks derived from de-identified real internal vulnerabilities.

An execution-driven evaluation pipeline that compiles and runs model outputs inside Docker sandboxes and executes PoC tests for both functionality and exploitability.
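The execution-driven flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the benchmark's actual runner: the image name, mount point, resource limits, and test commands are hypothetical, and only standard docker run flags are used. A case is treated as passed only when both the functional tests and the security PoC tests succeed.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CaseResult:
    functional_ok: bool  # functional tests passed
    security_ok: bool    # security PoC tests passed (exploit blocked)

    @property
    def passed(self) -> bool:
        # Usable but exploitable code, or safe but broken code, both fail.
        return self.functional_ok and self.security_ok


def docker_cmd(image: str, project_dir: str, test_cmd: list[str]) -> list[str]:
    """Build a `docker run` command for one test stage.

    project_dir must be an absolute host path for the bind mount.
    The /work mount point and the limits are hypothetical choices.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",   # no network access inside the sandbox
        "--memory", "1g",      # cap container memory
        "-v", f"{project_dir}:/work",
        "-w", "/work",
        image, *test_cmd,
    ]


def run_in_sandbox(cmd: list[str], timeout_s: int = 300) -> bool:
    """Run a prepared command; True when it exits with status 0.

    On timeout only the docker CLI process is killed; the container
    itself may need separate cleanup.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return done.returncode == 0
```

A caller would build one command for the functional tests and one for the PoC tests, run both through run_in_sandbox, and combine the two booleans in a CaseResult.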

Key Findings

Benchmark size and scope

Numbers: 98 cases; 22 CWE types; 5 languages

Practical Use: Use this benchmark to test cross-language secure generation and repair on realistic, diverse industrial issues.

Evidence Ref: abstract, §4.2, Table 4

Severity distribution and weighted scoring

Numbers: Critical 34; High 49; Medium 15; weights = 4/2/1

Practical Use: Scoring emphasizes critical vulnerabilities, so prioritize fixing critical-case failures when evaluating models for production.

Evidence Ref: Table 3; §5
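To make the effect of the 4/2/1 weights concrete, here is a small worked sketch using the case counts above. It assumes the weighted score is a weighted pass rate (earned weight divided by total weight); the exact formula, including the scenario-aware part of the weighting, is not given in this summary, so treat the normalization as an assumption.

```python
# Case counts and weights as listed in the finding above.
COUNTS = {"critical": 34, "high": 49, "medium": 15}
WEIGHTS = {"critical": 4, "high": 2, "medium": 1}


def unweighted_score(passed: dict[str, int]) -> float:
    """Plain pass rate over all 98 cases."""
    return sum(passed.get(s, 0) for s in COUNTS) / sum(COUNTS.values())


def weighted_score(passed: dict[str, int]) -> float:
    """Severity-weighted pass rate (assumed normalization)."""
    total = sum(WEIGHTS[s] * COUNTS[s] for s in COUNTS)  # 34*4 + 49*2 + 15*1 = 249
    earned = sum(WEIGHTS[s] * passed.get(s, 0) for s in COUNTS)
    return earned / total


if __name__ == "__main__":
    # Hypothetical model: passes every High and Medium case, half the Critical ones.
    passed = {"critical": 17, "high": 49, "medium": 15}
    print(round(unweighted_score(passed), 3))  # 81 / 98
    print(round(weighted_score(passed), 3))    # 181 / 249
```

In this hypothetical, the unweighted score is about 0.827 while the weighted score is about 0.727, which shows how failures on Critical cases pull the weighted number down faster.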

What To Try In 7 Days

Run the benchmark's gen and fix scenarios on your model or API to spot language-specific blind spots.

Prioritize fixes for cases that fail Critical-severity tests (weight 4) to reduce high-impact risk quickly.

Add your own PoC tests for organization-specific threats and plug them into the provided pipeline (Docker-ready).
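For the last step, a custom PoC test pairs a functional check with an exploit attempt against the same function. The sketch below shows only the general shape, using a path traversal case; the function resolve_upload_path, the payload list, and the test layout are hypothetical, and the benchmark's actual test harness format should be taken from its repository.

```python
import os
import tempfile


def resolve_upload_path(base_dir: str, user_name: str) -> str:
    """Hypothetical function under test: map a user-supplied name into base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_name))
    # Reject any resolved path that is not located under base.
    if os.path.commonpath([base, target]) != base:
        raise ValueError("path escapes base directory")
    return target


def functional_test(base_dir: str) -> bool:
    """Benign input must still work."""
    expected = os.path.join(os.path.realpath(base_dir), "report.txt")
    return resolve_upload_path(base_dir, "report.txt") == expected


def security_poc(base_dir: str) -> bool:
    """Return True when every exploit attempt is blocked."""
    for payload in ("../../etc/passwd", "/etc/passwd", "a/../../b"):
        try:
            resolve_upload_path(base_dir, payload)
        except ValueError:
            continue  # blocked, try the next payload
        return False  # a payload escaped the base directory
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("functional:", functional_test(d), "security:", security_poc(d))
```

Keeping the functional check next to the PoC guards against "fixes" that block the exploit by breaking the feature.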

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/alibaba/sec-code-bench

https://alibaba.github.io/sec-code-bench

Data URLs

https://github.com/alibaba/sec-code-bench

Risks & Boundaries

Limitations

Passing PoC tests does not prove absence of all vulnerabilities; untested inputs may still exploit code.

LLM-as-a-judge can introduce semantic bias and depends on judge-model quality and prompts.

When Not To Use

When you need formal verification or proof of absence of vulnerabilities.

For repository-level, multi-file patch workflows that require broader context than single-function tasks.

Failure Modes

Model produces functionally incorrect code that appears 'safe' under static checks but fails at runtime.

Judge LLM panel disagrees or is systematically biased, causing noisy security labels.
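One way to surface the panel-disagreement failure mode is to record agreement alongside each verdict and route low-agreement cases to manual review. The sketch below uses a simple majority vote with a hypothetical two-thirds threshold; how the benchmark actually aggregates judge outputs is not stated in this summary, so the rule and the labels are assumptions.

```python
from collections import Counter


def panel_verdict(votes: list[str], min_agreement: float = 2 / 3) -> tuple[str, float]:
    """Aggregate judge labels by majority vote.

    Returns (label, agreement). When the share of the most common label
    is below min_agreement (a hypothetical threshold), the label is
    "needs_review" instead of the majority label.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return "needs_review", agreement
    return label, agreement


if __name__ == "__main__":
    print(panel_verdict(["secure", "secure", "vulnerable"]))
    print(panel_verdict(["secure", "vulnerable"]))
```

Note that this only catches noisy disagreement. A panel that is systematically biased in the same direction will agree with itself, so spot checks against human labels are still needed.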

Core Entities

Metrics

Pass@K, Pass@1, severity-weighted score, unweighted score

Datasets

SecCodeBench-V2

Context Entities

Metrics

CVSS

Datasets

ZeroSecBench, SecureAgentBench, A.S.E, SafeGenBench, DUALGUAGE, PATCHEVAL

Benchmarks

CodeLMSec
