Execution-driven, real-world benchmark for secure code generation across 5 languages

February 17, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

0

Authors

Longfei Chen, Ji Zhao, Lanxiao Cui, Tong Su, Xingbo Pan, Ziyang Li, Yongxing Wu, Qijiang Cao, Qiyao Cai, Jing Zhang, Yuandong Ni, Junyao He, Zeyu Zhang, Chao Ge, Xuhuai Lu, Zeyu Gao, Yuxin Cui, Weisen Chen, Yuxuan Peng, Shengping Wang, Qi Li, Yukai Huang, Yukun Liu, Tuo Zhou, Terry Yue Zhuo, Junyang Lin, Chao Zhang

Why It Matters For Business

SecCodeBench-V2 gives a realistic, reproducible way to compare LLMs on both usable code and real exploitability. It helps teams pick models, measure regressions, and focus efforts where security failures matter most.

Summary TLDR

SecCodeBench-V2 is a public benchmark and evaluation pipeline for measuring whether LLM-powered coding assistants generate secure code and repair vulnerable code. It contains 98 real, de-identified vulnerability cases across Java, C/C++, Python, Go, and JavaScript. Each case provides a full project scaffold, functional tests, and security PoC tests that run in Docker sandboxes. For semantics-heavy checks, the framework uses an LLM-as-a-judge. Scoring uses Pass@K (default K=1) with severity- and scenario-aware weighting. The dataset and tooling are released on GitHub.

Problem Statement

Existing secure-code benchmarks are often small, prompt-centric, and rely on static checks or public examples that risk contamination and poor realism. They commonly collapse usability and security into coarse metrics, miss runtime-only exploits, and lack scenario-aware scoring needed for enterprise decisions.

Main Contribution

A public benchmark (SecCodeBench-V2) of 98 function-level secure-code tasks derived from de-identified real internal vulnerabilities.

An execution-driven evaluation pipeline that compiles and runs model outputs inside Docker sandboxes and executes PoC tests for both functionality and exploitability.

A hybrid adjudication method: deterministic unit tests where possible plus LLM-as-a-judge (majority vote) for semantics-heavy cases.

A Pass@K-based scoring protocol (default K=1) with principled severity (Critical/High/Medium) and scenario (gen/fix and hint variants) weighting.

Full project templates, per-case functional and security tests, and a modular controller for multi-round evaluation, logging, and reproducibility. Public release on GitHub.
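The hybrid adjudication idea above can be sketched as follows. This is a minimal illustration, not the benchmark's actual controller: the function names `majority_vote` and `adjudicate`, the boolean inputs, and the tie-breaking rule (a tie counts as insecure) are all assumptions made for the example.

```python
from typing import Optional, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True only if a strict majority of judge votes says 'secure'.

    Assumption: ties are treated conservatively as 'insecure'.
    """
    if not votes:
        raise ValueError("at least one judge vote is required")
    secure_votes = sum(1 for v in votes if v)
    return secure_votes * 2 > len(votes)


def adjudicate(
    functional_ok: bool,
    poc_blocked: Optional[bool],
    judge_votes: Sequence[bool] = (),
) -> bool:
    """Hypothetical hybrid verdict for one model output.

    functional_ok: did the functional tests pass?
    poc_blocked:   True/False if a deterministic PoC test gave a verdict,
                   None if the case is semantics-heavy and needs a judge.
    judge_votes:   one boolean per LLM judge (True means 'secure').
    """
    if not functional_ok:
        # Code that does not work cannot count as a secure solution.
        return False
    if poc_blocked is not None:
        # A deterministic test result takes precedence over any judge.
        return poc_blocked
    return majority_vote(judge_votes)


if __name__ == "__main__":
    print(adjudicate(True, None, [True, False, True]))
    print(adjudicate(True, False, [True, True, True]))
```

Giving the deterministic result precedence keeps judge noise out of every case where an executable check exists.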

Key Findings

Benchmark size and scope

Numbers: 98 cases; 22 CWE types; 5 languages

Severity distribution and weighted scoring

Numbers: Critical 34; High 49; Medium 15; weights = 4/2/1
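To make the 4/2/1 weighting concrete, here is a minimal sketch of a severity-weighted average. It assumes the score is a weighted mean of per-case pass rates; the benchmark's exact formula, including its scenario weights, is not given in this summary, so `weighted_score` and its input shape are hypothetical.

```python
from typing import Iterable, Tuple

# Severity weights as reported in the summary (Critical/High/Medium = 4/2/1).
SEVERITY_WEIGHTS = {"Critical": 4, "High": 2, "Medium": 1}


def weighted_score(results: Iterable[Tuple[str, float]]) -> float:
    """Weighted mean of per-case pass rates.

    results: (severity, pass_rate) pairs with pass_rate in [0, 1].
    Assumption: scenario weighting is omitted here for simplicity.
    """
    total_weight = 0
    earned = 0.0
    for severity, pass_rate in results:
        weight = SEVERITY_WEIGHTS[severity]
        total_weight += weight
        earned += weight * pass_rate
    if total_weight == 0:
        raise ValueError("no results to score")
    return earned / total_weight


if __name__ == "__main__":
    # Share of total weight held by each severity level in the reported
    # distribution of 34 Critical, 49 High, and 15 Medium cases.
    counts = {"Critical": 34, "High": 49, "Medium": 15}
    total = sum(SEVERITY_WEIGHTS[s] * n for s, n in counts.items())
    for severity, n in counts.items():
        print(severity, round(SEVERITY_WEIGHTS[severity] * n / total, 3))
```

Under this assumed formula, the 34 Critical cases carry 136 of 249 total weight units, a little over half, so failures on Critical cases dominate the weighted score.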

Execution-first evaluation with fallbacks

Numbers: Dynamic PoC tests in Docker; LLM-as-a-judge for semantics-heavy cases

Multi-scenario prompts and protocol

Numbers: 4 scenarios per case: gen, gen-hints, fix, fix-hints; retry r=3; R=10 rounds; Pass@K with K=1
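As a rough illustration of how Pass@K could be computed from repeated rounds, the sketch below uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of samples and c the number that pass. Treating the R=10 rounds as n=10 samples is an assumption, and how the retry budget r=3 enters the computation is not specified in this summary, so it is left out.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased Pass@K estimate from n samples, of which c passed.

    Assumption: each evaluation round contributes one sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With K=1 the estimate reduces to the plain pass fraction c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

With the default K=1, the estimator is simply the fraction of rounds whose output passed both the functional and the security checks.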

Who Should Care

What To Try In 7 Days

Run the benchmark's gen and fix scenarios on your model or API to spot language-specific blind spots.

Prioritize fixes for cases that fail Critical-severity tests (weight 4) to reduce high-impact risk quickly.

Add your own PoC tests for organization-specific threats and plug them into the provided pipeline (Docker-ready).

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Passing PoC tests does not prove the absence of all vulnerabilities; inputs not covered by the tests may still exploit the code.
  • LLM-as-a-judge can introduce semantic bias and depends on judge-model quality and prompts.
  • De-identification reduces but does not eliminate possible data-contamination risk for public LLMs.
  • Coverage depends on chosen PoCs; rare or chained exploit paths may be missed.

When Not To Use

  • When you need formal verification or proof of absence of vulnerabilities.
  • For repository-level, multi-file patch workflows that require broader context than single-function tasks.
  • If you need metrics that capture run-time traffic, load tests, or long-term runtime behavior beyond PoCs.

Failure Modes

  • Model produces functionally incorrect code that appears 'safe' under static checks but fails at runtime.
  • Judge LLM panel disagrees or is systematically biased, causing noisy security labels.
  • Environment differences cause tests to fail or pass spuriously (e.g., missing system binaries).
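One way to reduce the environment-related failure mode above is a preflight check before any test run. The sketch below is an illustration only: the helper `missing_binaries` and the list of required tools are hypothetical and are not part of the released pipeline.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_binaries(
    required: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `required` that cannot be found on PATH.

    `which` is injectable so the check can be tested without depending
    on the tools installed on the current machine.
    """
    return [name for name in required if which(name) is None]


if __name__ == "__main__":
    # Hypothetical tool list; adjust it to the languages being evaluated.
    required_tools = ["docker", "python3", "go", "node", "javac", "gcc"]
    missing = missing_binaries(required_tools)
    if missing:
        print("Missing tools, results may be spurious:", ", ".join(missing))
    else:
        print("All required tools found.")
```

Stopping a run early when a tool is missing keeps an environment fault from being recorded as a model failure.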

Core Entities

Metrics

  • Pass@K
  • Pass@1
  • severity-weighted score
  • unweighted score

Datasets

  • SecCodeBench-V2

Context Entities

Metrics

  • CVSS

Datasets

  • ZeroSecBench
  • SecureAgentBench
  • A.S.E
  • SafeGenBench
  • DUALGAUGE
  • PATCHEVAL

Benchmarks

  • CodeLMSec