Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
SecCodeBench-V2 gives a realistic, reproducible way to compare LLMs on both the functional correctness and the real exploitability of the code they produce. It helps teams pick models, measure regressions, and focus effort where security failures matter most.
Summary TLDR
SecCodeBench-V2 is a public benchmark and evaluation pipeline for measuring whether LLM-powered coding assistants generate or repair secure code. It contains 98 real, de-identified vulnerability cases across Java, C/C++, Python, Go, and JavaScript. Each case provides a full project scaffold, functional tests, and security PoC tests run in Docker sandboxes. For semantics-heavy checks the framework uses an LLM-as-a-judge. Scoring uses Pass@K (default K=1) with severity- and scenario-aware weighting. The dataset and tooling are released on GitHub.
Problem Statement
Existing secure-code benchmarks are often small and prompt-centric, and they rely on static checks or public examples, which invites data contamination and limits realism. They commonly collapse usability and security into coarse metrics, miss runtime-only exploits, and lack the scenario-aware scoring needed for enterprise decisions.
Main Contribution
- A public benchmark (SecCodeBench-V2) of 98 function-level secure-code tasks derived from de-identified real internal vulnerabilities.
- An execution-driven evaluation pipeline that compiles and runs model outputs inside Docker sandboxes and executes PoC tests for both functionality and exploitability.
- A hybrid adjudication method: deterministic unit tests where possible plus LLM-as-a-judge (majority vote) for semantics-heavy cases.
- A Pass@K-based scoring protocol (default K=1) with principled severity (Critical/High/Medium) and scenario (gen/fix and hint variants) weighting.
- Full project templates, per-case functional and security tests, and a modular controller for multi-round evaluation, logging, and reproducibility. Public release on GitHub.
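To make the scoring idea concrete, here is a minimal sketch of a severity-weighted Pass@K aggregate. It assumes the widely used unbiased Pass@K estimator and a simple weighted mean over cases; the benchmark's exact formula may differ. The Critical weight of 4 comes from this summary, while the High and Medium weights, the `SEVERITY_WEIGHTS` table, and both function names are illustrative assumptions. Scenario weighting is left out for brevity.

```python
from math import comb

# Critical = 4 is stated in this summary; High and Medium are assumed values.
SEVERITY_WEIGHTS = {"critical": 4.0, "high": 2.0, "medium": 1.0}


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased Pass@K estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k draw contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def weighted_score(results, k=1, weights=SEVERITY_WEIGHTS):
    """Weighted mean of per-case Pass@K.

    results: iterable of (severity, n_samples, n_passed) tuples, where a
    sample counts as passed only if it clears both the functional and the
    security tests (an assumption about how a "pass" is defined).
    """
    total = 0.0
    weight_sum = 0.0
    for severity, n, c in results:
        w = weights[severity]
        total += w * pass_at_k(n, c, k)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# One sample per case (K=1): the Critical case fails, the other two pass.
results = [("critical", 1, 0), ("high", 1, 1), ("medium", 1, 1)]
print(weighted_score(results))  # (0*4 + 1*2 + 1*1) / 7 = 3/7
```

Under these assumed weights the model scores 3/7 rather than the unweighted 2/3, which shows how a single Critical failure pulls the aggregate down.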
Key Findings
Benchmark size and scope
Severity distribution and weighted scoring
Execution-first evaluation with fallbacks
Multi-scenario prompts and protocol
Who Should Care
What To Try In 7 Days
- Run the benchmark's gen and fix scenarios on your model or API to spot language-specific blind spots.
- Prioritize fixes for cases that fail Critical-severity tests (weight 4) to reduce high-impact risk quickly.
- Add your own PoC tests for organization-specific threats and plug them into the provided pipeline (Docker-ready).
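When wiring in your own PoC tests, it can help to see what a sandboxed test invocation might look like. The sketch below only builds a `docker run` command line; it is not the benchmark's actual tooling. The image name, case directory, test command, and the `build_sandbox_command` helper are all hypothetical.

```python
import shlex


def build_sandbox_command(image, case_dir, test_cmd, memory="1g", cpus="1.0"):
    """Build a docker run command that executes test_cmd in an isolated container.

    case_dir must be an absolute host path. It is mounted read-only here;
    drop ":ro" if the tests need to write build artifacts into the workspace.
    """
    return [
        "docker", "run", "--rm",           # remove the container when it exits
        "--network", "none",               # no network access for the PoC
        "--memory", memory,                # cap memory use
        "--cpus", cpus,                    # cap CPU use
        "-v", f"{case_dir}:/workspace:ro",
        "-w", "/workspace",
        image,
        *test_cmd,
    ]


# Hypothetical image and paths, for illustration only.
cmd = build_sandbox_command(
    "my-org/poc-runner:latest",
    "/tmp/cases/example-case",
    ["python", "-m", "pytest", "-q", "tests/security"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, timeout=...)` so that a hanging exploit attempt is cut off and recorded as a failure rather than stalling the whole run.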
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Passing PoC tests does not prove absence of all vulnerabilities; untested inputs may still exploit code.
- LLM-as-a-judge can introduce semantic bias and depends on judge-model quality and prompts.
- De-identification reduces but does not eliminate possible data-contamination risk for public LLMs.
- Coverage depends on chosen PoCs; rare or chained exploit paths may be missed.
When Not To Use
- When you need formal verification or proof of absence of vulnerabilities.
- For repository-level, multi-file patch workflows that require broader context than single-function tasks.
- If you need metrics that capture live traffic, load tests, or long-term runtime behavior beyond PoCs.
Failure Modes
- Model produces code that looks 'safe' under static checks but is functionally incorrect at runtime.
- Judge LLM panel disagrees or is systematically biased, causing noisy security labels.
- Environment differences cause tests to fail or pass spuriously (e.g., missing system binaries).
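Judge disagreement is easier to manage when it is surfaced rather than silently resolved. The following is a minimal sketch of majority voting over a judge panel that reports an agreement ratio and returns "undecided" when there is no strict majority. The label strings, the `min_agreement` threshold, and the `majority_verdict` helper are assumptions for illustration, not the benchmark's implementation.

```python
from collections import Counter


def majority_verdict(votes, min_agreement=0.5):
    """Return (verdict, agreement) for a list of judge labels.

    The verdict is the most common label only if its share of votes is
    strictly greater than min_agreement; otherwise it is "undecided".
    """
    if not votes:
        return "undecided", 0.0
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement <= min_agreement:
        # Ties and split panels land here and can be sent for human review.
        return "undecided", agreement
    return top_label, agreement


print(majority_verdict(["secure", "vulnerable", "secure"]))
print(majority_verdict(["secure", "vulnerable"]))
```

Logging the agreement ratio per case makes it possible to spot cases where the panel is consistently split, which is one signal of noisy or biased security labels.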
Core Entities
Metrics
- Pass@K
- Pass@1
- severity-weighted score
- unweighted score
Datasets
- SecCodeBench-V2
Context Entities
Metrics
- CVSS
Datasets
- ZeroSecBench
- SecureAgentBench
- A.S.E
- SafeGenBench
- DUALGAUGE
- PATCHEVAL
Benchmarks
- CodeLMSec

