Find code security bugs while the developer types using transformer models

Overview

Decision Snapshot: Ready For Pilot

The paper combines offline benchmark gains with a live VSCode deployment and multi-model comparisons, giving moderate-to-strong evidence of practical impact.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 75%

Novelty: 45%

Authors

Aaron Chan, Anant Kharkar, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Alec Helyar, Eslam Kamal, Mohamed Elkamhawy, Neel Sundaresan

Why It Matters For Business

Catching vulnerabilities while code is being written shortens fix time and cost; the paper shows large reductions in vulnerable completions from code LMs and near-90% reduction in production JS edits when integrated into an editor.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The authors build DeepDevVuln, a transformer-based system that detects software vulnerabilities in incomplete code snippets as developers type ("EditTime"). They collect ~500K vulnerable examples from CodeQL runs on GitHub, train and compare zero-shot, few-shot, and fine-tuned variants on CodeBERT and Codex-family models, and deploy a low-latency CodeBERT-based detector in a VSCode extension. Fine-tuned CodeBERT gives the best precision/recall balance (≈59% precision, 63% recall on their PR test set). Zero-/few-shot on large instruct models give higher recall but more false positives (text-davinci-003 recall ≈78%). Filtering code-LM outputs with DeepDevVuln reduced vulnerable scenario rates—

Problem Statement

Most existing vulnerability detectors need complete, compilable code and run late (build/test time). That delay increases fix cost. The paper asks: can transformer models spot vulnerabilities in syntactically incomplete snippets while a developer is typing (EditTime)?

Main Contribution

A production-quality EditTime vulnerability detector that works on incomplete code snippets and runs with low latency.

A large training corpus derived from CodeQL runs on public GitHub (hundreds of thousands of examples) spanning seven languages and 250+ CWE types.

Key Findings

Fine-tuned CodeBERT (DeepDevVuln) has the best F1 balance on their GitHub PR test set.

Numbers: Precision 58.87%, Recall 63.00%, F1 60.87% (Table 3)

Practical Use: If you need a stable, low-FP EditTime detector, fine-tune a compact CodeBERT-style model and tune the threshold for your product goals.

Evidence Ref: Table 3; Sec. 3.4.3

Large instruction-tuned LLMs in zero-/few-shot mode get higher recall but many false positives.

Numbers: text-davinci-003 zero-shot: Recall 78%, Precision 47% (Table 3)

Practical Use: Use zero-/few-shot with large models when recall is critical and you can tolerate more alerts or add downstream filters.

Evidence Ref: Table 3; Sec. 3.4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
DeepDevVuln (CodeBERT fine-tuned) on GitHub PR dataset	Precision 58.87%, Recall 63.00%, F1 60.87%	—	—	GitHub PR dataset (1,006 examples)	Table 3; Sec. 3.4.3	Table 3
text-davinci-003 zero-shot on GitHub PR dataset	Precision 46.99%, Recall 78.00%, F1 58.65%	—	—	GitHub PR dataset (1,006 examples)	Table 3; Sec. 3.4.3	Table 3
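As a quick consistency check on the rows above, F1 is the harmonic mean of precision and recall. The minimal sketch below recomputes it from the reported precision and recall; the function name `f1_score` and the dictionary layout are illustrative only, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the Results rows.
reported = {
    "DeepDevVuln (fine-tuned CodeBERT)": (58.87, 63.00),
    "text-davinci-003 zero-shot": (46.99, 78.00),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

Both recomputed values agree with the reported F1 figures (60.87% and 58.65%), so the table is internally consistent.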

What To Try In 7 Days

Run CodeQL on a representative repo set to gather labeled vulnerable snippets as training seeds.

Fine-tune a compact CodeBERT-style model on CodeQL-labeled pairs and oversample positives to balance classes.

Integrate the model as an EditTime filter for code-completion outputs and tune detection threshold to ~1% positive rate for low churn.
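One way to approach the threshold-tuning step is to calibrate on detector scores from a sample of unlabeled snippets and pick the cut-off that flags roughly the target fraction. The sketch below is an assumption about how this could be done, not the paper's procedure; `threshold_for_positive_rate`, `positive_rate`, and the synthetic scores are all hypothetical.

```python
import random


def threshold_for_positive_rate(scores, target_rate):
    """Return a threshold t so that about `target_rate` of scores are >= t."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of snippets allowed to be flagged; always flag at least one.
    k = max(1, round(len(ranked) * target_rate))
    return ranked[k - 1]


def positive_rate(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical stand-in for detector scores on a calibration sample.
rng = random.Random(0)
calibration_scores = [rng.random() ** 3 for _ in range(10_000)]

t = threshold_for_positive_rate(calibration_scores, target_rate=0.01)
rate = positive_rate(calibration_scores, t)
print(f"threshold={t:.3f}, positive rate={rate:.2%}")
```

If many snippets share the same score, the realized positive rate can exceed the target, so it is worth re-measuring the rate on fresh traffic after deployment.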

Optimization Features

Model Optimization

Fine-tuning a compact CodeBERT trunk with a linear head for classification

System Optimization

Prompt-based zero-/few-shot runs require fetching examples per inference and increase runtime cost

Training Optimization

Oversample vulnerable examples to create 50/50 class balance per epoch
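The oversampling idea can be sketched as follows, assuming vulnerable examples are the minority class and that positives are redrawn with replacement for every epoch. The helper `balanced_epoch` and the toy data are hypothetical and only illustrate the 50/50 mix; the paper's actual data pipeline is not described here.

```python
import random


def balanced_epoch(examples, rng):
    """Build one epoch with a 50/50 class mix by oversampling positives.

    `examples` is a list of (snippet, label) pairs, where label 1 = vulnerable.
    """
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    # Draw positives with replacement until they match the negatives in count.
    oversampled = rng.choices(positives, k=len(negatives))
    epoch = negatives + oversampled
    rng.shuffle(epoch)
    return epoch


# Hypothetical toy data: 5 vulnerable snippets among 100.
data = [(f"snippet_{i}", 1 if i < 5 else 0) for i in range(100)]
epoch = balanced_epoch(data, random.Random(0))
n_pos = sum(label for _, label in epoch)
print(f"epoch size={len(epoch)}, positives={n_pos}")
```

Calling `balanced_epoch` again at the start of each epoch gives a different set of repeated positives, which avoids training on one fixed duplicated set.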

Inference Optimization

Use a small CodeBERT-base model to keep EditTime latency low (claimed to be under roughly 100 ms)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/github/codeql (CodeQL queries referenced)

Data URLs

https://github.com/github/codeql (source of CodeQL queries); paper states expanded benchmark will be made available

Risks & Boundaries

Limitations

Training labels derive from CodeQL static analysis and inherit its coverage and false-negative biases.

Class imbalance and noisy labels from static analyzer scans can limit recall/precision trade-offs.

When Not To Use

When you require formal guarantees or dynamic-execution proofs of absence of vulnerabilities.

For languages or CWE types not present in the training set (coverage gap).

Failure Modes

Overreach: zero-shot models flag hypothetical future vulnerabilities and inflate false positives.

Lack of context: truncated snippets can cause misclassification when initialization or scope matters.

Core Entities

Models

CodeBERT, code-davinci-002, text-davinci-003, code-cushman-001, CodeGen-2B, Codex

Metrics

Precision, Recall, F1-Score, Positive Rate, Vulnerability Reduction Rate

Datasets

DeepDevVuln training set (CodeQL on GitHub, ~500K vuln examples); GitHub PR dataset (1,006 examples); Pearce-style scenario benchmark (Python + expanded JS scenarios); VulDeePecker; SeVC; ReVeal; FFmpeg+Qemu

Benchmarks

VulDeePecker, SeVC, ReVeal, FFmpeg+Qemu, Pearce et al. scenario benchmark (expanded)
