Find code security bugs while the developer types using transformer models

May 23, 2023 · 9 min

Overview

Production Readiness

0.75

Novelty Score

0.45

Cost Impact Score

0.7

Citation Count

7

Authors

Aaron Chan, Anant Kharkar, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Alec Helyar, Eslam Kamal, Mohamed Elkamhawy, Neel Sundaresan

Why It Matters For Business

Catching vulnerabilities while code is being written reduces the time and cost of fixing them. The paper reports large reductions in vulnerable completions from code LMs and a near-90% vulnerability reduction on production JavaScript edits when the detector is integrated into an editor.

Summary TLDR

The authors build DeepDevVuln, a transformer-based system that detects software vulnerabilities in incomplete code snippets as developers type ("EditTime"). They collect ~500K vulnerable examples from CodeQL runs on GitHub, compare zero-shot, few-shot, and fine-tuned variants of CodeBERT and Codex-family models, and deploy a low-latency CodeBERT-based detector in a VSCode extension. Fine-tuned CodeBERT gives the best precision/recall balance (≈59% precision, 63% recall on their PR test set). Zero- and few-shot prompting of large instruction-tuned models gives higher recall but more false positives (text-davinci-003 recall ≈78%). Filtering code-LM outputs with DeepDevVuln reduced vulnerable scenario rates by close to 90% for text-davinci-003.

Problem Statement

Most existing vulnerability detectors need complete, compilable code and run late (build/test time). That delay increases fix cost. The paper asks: can transformer models spot vulnerabilities in syntactically incomplete snippets while a developer is typing (EditTime)?

Main Contribution

A production-quality EditTime vulnerability detector that works on incomplete code snippets and runs with low latency.

A large training corpus derived from CodeQL runs on public GitHub (hundreds of thousands of examples) spanning seven languages and 250+ CWE types.

An empirical comparison of zero-shot, few-shot, and fine-tuning strategies on CodeBERT and Codex-family models.

An expanded scenario benchmark for evaluating vulnerabilities in model-generated code and a deployed VSCode extension with live telemetry.

Key Findings

Fine-tuned CodeBERT (DeepDevVuln) has the best F1 balance on their GitHub PR test set.

Numbers: Precision 58.87%, Recall 63.00%, F1 60.87% (Table 3)

Large instruction-tuned LLMs in zero-/few-shot mode get higher recall but many false positives.

Numbers: text-davinci-003 zero-shot: Recall 78%, Precision 47% (Table 3)

Filtering code-LM completions with the detector cuts scenario-level vulnerability rates dramatically.

Numbers: text-davinci-003 scenarios: 21→2 vulnerable (89.74% reduction) (Table 7)

Production deployment on VSCode showed large vulnerability reduction on real JS edits.

Numbers: Observed 89.64% vulnerability reduction on JavaScript snippets (6.7M snippets observed) (Sec. 5)

The authors assembled a large, imbalanced dataset from CodeQL over GitHub covering multiple languages.

Numbers: JavaScript: 266,342 vulnerable / 2,293,712 non-vulnerable; total training set >500K vulnerable examples (Table 1, Sec. 3)

Fine-tuned Codex (CodexVuln) trades precision for recall differently than CodeBERT.

Numbers: CodexVuln: Precision 69.56%, Recall 48.00% (Table 3)

Results

DeepDevVuln (CodeBERT fine-tuned) on GitHub PR dataset

Value: Precision 58.87%, Recall 63.00%, F1 60.87%

text-davinci-003 zero-shot on GitHub PR dataset

Value: Precision 46.99%, Recall 78.00%, F1 58.65%

CodexZero (code-davinci-002 zero-shot) on GitHub PR dataset

Value: Precision 11.08%, Recall 98.00%, F1 19.90%
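The F1 values in the three GitHub PR results above follow from the listed precision and recall through the standard harmonic-mean definition, F1 = 2PR / (P + R). The short check below recomputes them from the rounded figures in this summary; the function name `f1_score` and the dictionary layout are illustrative choices, not something taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the results listed above.
reported = {
    "DeepDevVuln": (58.87, 63.00, 60.87),
    "text-davinci-003 zero-shot": (46.99, 78.00, 58.65),
    "CodexZero": (11.08, 98.00, 19.90),
}

for name, (p, r, f1) in reported.items():
    # Small differences are expected because the inputs are already rounded.
    print(f"{name}: recomputed F1 = {f1_score(p, r):.2f}, reported = {f1:.2f}")
```

The recomputed values agree with the reported ones to within rounding error, which also makes the trade-off concrete: CodexZero's 98% recall cannot compensate for 11% precision under a harmonic mean.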

Vulnerability reduction of code-LM scenarios after filtering with DeepDevVuln

Value: text-davinci-003: 21→2 vulnerable scenarios (89.74% reduction)

Baseline: Before filtering: 21 vulnerable scenarios (78% of valid)

Production JS telemetry: detected impact

Value: Observed vulnerability reduction rate 89.64% on JavaScript EditTime traffic (6.7M snippets)

Baseline: CodeQL detections used as lower-bound

Model vs prior SOTA on benchmarks (claimed)

Value: Improves recall by ~10% and precision by ~8% on evaluated benchmarks

Baseline: Prior state-of-the-art models on VulDeePecker/SeVC/ReVeal/FFmpeg+Qemu

Who Should Care

What To Try In 7 Days

Run CodeQL on a representative repo set to gather labeled vulnerable snippets as training seeds.

Fine-tune a compact CodeBERT-style model on CodeQL-labeled pairs and oversample positives to balance classes.

Integrate the model as an EditTime filter for code-completion outputs and tune the detection threshold so that roughly 1% of snippets are flagged, which keeps developer-facing interruptions low.
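The last step can be sketched as follows: pick the score cutoff from a sample of detector scores so that about 1% of snippets are flagged, then drop completions that score at or above it. This is a minimal sketch under assumptions, not the authors' implementation. The helper names `threshold_for_positive_rate`, `filter_completions`, and `toy_score` are hypothetical, the calibration scores are synthetic, and `toy_score` merely stands in for a real detector's probability output.

```python
import random


def threshold_for_positive_rate(scores, target_rate=0.01):
    """Return the cutoff at which about `target_rate` of `scores` are flagged."""
    if not scores:
        raise ValueError("need at least one calibration score")
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * target_rate))  # number of snippets to flag
    return ranked[k - 1]


def filter_completions(completions, score_fn, threshold):
    """Keep only completions the detector scores below the threshold."""
    return [c for c in completions if score_fn(c) < threshold]


def toy_score(snippet: str) -> float:
    # Hypothetical stand-in for a model's vulnerability probability.
    return 0.99 if "eval(" in snippet else 0.10


# Synthetic calibration scores; in practice these would come from real traffic.
rng = random.Random(0)
calibration_scores = [rng.random() for _ in range(10_000)]
threshold = threshold_for_positive_rate(calibration_scores, target_rate=0.01)
flagged = sum(s >= threshold for s in calibration_scores)
print(f"positive rate at threshold: {flagged / len(calibration_scores):.2%}")

kept = filter_completions(["eval(user_input)", "total = a + b"], toy_score, 0.5)
print(kept)
```

Calibrating on the positive rate rather than on precision is a pragmatic choice when ground truth for live snippets is unavailable; the cutoff should be recomputed whenever the traffic mix or the model changes.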

Optimization Features

Model Optimization

  • Fine-tuning a compact CodeBERT trunk with a linear head for classification

System Optimization

  • Prompt-based zero-/few-shot runs require fetching examples per inference and increase runtime cost

Training Optimization

  • Oversample vulnerable examples to create 50/50 class balance per epoch
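One way to realize a 50/50 class balance per epoch is to keep every non-vulnerable example and draw vulnerable examples with replacement until the two classes are the same size. The sketch below assumes exactly that scheme; the summary does not specify the paper's sampling procedure, and the function name `balanced_epoch` and the toy data are hypothetical.

```python
import random


def balanced_epoch(positives, negatives, rng):
    """Build one epoch of (example, label) pairs with a 50/50 class mix.

    Assumption: all negatives are kept and positives are resampled with
    replacement to match their count.
    """
    if not positives or not negatives:
        raise ValueError("both classes must be non-empty")
    oversampled = rng.choices(positives, k=len(negatives))
    epoch = [(x, 1) for x in oversampled] + [(x, 0) for x in negatives]
    rng.shuffle(epoch)
    return epoch


# Toy data with roughly the imbalance described for the training set.
vulnerable = [f"vuln_{i}" for i in range(3)]
safe = [f"safe_{i}" for i in range(20)]

epoch = balanced_epoch(vulnerable, safe, random.Random(42))
n_pos = sum(label for _, label in epoch)
print(f"{len(epoch)} examples, {n_pos} vulnerable, {len(epoch) - n_pos} non-vulnerable")
```

Calling `balanced_epoch` again with a fresh random state at the start of each epoch gives a different draw of positives, so the repeated vulnerable examples vary across epochs.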

Inference Optimization

  • Use a small CodeBERT-base model to keep EditTime latency low (the authors claim latency on the order of 100 ms or less)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training labels derive from CodeQL static analysis and inherit its coverage and false-negative biases.
  • Class imbalance and noisy labels from static analyzer scans can limit recall/precision trade-offs.
  • Zero- and few-shot settings produced many false positives due to worst-case speculation by large models.
  • Production precision is hard to measure because ground truth for all live snippets is unavailable.

When Not To Use

  • When you require formal guarantees or dynamic-execution proofs of absence of vulnerabilities.
  • For languages or CWE types not present in the training set (coverage gap).
  • If you cannot tolerate any developer-facing false positives without workflow changes.

Failure Modes

  • Overreach: zero-shot models flag hypothetical future vulnerabilities and inflate false positives.
  • Lack of context: truncated snippets can cause misclassification when initialization or scope matters.
  • Label noise: CodeQL misses some real vulnerabilities and labels can be inconsistent across repos.

Core Entities

Models

  • CodeBERT
  • code-davinci-002
  • text-davinci-003
  • code-cushman-001
  • CodeGen-2B
  • Codex

Metrics

  • Precision
  • Recall
  • F1-Score
  • Positive Rate
  • Vulnerability Reduction Rate

Datasets

  • DeepDevVuln training set (CodeQL on GitHub, ~500K vuln examples)
  • GitHub PR dataset (1,006 examples)
  • Pearce-style scenario benchmark (Python + expanded JS scenarios)
  • VulDeePecker
  • SeVC
  • ReVeal
  • FFmpeg+Qemu

Benchmarks

  • VulDeePecker
  • SeVC
  • ReVeal
  • FFmpeg+Qemu
  • Pearce et al. scenario benchmark (expanded)