Teach code models to write secure code by synthesizing vuln/fix pairs and using two-step generation that adds the needed libraries first

September 10, 2024 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz

Links

Abstract / PDF

Why It Matters For Business

HexaCoder gives a practical, automatable path to reduce insecure code generation from LLMs by synthesizing repair data and fine-tuning models, lowering security risk in AI-assisted coding without harming productivity.

Summary TLDR

HexaCoder builds a pipeline that (1) automatically synthesizes vulnerable code and GPT‑4 fixes guided by a static security oracle (CodeQL) and mitigation hints, (2) fine-tunes code models on these vuln/fix pairs via LoRA, and (3) uses two-step generation at inference (first add libraries, then generate code). Across three benchmarks and four CodeLMs, HexaCoder sharply cuts vulnerable outputs (up to about 85% fewer among top-5 samples, for CodeGen-2B) while keeping functional accuracy similar.
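The oracle-guided repair step in (1) can be pictured as prompt assembly: the vulnerable code, the analyzer's report, and a mitigation hint are combined into one instruction for the repair model. The following is a minimal sketch and not the authors' prompt; the `Finding` fields, the template wording, and the example hint are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One static-analyzer finding (hypothetical structure, not the paper's)."""
    cwe: str        # e.g. "CWE-078"
    message: str    # analyzer's description of the problem
    line: int       # 1-based line number in the vulnerable snippet
    hint: str = ""  # optional mitigation hint


def build_repair_prompt(code: str, finding: Finding) -> str:
    """Assemble a repair prompt from code, an analyzer report, and a hint."""
    parts = [
        "Fix the security vulnerability in the code below.",
        f"Analyzer report: {finding.cwe} at line {finding.line}: {finding.message}",
    ]
    # The hint is optional, mirroring the "CodeQL only" vs "CodeQL + hint" settings.
    if finding.hint:
        parts.append(f"Mitigation hint: {finding.hint}")
    parts.append("Vulnerable code:\n" + code)
    parts.append("Return only the fixed code.")
    return "\n\n".join(parts)


vulnerable = 'import os\n\ndef ping(host):\n    os.system("ping -c 1 " + host)\n'
finding = Finding(
    cwe="CWE-078",
    message="Command built from user-controlled input",
    line=4,
    hint="Pass arguments as a list to subprocess.run instead of using a shell.",
)
prompt = build_repair_prompt(vulnerable, finding)
print(prompt)
```

In a real pipeline the returned fix would be re-checked by the analyzer before being kept as a training pair; that loop is omitted here.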

Problem Statement

Large code LLMs often generate insecure code because training corpora contain vulnerable examples. Curating large, labeled secure datasets is costly. We need an automated way to create training data that teaches models to produce secure code without breaking functionality.

Main Contribution

An oracle-guided synthesis pipeline that uses a static analyzer report plus mitigation hints to prompt GPT-4 to turn vulnerable code into fixed code, producing vuln/fix pairs at scale.

A LoRA-based fine-tuning procedure that trains CodeLMs on the synthesized vuln/fix pairs using a masked loss focused on tokens changed for security.

A two-step inference method that first generates required libraries/modules and then completes the code, helping the model include security-relevant libraries and reducing vulnerabilities.
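The two-step inference idea can be sketched as two calls to the same model. This is a sketch under stated assumptions, not the paper's implementation: `generate` is a hypothetical callable wrapping any code model, the `# Required libraries:` marker is an invented prompt convention, and `fake_model` is a stand-in so the example runs without a model.

```python
from typing import Callable


def two_step_generate(task_prompt: str, generate: Callable[[str], str]) -> str:
    """Two-step generation: ask for imports first, then complete the code.

    `generate` is a hypothetical callable: it takes a prompt string and
    returns the model's continuation.
    """
    # Step 1: let the model propose the libraries it needs.
    import_prompt = task_prompt + "\n# Required libraries:\n"
    raw = generate(import_prompt)
    # Keep only lines that look like Python imports.
    imports = [
        line.strip()
        for line in raw.splitlines()
        if line.strip().startswith(("import ", "from "))
    ]
    # Step 2: condition the final completion on the chosen imports.
    header = "\n".join(imports)
    code_prompt = f"{header}\n\n{task_prompt}\n" if header else f"{task_prompt}\n"
    body = generate(code_prompt)
    return code_prompt + body


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without dependencies."""
    if prompt.endswith("# Required libraries:\n"):
        return "import subprocess\nimport shlex\nsome stray text"
    return "def ping(host):\n    subprocess.run(['ping', '-c', '1', host], check=True)\n"


result = two_step_generate("# Ping a host given by the user", fake_model)
print(result)
```

Note that the stand-in imports `shlex` without using it, which is the kind of leftover import a post-generation cleanup pass would need to remove.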

Key Findings

The synthesis pipeline repaired 1,776 out of 2,042 vulnerable samples.

Numbers: fixed 1,776/2,042 (≈87.0%)

Providing CodeQL report plus mitigation hints increased automated repair rates versus no report or CodeQL alone.

Numbers: average repair rate: no report 42.66% → CodeQL 63.33% → CodeQL + hint 83.99%

Fine-tuning + two-step generation reduced vulnerable outputs by up to ~85% on a tested model and benchmark.

Numbers: CodeGen-2B top-5 vulnerable: Base 456 → HexaCoder 65 (≈85.7% reduction)

HexaCoder preserved or modestly improved functional correctness on evaluated tasks.

Numbers: DeepSeek pass@10: Base 70.5 → HexaCoder 72.0; other models comparable

Results

Synthesis repair rate

Value: 1,776 / 2,042 fixed (≈87.0%)

Effect of security report components on repair

Value: no report 42.66% → CodeQL 63.33% → CodeQL + hint 83.99%

Baseline: no report 42.66%

Vulnerable outputs reduced (example)

Value: CodeGen-2B-multi top-5 vulnerable: 456 → 65

Baseline: base model top-5 = 456

Functional correctness (example)

Value: DeepSeek pass@10: 70.5 → 72.0

Baseline: base pass@10 = 70.5

Two-step generation impact (example)

Value: CodeGen-350M top-5 vulnerable: HexaCoder without two-step generation 174 → with two-step generation 81

Baseline: HexaCoder without two-step generation = 174
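The percentages quoted above follow from simple arithmetic on the reported counts. A short check, using only numbers stated in this summary (the helper name `relative_reduction` is ours):

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


repair_rate = 100.0 * 1776 / 2042             # synthesized pairs fixed
codegen_2b = relative_reduction(456, 65)      # base vs HexaCoder, top-5
two_step_350m = relative_reduction(174, 81)   # without vs with two-step
report_gain = 83.99 - 42.66                   # no report vs CodeQL + hint

print(f"Synthesis repair rate:        {repair_rate:.1f}%")      # 87.0%
print(f"CodeGen-2B vulnerable outputs: -{codegen_2b:.1f}%")     # -85.7%
print(f"Two-step effect (350M):        -{two_step_350m:.1f}%")  # -53.4%
print(f"Repair-rate gain from report+hint: {report_gain:.2f} points")  # 41.33
```

The two-step figure is worth noting on its own: for CodeGen-350M, adding the library-first step roughly halves the remaining vulnerable outputs after fine-tuning.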

Who Should Care

What To Try In 7 Days

Run CodeQL on a sample of model outputs to measure current vuln rate.

Use an instruction-tuned LLM (e.g., GPT‑4) plus CodeQL reports to synthesize a small vuln/fix set for 1–2 CWE types.

LoRA-fine-tune a target CodeLM on that small set and validate top-5 outputs on the same static checks.
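For the first step, one way to get a baseline number is to have CodeQL write its findings as SARIF (the CodeQL CLI supports SARIF output) and count how many generated files have at least one finding. The sketch below assumes one generated sample per file and uses a tiny in-memory SARIF document in place of a real report; the file names are hypothetical.

```python
import json
from collections import Counter


def flagged_files(sarif: dict) -> Counter:
    """Count findings per file URI in a SARIF 2.1.0 document."""
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                uri = (
                    loc.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    counts[uri] += 1
    return counts


def vulnerable_rate(sarif: dict, all_files: list) -> float:
    """Fraction of generated files with at least one finding."""
    flagged = set(flagged_files(sarif))
    return sum(f in flagged for f in all_files) / len(all_files)


def _result(rule_id: str, uri: str) -> dict:
    """Build one minimal SARIF result entry for the demo document."""
    return {
        "ruleId": rule_id,
        "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
    }


# Tiny stand-in for a real report: two findings in one file, one in another.
sample = {
    "version": "2.1.0",
    "runs": [{
        "results": [
            _result("py/command-line-injection", "sample_0.py"),
            _result("py/weak-cryptographic-algorithm", "sample_0.py"),
            _result("py/sql-injection", "sample_2.py"),
        ]
    }],
}

files = [f"sample_{i}.py" for i in range(5)]  # e.g. top-5 outputs for one prompt
print(json.dumps(dict(flagged_files(sample))))
print(f"vulnerable rate: {vulnerable_rate(sample, files):.0%}")
```

Running the same count before and after fine-tuning, on the same prompts, gives a like-for-like comparison in the spirit of the top-5 numbers reported above.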

Optimization Features

Training Optimization

  • LoRA
  • masked loss focused on changed tokens
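To make "masked loss focused on changed tokens" concrete, the sketch below marks which tokens of the fixed code differ from the vulnerable code and averages per-token losses over those positions only. It is an illustration under assumptions, not the paper's training code: whitespace tokenization, the `difflib` diff, and the made-up per-token losses are all stand-ins (in practice the losses would be a model's per-token cross-entropy).

```python
import difflib


def changed_token_mask(vuln_tokens: list, fixed_tokens: list) -> list:
    """Return 1 for tokens of the fixed code that differ from the vulnerable code."""
    mask = [0] * len(fixed_tokens)
    matcher = difflib.SequenceMatcher(a=vuln_tokens, b=fixed_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" cover tokens that are new in the fixed code.
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask


def masked_loss(token_losses: list, mask: list) -> float:
    """Average per-token loss over masked (changed) positions only."""
    selected = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


vuln = "hashlib . md5 ( data )".split()
fixed = "hashlib . sha256 ( data )".split()
mask = changed_token_mask(vuln, fixed)
print(mask)  # [0, 0, 1, 0, 0, 0]

# Made-up per-token losses for the fixed sequence (one value per token).
losses = [0.1, 0.2, 2.0, 0.3, 0.1, 0.1]
print(masked_loss(losses, mask))  # 2.0
```

The effect is that the gradient signal comes from the security-relevant edit (`md5` → `sha256` here) rather than from the unchanged surrounding code.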

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on CodeQL static analysis, which can miss or misclassify vulnerabilities.
  • Evaluation only covers Python and C/C++; other languages not tested.
  • Fixed code from GPT‑4 may change behavior; functionality preservation is not guaranteed for every example.
  • Two-step generation can insert unused libraries; extra cleanup may be needed.

When Not To Use

  • If you need provable semantic preservation of fixes: this pipeline does not formally verify functional equivalence.
  • When dynamic or runtime-only bugs (race conditions, some memory issues) are the primary concern: static analysis may miss them.
  • If you cannot accept a dependency on a commercial repair model like GPT‑4 and have no substitute repair model.

Failure Modes

  • Static oracle misses a vulnerability; the synthesized 'fixed' code remains insecure.
  • Model repair introduces functional regressions or removes intended behavior.
  • Data imbalance: few examples for a CWE can lead to worse results (noted for CWE-020).

Core Entities

Models

  • CodeGen-350M-multi
  • CodeGen-2B-multi
  • InCoder-6B
  • DeepSeek-Coder-V2-16B
  • GPT-4
  • GPT-3.5
  • Codex

Metrics

  • number of vulnerable code samples (among top-k generations)
  • pass@k (functional correctness)
  • repair rate (fix success)
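For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per task of which c pass the tests, pass@k = 1 − C(n−c, k) / C(n, k), averaged over tasks. Whether this summary's pass@10 figures use exactly this estimator is an assumption; the sketch below shows the per-task formula only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 20 samples, 3 correct, evaluated at k = 10.
score = pass_at_k(n=20, c=3, k=10)
print(f"pass@10 = {score:.4f}")  # pass@10 = 0.8947
```

A benchmark-level score is the mean of this value over all tasks, typically reported as a percentage.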

Datasets

  • Synthesized vuln/fix pairs (1,776 examples)
  • CodeLMSec benchmark
  • Pearce et al. benchmark
  • HumanEval

Benchmarks

  • CodeLMSec
  • Pearce et al. (Copilot scenarios)
  • HumanEval