Teach code models to be secure by synthesizing vuln/fix pairs and using a two-step generation that first adds needed libraries

September 10, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Results come from multiple models and two public security benchmarks (CodeLMSec and Pearce et al.), with HumanEval used to check functional correctness, and show large reductions in detected vulnerabilities. However, detection relies on static analysis (CodeQL), and only Python and C/C++ were tested.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 70%

Authors

Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz

Why It Matters For Business

HexaCoder gives a practical, automatable path to reduce insecure code generation from LLMs: it synthesizes repair data and fine-tunes models on it, lowering security risk in AI-assisted coding while keeping functional accuracy similar.

Who Should Care

Summary TLDR

HexaCoder builds a pipeline that (1) automatically synthesizes vulnerable code and GPT‑4 fixes guided by a static security oracle (CodeQL) and mitigation hints, (2) fine-tunes code models on these vuln/fix pairs via LoRA, and (3) uses a two-step generation at inference (first add libraries, then generate code). On three benchmarks and four CodeLMs, HexaCoder cuts vulnerable outputs sharply (in one reported example, up to ~85% fewer vulnerable outputs among the top-5 samples) while keeping functional accuracy similar.
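The two-step generation in step (3) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `complete` is a hypothetical prompt-to-text callable wrapping any code model, `toy_model` is a stand-in used only so the example runs, and the prompt layout is an assumption.

```python
from typing import Callable


def two_step_generate(prompt: str, complete: Callable[[str], str]) -> str:
    """Step 1 asks for the needed libraries; step 2 generates code with them in context."""
    # Step 1: let the model propose the imports it needs for this task.
    libraries = complete(prompt + "\n# Required libraries:\n").strip()
    # Step 2: generate the body, conditioned on the imports plus the original prompt.
    step2_prompt = f"{libraries}\n\n{prompt}\n"
    body = complete(step2_prompt)
    return step2_prompt + body


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a real CodeLM, for demonstration only."""
    if text.endswith("# Required libraries:\n"):
        return "import hashlib\n"
    return "    return hashlib.sha256(data).hexdigest()\n"


result = two_step_generate("def digest(data: bytes) -> str:", toy_model)
print(result)
```

The point of the split is that the secure library is already in the context when the body is generated, so the model does not have to fall back on whatever happens to be imported in the prompt.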

Problem Statement

Large code LLMs often generate insecure code because training corpora contain vulnerable examples. Curating large, labeled secure datasets is costly. We need an automated way to create training data that teaches models to produce secure code without breaking functionality.

Main Contribution

An oracle-guided synthesis pipeline that uses a static analyzer report plus mitigation hints to prompt GPT-4 to turn vulnerable code into fixed code, producing vuln/fix pairs at scale.
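One way to picture the oracle-guided loop is the sketch below. It is an illustration under stated assumptions, not the paper's code: `analyze` stands in for a static analyzer such as CodeQL and returns a list of finding strings, `repair` stands in for the repair LLM, and the prompt wording, `toy_analyze`, and `toy_repair` are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def build_repair_prompt(code: str, report: str, hint: str) -> str:
    """Combine the analyzer report and a mitigation hint with the vulnerable code."""
    return (
        "The following code contains a security vulnerability.\n"
        f"Static analyzer report:\n{report}\n"
        f"Mitigation hint:\n{hint}\n"
        f"Vulnerable code:\n{code}\n"
        "Rewrite the code so the vulnerability is fixed and behavior is preserved.\n"
    )


def synthesize_pair(
    code: str,
    analyze: Callable[[str], List[str]],
    repair: Callable[[str], str],
    hint: str,
) -> Optional[Tuple[str, str]]:
    """Return a (vulnerable, fixed) pair only if the oracle accepts the fix."""
    findings = analyze(code)
    if not findings:
        return None  # nothing to repair
    fixed = repair(build_repair_prompt(code, "\n".join(findings), hint))
    if analyze(fixed):
        return None  # the oracle still flags the candidate, so discard it
    return code, fixed


def toy_analyze(code: str) -> List[str]:
    """Hypothetical oracle: flags any use of shell=True."""
    return ["CWE-078: command built with shell=True"] if "shell=True" in code else []


def toy_repair(prompt: str) -> str:
    """Hypothetical repair model that always returns the list-argument form."""
    return 'subprocess.run(["ls", path])\n'


pair = synthesize_pair(
    'subprocess.run("ls " + path, shell=True)\n',
    toy_analyze,
    toy_repair,
    "Pass arguments as a list and avoid shell=True.",
)
print(pair)
```

Re-running the oracle on the candidate fix is what makes the resulting corpus "validated": pairs whose fix still triggers a finding never reach the training set.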

A LoRA-based fine-tuning procedure that trains CodeLMs on the synthesized vuln/fix pairs using a masked loss focused on tokens changed for security.
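A masked loss of this kind can be illustrated with the standard library alone. The sketch below is one plausible reading, not the paper's exact formulation: tokens are whitespace-split (an assumption), changed tokens are found with `difflib`, and the loss is the mean negative log-likelihood over only those tokens.

```python
import difflib
import math
from typing import List


def changed_token_mask(vuln_tokens: List[str], fixed_tokens: List[str]) -> List[int]:
    """Mark tokens of the fixed code that were replaced or inserted relative to the vulnerable code."""
    mask = [0] * len(fixed_tokens)
    matcher = difflib.SequenceMatcher(a=vuln_tokens, b=fixed_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask


def masked_nll(token_probs: List[float], mask: List[int]) -> float:
    """Average negative log-likelihood over masked (security-relevant) tokens only."""
    total = sum(-math.log(p) for p, m in zip(token_probs, mask) if m)
    count = sum(mask)
    return total / count if count else 0.0


vuln = "hashlib.md5 ( data )".split()
fixed = "hashlib.sha256 ( data )".split()
mask = changed_token_mask(vuln, fixed)

# Hypothetical model probabilities for each token of the fixed code.
probs = [0.5, 0.9, 0.9, 0.9]
loss = masked_nll(probs, mask)
print(mask, round(loss, 4))
```

Only the first token differs between the two versions, so only its probability contributes to the loss; the unchanged tokens are ignored.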

Key Findings

The synthesis pipeline repaired 1,776 out of 2,042 vulnerable samples.

Numbers: fixed 1,776/2,042 (≈87.0%)

Practical Use: You can automatically produce a large, validated secure-code corpus (≈87% success) to fine-tune models rather than hand-labeling many examples.

Evidence Ref: Sec. 5.2; Table 2

Providing CodeQL report plus mitigation hints increased automated repair rates versus no report or CodeQL alone.

Numbers: average repair rate: None 42.66% → CodeQL 63.33% → CodeQL+Hint 83.99%

Practical Use: Include both a static oracle output and short mitigation hints in prompts to get far better fixed-code outputs from a repair LLM.

Evidence Ref: Sec. 5.2.1; Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Synthesis repair rate | 1,776 / 2,042 fixed (≈87.0%) | | | vulnerable set from [32] | We ran GPT‑4 guided by CodeQL reports; 1,776 out of 2,042 vulnerable samples produced fixed code with no CodeQL findings | Sec. 5.2; Table 2 |
| Effect of security report components on repair | None 42.66% → CodeQL 63.33% → CodeQL+Hint 83.99% | No report 42.66% | +41.33 pp (CodeQL+Hint vs None) | 30 samples per CWE experiment | Repair rates measured on 30 random samples/CWE with three prompt variants | Sec. 5.2.1; Table 3 |
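The reported percentage and percentage-point delta can be re-derived from the raw numbers above. The snippet is only an arithmetic check on figures already quoted in this summary.

```python
# Synthesis repair rate from the reported counts.
fixed_count, total_count = 1776, 2042
repair_rate = 100 * fixed_count / total_count

# Average repair rates for the three prompt variants (in percent).
rates = {"None": 42.66, "CodeQL": 63.33, "CodeQL+Hint": 83.99}
delta_pp = rates["CodeQL+Hint"] - rates["None"]
report_gain_pp = rates["CodeQL"] - rates["None"]
hint_gain_pp = rates["CodeQL+Hint"] - rates["CodeQL"]

print(f"repair rate: {repair_rate:.1f}%")
print(f"CodeQL+Hint vs None: {delta_pp:+.2f} pp")
print(f"report alone: {report_gain_pp:+.2f} pp, hint on top: {hint_gain_pp:+.2f} pp")
```

The breakdown shows that the CodeQL report and the mitigation hint each contribute roughly 20.7 percentage points.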

What To Try In 7 Days

Run CodeQL on a sample of model outputs to measure current vuln rate.

Use an instruction-tuned LLM (e.g., GPT‑4) plus CodeQL reports to synthesize a small vuln/fix set for 1–2 CWE types.

LoRA-fine-tune a target CodeLM on that small set and validate top-5 outputs on the same static checks.
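For the first step, a vulnerability rate can be computed from analyzer output once it is exported as SARIF, a JSON format that CodeQL can emit. The sketch below is a minimal example under assumptions: each model output was saved as its own file, the file names match the URIs in the SARIF results, and the inline `sarif` dictionary is made-up illustration data rather than real findings.

```python
from typing import Iterable, Set


def flagged_files(sarif: dict) -> Set[str]:
    """Collect the URIs of all files that have at least one result in a SARIF log."""
    flagged: Set[str] = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                uri = (
                    location.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    flagged.add(uri)
    return flagged


def vulnerability_rate(sample_files: Iterable[str], sarif: dict) -> float:
    """Fraction of sampled outputs with at least one static-analysis finding."""
    files = list(sample_files)
    if not files:
        return 0.0
    flagged = flagged_files(sarif)
    return sum(name in flagged for name in files) / len(files)


# Illustrative SARIF fragment (hypothetical data): one finding in sample_2.py.
sarif = {
    "runs": [
        {
            "results": [
                {
                    "ruleId": "py/command-line-injection",
                    "locations": [
                        {"physicalLocation": {"artifactLocation": {"uri": "sample_2.py"}}}
                    ],
                }
            ]
        }
    ]
}

top5 = [f"sample_{i}.py" for i in range(5)]
rate = vulnerability_rate(top5, sarif)
print(f"{rate:.0%} of top-5 samples flagged")
```

Running the same measurement before and after fine-tuning, on the same prompts and the same queries, gives the before/after comparison the plan calls for.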

Optimization Features

Training Optimization
LoRA; masked loss focused on changed tokens
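For readers unfamiliar with LoRA, the core idea is a frozen weight matrix plus a trainable low-rank update, W' = W + (alpha / r) · B · A. The pure-Python sketch below shows only that arithmetic on toy matrices; the values are arbitrary, and real fine-tuning would use a training library rather than nested lists.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where W is frozen and A, B are trained."""
    scale = alpha / r
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy values)
b = [[1.0], [0.0]]            # trainable, shape out x r with r = 1
a = [[0.0, 2.0]]              # trainable, shape r x in
adapted = lora_weight(w, a, b, alpha=2.0, r=1)
print(adapted)
```

Because only A and B are trained, the number of trainable parameters scales with r · (in + out) instead of in · out, which is what makes fine-tuning on a small synthesized corpus inexpensive.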

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Relies on CodeQL static analysis, which can miss or misclassify vulnerabilities.

Evaluation only covers Python and C/C++; other languages not tested.

When Not To Use

When you need provable semantic preservation of fixes: this pipeline does not formally verify functional equivalence.

When dynamic or runtime-only bugs (race conditions, some memory issues) are the primary concern: static analysis may miss them.

Failure Modes

Static oracle misses a vulnerability; the synthesized 'fixed' code remains insecure.

Model repair introduces functional regressions or removes intended behavior.

Core Entities

Models

CodeGen-350M-multi, CodeGen-2B-multi, InCoder-6B, DeepSeek-Coder-V2-16B, GPT-4, GPT-3.5, Codex

Metrics

number of vulnerable codes (top-k samples), pass@k (functional correctness), repair rate (fix success)
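pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per problem of which c pass the tests, pass@k = 1 − C(n − c, k) / C(n, k). The sketch below implements that standard formula; whether this summary's source used exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of them pass the unit tests.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(round(p1, 4), round(p5, 4))
```

Averaging this value over all problems in the benchmark gives the reported score, which is how a security fine-tune can be checked for functional regressions.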

Datasets

Synthesized vuln/fix pairs (1,776 examples), CodeLMSec benchmark, Pearce et al. benchmark, HumanEval

Benchmarks

CodeLMSec, Pearce et al. (Copilot scenarios), HumanEval