Teach code models to be secure by synthesizing vuln/fix pairs and using two-step generation that adds needed libraries first

Overview

Decision Snapshot: Ready For Pilot

Results come from four code models and two public security benchmarks (plus HumanEval for functional correctness) and show large reductions in detected vulnerabilities. However, detection relies on static analysis (CodeQL), and only Python and C/C++ were tested.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 70%

Authors

Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz


Why It Matters For Business

HexaCoder gives a practical, automatable path to reduce insecure code generation from LLMs: synthesize repair data, then fine-tune models on it. This lowers security risk in AI-assisted coding while keeping functional accuracy similar.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

HexaCoder builds a pipeline that (1) automatically synthesizes vulnerable code and GPT‑4 fixes guided by a static security oracle (CodeQL) and mitigation hints, (2) fine-tunes code models on these vuln/fix pairs with LoRA, and (3) uses two-step generation at inference (first add the needed libraries, then generate the code). Across three benchmarks and four CodeLMs, HexaCoder sharply reduces vulnerable outputs (up to ~85% fewer among top-5 samples in one reported case) while keeping functional accuracy similar.
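The two-step generation in step (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `two_step_generate` helper, the wording of the import request, and the `fake_model` stand-in are all assumptions made for the example. Any callable that maps a prompt string to a completion string could be passed in.

```python
from typing import Callable


def two_step_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Hypothetical helper: first ask for libraries, then generate the code."""
    # Step 1: ask the model only for the libraries the task needs.
    # The request wording below is an assumption, not taken from the paper.
    import_request = "# Libraries needed for the task below\n" + prompt + "\n"
    raw = generate(import_request)

    # Keep only lines that look like Python imports, dropping duplicates
    # while preserving order.
    imports = [
        line.strip()
        for line in raw.splitlines()
        if line.strip().startswith(("import ", "from "))
    ]
    imports = list(dict.fromkeys(imports))

    # Step 2: put the imports in front of the original prompt and generate code.
    header = "\n".join(imports) + "\n\n" if imports else ""
    code_prompt = header + prompt
    return code_prompt + generate(code_prompt)


def fake_model(text: str) -> str:
    """Stand-in for a real CodeLM so the sketch runs without any model."""
    if text.startswith("# Libraries needed"):
        return "import secrets\nimport secrets\nprint('noise')\n"
    return "    return secrets.token_hex(16)\n"


prompt = "def new_session_token():\n"
result = two_step_generate(prompt, fake_model)
print(result)
```

The point of the split is that the second call already sees the imported libraries in its context, so the completion can use them directly.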

Problem Statement

Large code LLMs often generate insecure code because training corpora contain vulnerable examples. Curating large, labeled secure datasets is costly. We need an automated way to create training data that teaches models to produce secure code without breaking functionality.

Main Contribution

An oracle-guided synthesis pipeline that uses a static analyzer report plus mitigation hints to prompt GPT-4 to turn vulnerable code into fixed code, producing vuln/fix pairs at scale.

A LoRA-based fine-tuning procedure that trains CodeLMs on the synthesized vuln/fix pairs using a masked loss focused on tokens changed for security.
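A masked loss of this kind can be sketched in plain Python. The summary does not say how changed tokens are identified, so the diff-based mask below is an assumption, as are the helper names `changed_token_mask` and `masked_nll`, the pre-split token lists, and the example log-probabilities. A real training loop would apply the same mask to the model's per-token cross-entropy.

```python
import difflib
from typing import List, Sequence


def changed_token_mask(vuln_tokens: Sequence[str], fixed_tokens: Sequence[str]) -> List[int]:
    """Return 1 for tokens of the fixed code that differ from the vulnerable code."""
    mask = [0] * len(fixed_tokens)
    matcher = difflib.SequenceMatcher(a=vuln_tokens, b=fixed_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark spans that are new in the fixed code.
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask


def masked_nll(token_log_probs: Sequence[float], mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over the masked (security-relevant) tokens only."""
    selected = [-lp for lp, m in zip(token_log_probs, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Illustrative pair: an unsafe YAML load replaced by its safe counterpart.
vuln = ["yaml", ".", "load", "(", "data", ")"]
fixed = ["yaml", ".", "safe_load", "(", "data", ")"]
mask = changed_token_mask(vuln, fixed)

# Made-up per-token log-probabilities that a model might assign to the fixed code.
log_probs = [-0.1, -0.2, -2.0, -0.1, -0.3, -0.1]
loss = masked_nll(log_probs, mask)
print(mask, loss)
```

Only the `safe_load` token contributes to the loss here, so the gradient signal concentrates on the edit that removes the vulnerability rather than on the unchanged boilerplate.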

Key Findings

The synthesis pipeline repaired 1,776 out of 2,042 vulnerable samples.

Numbers: fixed 1,776/2,042 (≈87.0%)

Practical Use: You can automatically produce a large, validated secure-code corpus (≈87% success) to fine-tune models rather than hand-labeling many examples.

Evidence Ref: Sec. 5.2; Table 2

Providing the CodeQL report plus mitigation hints increased automated repair rates compared with no report or the CodeQL report alone.

Numbers: average repair rate: no report 42.66% → CodeQL 63.33% → CodeQL+Hint 83.99%

Practical Use: Include both the static oracle output and short mitigation hints in prompts to get far better fixed-code outputs from a repair LLM.

Evidence Ref: Sec. 5.2.1; Table 3
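The three prompt variants compared in this finding could be assembled along these lines. This is only a sketch: the `build_repair_prompt` helper, the section labels, and the example report and hint text are hypothetical and are not the prompts used in the paper.

```python
from typing import Optional


def build_repair_prompt(
    code: str,
    codeql_report: Optional[str] = None,
    hint: Optional[str] = None,
) -> str:
    """Hypothetical prompt builder covering the None / CodeQL / CodeQL+Hint variants."""
    parts = ["Fix the security vulnerability in the following code.", "Code:", code]
    if codeql_report:
        parts += ["Static analysis report (CodeQL):", codeql_report]
    if hint:
        parts += ["Mitigation hint:", hint]
    parts.append("Return only the fixed code.")
    return "\n".join(parts)


vulnerable_code = "import yaml\n\ndef load_config(text):\n    return yaml.load(text)"
# Made-up report and hint text, for illustration only.
report = "Deserialization of untrusted data (CWE-502) in load_config."
hint = "Use a loader that cannot construct arbitrary objects."

prompt_none = build_repair_prompt(vulnerable_code)
prompt_codeql = build_repair_prompt(vulnerable_code, codeql_report=report)
prompt_full = build_repair_prompt(vulnerable_code, codeql_report=report, hint=hint)
print(prompt_full)
```

Keeping the variants behind one function makes it easy to rerun the same ablation on your own samples and check whether the reported ordering holds for your CWE mix.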

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Synthesis repair rate	1,776 / 2,042 fixed (≈87.0%)	—	—	vulnerable set from [32]	We ran GPT‑4 guided by CodeQL reports; 1,776 out of 2,042 vulnerable samples produced fixed code with no CodeQL findings	Sec.5.2; Table 2
Effect of security report components on repair	None 42.66% → CodeQL 63.33% → CodeQL+Hint 83.99%	No report 42.66%	+41.33 pp (CodeQL+Hint vs None)	30 samples per CWE experiment	Repair rates measured on 30 random samples/CWE with three prompt variants	Sec.5.2.1; Table 3
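The percentages and deltas in the table can be rechecked with a few lines of arithmetic. The snippet only restates figures already reported above; the variable names are illustrative.

```python
# Overall synthesis repair rate (Sec. 5.2; Table 2).
fixed, total = 1776, 2042
overall_rate = 100 * fixed / total

# Average repair rate per prompt variant (Sec. 5.2.1; Table 3), in percent.
rates = {"none": 42.66, "codeql": 63.33, "codeql+hint": 83.99}

# Gains in percentage points (pp).
gain_from_report = rates["codeql"] - rates["none"]
gain_from_hint = rates["codeql+hint"] - rates["codeql"]
gain_total = rates["codeql+hint"] - rates["none"]

print(f"overall repair rate: {overall_rate:.1f}%")
print(f"report: +{gain_from_report:.2f} pp, hint: +{gain_from_hint:.2f} pp, "
      f"total: +{gain_total:.2f} pp")
```

The report and the hint each contribute roughly 20 percentage points. The 87.0% and 83.99% figures come from different sample sets (the full vulnerable set versus 30 samples per CWE), so they are not expected to match.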

What To Try In 7 Days

Run CodeQL on a sample of model outputs to measure current vuln rate.

Use an instruction-tuned LLM (e.g., GPT‑4) plus CodeQL reports to synthesize a small vuln/fix set for 1–2 CWE types.

LoRA-fine-tune a target CodeLM on that small set and validate top-5 outputs on the same static checks.
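For the first step, a baseline vulnerability rate can be computed from a CodeQL result file in SARIF format. The sketch below assumes each generated sample was written to its own file before analysis, so a file with at least one finding counts as one vulnerable sample. The helper names and the inline SARIF excerpt are illustrative, not real analyzer output.

```python
import json
from typing import Set

# Trimmed, made-up SARIF excerpt: two findings in one file, one in another.
SARIF_TEXT = """
{
  "runs": [{
    "results": [
      {"ruleId": "rule-a", "locations": [{"physicalLocation":
        {"artifactLocation": {"uri": "samples/sample_0.py"}}}]},
      {"ruleId": "rule-b", "locations": [{"physicalLocation":
        {"artifactLocation": {"uri": "samples/sample_0.py"}}}]},
      {"ruleId": "rule-a", "locations": [{"physicalLocation":
        {"artifactLocation": {"uri": "samples/sample_3.py"}}}]}
    ]
  }]
}
"""


def flagged_files(sarif: dict) -> Set[str]:
    """Collect the files that have at least one finding."""
    files = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri")
                if uri:
                    files.add(uri)
    return files


def vulnerability_rate(sarif: dict, total_samples: int) -> float:
    """Fraction of samples flagged, assuming one generated sample per file."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return len(flagged_files(sarif)) / total_samples


sarif = json.loads(SARIF_TEXT)
rate = vulnerability_rate(sarif, total_samples=5)
print(f"{len(flagged_files(sarif))} of 5 samples flagged ({rate:.0%})")
```

Running the same count before and after fine-tuning, on the same prompts and the same query suite, gives a like-for-like comparison for the third step.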

Optimization Features

Training Optimization

LoRA, masked loss focused on changed tokens

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/hexacoder-ai/hexacoder

Data URLs

https://github.com/hexacoder-ai/hexacoder

Risks & Boundaries

Limitations

Relies on CodeQL static analysis, which can miss or misclassify vulnerabilities.

Evaluation only covers Python and C/C++; other languages not tested.

When Not To Use

When you need provable semantic preservation of fixes: this pipeline does not formally verify functional equivalence.

When dynamic or runtime-only bugs (race conditions, some memory issues) are the primary concern: static analysis may miss them.

Failure Modes

Static oracle misses a vulnerability; the synthesized 'fixed' code remains insecure.

Model repair introduces functional regressions or removes intended behavior.

Core Entities

Models

CodeGen-350M-multi, CodeGen-2B-multi, InCoder-6B, DeepSeek-Coder-V2-16B, GPT-4, GPT-3.5, Codex

Metrics

number of vulnerable codes (top-k samples), pass@k (functional correctness), repair rate (fix success)
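The summary names pass@k but not how it is estimated. A common choice is the unbiased estimator introduced alongside HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. Whether this paper uses exactly that estimator is an assumption here. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of which pass the unit tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(f"pass@1 = {p1:.3f}, pass@5 = {p5:.3f}")
```

The benchmark score is the mean of this value over all problems. Comparing it before and after fine-tuning is what supports a claim that functional accuracy stays similar.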

Datasets

Synthesized vuln/fix pairs (1,776 examples), CodeLMSec benchmark, Pearce et al. benchmark, HumanEval

Benchmarks

CodeLMSec, Pearce et al. (Copilot scenarios), HumanEval

