Overview
Production Readiness: 0.65
Novelty Score: 0.7
Cost Impact Score: 0.6
Citation Count: 1
Why It Matters For Business
HexaCoder offers a practical, automatable way to reduce insecure code generation by LLMs: it synthesizes repair data and fine-tunes models on it, lowering security risk in AI-assisted coding while keeping functional correctness similar on the evaluated tasks.
Summary TLDR
HexaCoder builds a pipeline that (1) automatically synthesizes pairs of vulnerable code and GPT‑4 fixes, guided by a static security oracle (CodeQL) and mitigation hints, (2) fine-tunes code models on these vuln/fix pairs via LoRA, and (3) uses two-step generation at inference (first add libraries, then generate code). Across three benchmarks and four CodeLMs, HexaCoder substantially reduces vulnerable outputs (up to ~85% fewer among the top-5 samples for one model and benchmark) while keeping functional accuracy similar.
Problem Statement
Large code LLMs often generate insecure code because training corpora contain vulnerable examples. Curating large, labeled secure datasets is costly. We need an automated way to create training data that teaches models to produce secure code without breaking functionality.
Main Contribution
An oracle-guided synthesis pipeline that uses a static analyzer report plus mitigation hints to prompt GPT-4 to turn vulnerable code into fixed code, producing vuln/fix pairs at scale.
A LoRA-based fine-tuning procedure that trains CodeLMs on the synthesized vuln/fix pairs using a masked loss focused on tokens changed for security.
A two-step inference method that first generates required libraries/modules and then completes the code, helping the model include security-relevant libraries and reducing vulnerabilities.
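The two-step inference idea can be illustrated with a minimal sketch. It assumes a hypothetical `generate` callable that maps a prompt string to a completion string. The names `two_step_generate`, `library_prefix`, and `code_prompt` are illustrative and not taken from the paper, and the exact prompt layout HexaCoder uses may differ.

```python
from typing import Callable


def two_step_generate(
    library_prefix: str,
    code_prompt: str,
    generate: Callable[[str], str],
) -> str:
    """Sketch of two-step generation: extend the libraries, then complete the code.

    `generate` is a hypothetical stand-in for a call to a code model.
    """
    # Step 1: let the model extend only the library/import section.
    extra_libraries = generate(library_prefix).strip()
    libraries = library_prefix.rstrip("\n")
    if extra_libraries:
        libraries += "\n" + extra_libraries

    # Step 2: complete the code, conditioned on the extended library section.
    full_prompt = libraries + "\n\n" + code_prompt
    completion = generate(full_prompt)
    return full_prompt + completion


def fake_generate(text: str) -> str:
    """Toy stand-in for a model so the sketch runs without any dependency."""
    if "def " not in text:
        return "import secrets\n"
    return "    return secrets.token_hex(16)\n"


result = two_step_generate("import os\n", "def make_token():\n", fake_generate)
print(result)
```

Because the first step may add libraries that the second step never uses, a cleanup pass for unused imports can be useful afterwards.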
Key Findings
The synthesis pipeline repaired 1,776 out of 2,042 vulnerable samples.
Providing CodeQL report plus mitigation hints increased automated repair rates versus no report or CodeQL alone.
Fine-tuning + two-step generation reduced vulnerable outputs by up to ~85% on a tested model and benchmark.
HexaCoder preserved or modestly improved functional correctness on evaluated tasks.
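As a quick check of the arithmetic behind these findings, the sketch below computes the repair rate from the reported counts and shows how a relative reduction such as "~85% fewer" is derived. The before/after counts in the second example are hypothetical placeholders chosen only to illustrate the formula; they are not results from the paper.

```python
def repair_rate(repaired: int, total: int) -> float:
    """Fraction of vulnerable samples that were successfully repaired."""
    return repaired / total


def relative_reduction(before: int, after: int) -> float:
    """Relative drop in vulnerable outputs, e.g. 0.85 means 85% fewer."""
    return (before - after) / before


# Counts reported for the synthesis pipeline.
rate = repair_rate(1776, 2042)
print(f"repair rate: {rate:.1%}")          # repair rate: 87.0%
print(f"unrepaired samples: {2042 - 1776}")  # unrepaired samples: 266

# Hypothetical counts, only to show how an 85% reduction would be computed.
print(f"reduction: {relative_reduction(100, 15):.0%}")  # reduction: 85%
```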
Results
Synthesis repair rate
Effect of security report components on repair
Vulnerable outputs reduced (example)
Functional correctness (example)
Two-step generation impact (example)
Who Should Care
What To Try In 7 Days
Run CodeQL on a sample of model outputs to measure current vuln rate.
Use an instruction-tuned LLM (e.g., GPT‑4) plus CodeQL reports to synthesize a small vuln/fix set for 1–2 CWE types.
LoRA-fine-tune a target CodeLM on that small set and validate top-5 outputs on the same static checks.
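For the first step, one possible way to turn static-analysis output into a vulnerability rate is sketched below. It assumes the CodeQL results were exported in SARIF format and that each generated sample was saved as its own file. The file names and the inline SARIF document are made up for illustration, and the function names are hypothetical.

```python
import json


def vulnerable_files(sarif: dict) -> set:
    """Collect the URIs of all files that have at least one reported result."""
    flagged = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                uri = (
                    location.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    flagged.add(uri)
    return flagged


def vulnerability_rate(sarif: dict, total_samples: int) -> float:
    """Share of generated samples flagged by the static analyzer."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return len(vulnerable_files(sarif)) / total_samples


# Illustrative SARIF fragment (not real analyzer output).
example = json.loads("""
{"runs": [{"results": [
  {"ruleId": "py/sql-injection",
   "locations": [{"physicalLocation": {"artifactLocation": {"uri": "sample_001.py"}}}]},
  {"ruleId": "py/sql-injection",
   "locations": [{"physicalLocation": {"artifactLocation": {"uri": "sample_001.py"}}}]},
  {"ruleId": "py/command-line-injection",
   "locations": [{"physicalLocation": {"artifactLocation": {"uri": "sample_004.py"}}}]}
]}]}
""")

print(sorted(vulnerable_files(example)))   # ['sample_001.py', 'sample_004.py']
print(vulnerability_rate(example, 5))      # 0.4
```

Counting distinct files rather than individual findings keeps a sample with several alerts from being counted more than once.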
Optimization Features
Training Optimization
- LoRA
- masked loss focused on changed tokens
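The idea of a masked loss focused on changed tokens can be sketched in plain Python. This is an assumption-laden illustration: it derives the mask from a token-level diff using the standard library's `difflib`, and averages per-token losses only where the fixed code differs from the vulnerable code. How the paper actually builds the mask and weights tokens may differ, and the per-token loss values below are made up.

```python
import difflib


def changed_token_mask(vuln_tokens: list, fixed_tokens: list) -> list:
    """Return a 0/1 mask over fixed_tokens: 1 where a token was inserted or replaced."""
    mask = [0] * len(fixed_tokens)
    matcher = difflib.SequenceMatcher(a=vuln_tokens, b=fixed_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                mask[j] = 1
    return mask


def masked_mean_loss(token_losses: list, mask: list) -> float:
    """Average the per-token losses over masked (security-relevant) positions only."""
    selected = [loss for loss, keep in zip(token_losses, mask) if keep]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


vuln = ["yaml", ".", "load", "(", "data", ")"]
fixed = ["yaml", ".", "safe_load", "(", "data", ")"]
mask = changed_token_mask(vuln, fixed)
print(mask)  # [0, 0, 1, 0, 0, 0]

# Made-up per-token losses for the fixed sequence.
losses = [0.1, 0.2, 2.0, 0.1, 0.3, 0.1]
print(masked_mean_loss(losses, mask))  # 2.0
```

In a real training loop the same mask would be applied to the per-token cross-entropy tensor before reduction, so that gradients come mainly from the tokens that implement the fix.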
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Relies on CodeQL static analysis, which can miss or misclassify vulnerabilities.
- Evaluation covers only Python and C/C++; other languages were not tested.
- Fixed code from GPT‑4 may change behavior; functionality preservation is not guaranteed for every example.
- Two-step generation can insert unused libraries; extra cleanup may be needed.
When Not To Use
- If you need provable semantic preservation of fixes: this pipeline does not formally verify functional equivalence.
- When dynamic or runtime-only bugs (race conditions, some memory issues) are the primary concern: static analysis may miss them.
- If you cannot accept a dependency on a commercial repair model such as GPT‑4 and have no replacement model to substitute for it.
Failure Modes
- Static oracle misses a vulnerability; the synthesized 'fixed' code remains insecure.
- Model repair introduces functional regressions or removes intended behavior.
- Data imbalance: few examples for a CWE can lead to worse results (noted for CWE-020).
Core Entities
Models
- CodeGen-350M-multi
- CodeGen-2B-multi
- InCoder-6B
- DeepSeek-Coder-V2-16B
- GPT-4
- GPT-3.5
- Codex
Metrics
- number of vulnerable code samples (among the top-k samples)
- pass@k (functional correctness)
- repair rate (fix success)
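For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: with n samples per task of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; whether the paper uses exactly this estimator and which n it samples are assumptions here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that pass, k: budget of attempts.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 3 of which pass.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The per-task values are then averaged over all tasks in the benchmark.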
Datasets
- Synthesized vuln/fix pairs (1,776 examples)
- CodeLMSec benchmark
- Pearce et al. benchmark
- HumanEval
Benchmarks
- CodeLMSec
- Pearce et al. (Copilot scenarios)
- HumanEval

