Overview
The survey compiles empirical benchmark results and practical case studies. Evidence is strong for benchmark gains and for the effect of instruction tuning, but claims about real-world repository-level readiness and safety need more targeted benchmarks and on-premises evaluations.
Citations: 54
Evidence Strength: 0.85
Confidence: 0.88
Risk Signals: 14
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
Code LLMs can speed development, automate routine coding, and augment junior engineers. Open-source instruct-tuned models now match many closed APIs on standard tasks, which makes in-house deployments feasible, but teams still need to evaluate them on real repo-scale work and against safety constraints.
Summary TLDR
This paper is a systematic, practice-oriented survey of large language models (LLMs) that generate source code from natural-language prompts. It reviews data sources and cleaning, pre-training and instruction tuning, synthetic instruction generation, reinforcement learning from execution feedback, retrieval-augmented generation, repository-level and long-context methods, autonomous coding agents, and evaluation practices. It compiles benchmark numbers (HumanEval, MBPP, BigCodeBench), highlights that instruction tuning and synthetic data substantially raise pass@1, and argues that current function-level benchmarks are saturating, pushing the field toward repository-scale tasks and better safety/eval
Problem Statement
Code generation with LLMs is booming, but there is no up-to-date, focused literature review that covers data curation, instruction tuning, evaluation gaps, repository- and retrieval-level code generation, agentic systems, and safety implications. Practitioners need a single reference to compare models, data, and benchmarks and to identify which problems remain unsolved in realistic development.
Main Contribution
A taxonomy covering the full code-LLM lifecycle: data, pre-training, instruction tuning, feedback, retrieval, agents, and evaluation
A consolidated comparison of recent models on HumanEval, MBPP, and BigCodeBench with concrete pass@1 numbers
Key Findings
Models improved dramatically on small-function benchmarks over recent years.
Instruction tuning and synthetic instruction data substantially boost pass@1.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HumanEval pass@1 | GPT-4o: 91% | — | — | HumanEval | Table 9 reports GPT-4o-0513 pass@1 = 91% | Table 9 |
| HumanEval pass@1 | DeepSeek-Coder-V2-Instruct: 90.2% | — | — | HumanEval | Table 9 reports 90.2% for DeepSeek-Coder-V2-Instruct | Table 9 |
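The pass@1 figures above are easier to interpret with the metric written out. The following is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval, where n samples are drawn per problem and c of them pass the tests. Whether every number in Table 9 was computed with this exact estimator is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one greedy sample per problem (n = 1, k = 1), this reduces to the fraction of problems solved, which is how a single pass@1 percentage is usually read.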
What To Try In 7 Days
Run an open-source instruct-tuned model (e.g., Qwen2.5-Coder-Instruct or StarCoder2-Instruct) on a handful of internal unit-testable functions
Compare Hugging Face checkpoints against an API model on your MBPP-like tasks and one repo-level task (use RepoEval or simple unit tests)
Integrate a retrieval step (docs + local repo) before generation and measure changes in failing tests and hallucinations
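The experiments above all reduce to the same loop: generate code for a task, run it against unit tests, and count passes. A minimal harness for that loop might look like the sketch below. The names `passes_tests`, `pass_rate`, and the `generate` callable are hypothetical; `generate` stands in for whichever model or API is being compared, and the tests are assumed to be plain `assert` statements.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus assert-based tests in a fresh interpreter.

    Returns True only if the process exits with code 0 before the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0


def pass_rate(tasks: list[tuple[str, str]], generate: Callable[[str], str]) -> float:
    """tasks holds (prompt, test_code) pairs; generate maps a prompt to code."""
    passed = sum(passes_tests(generate(prompt), tests) for prompt, tests in tasks)
    return passed / len(tasks)
```

Calling `pass_rate` once with a plain generator and once with a generator that prepends retrieved context gives a like-for-like comparison of failing tests. A subprocess with a timeout is not a security sandbox, so untrusted model output should still run inside a container or VM.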
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Training Optimization
Inference Optimization
Risks & Boundaries
Limitations
Function-level benchmarks like HumanEval are saturating and do not reflect repository- or system-level coding challenges
Synthetic instruction datasets can introduce bias and lack coverage of rare or domain-specific cases
When Not To Use
Generating code for security-critical applications without human review and formal verification
Handling highly domain-specific or low-resource languages where training data is scarce
Failure Modes
Hallucinated APIs or incorrect library calls that compile but are semantically wrong
Leaking private or personal data memorized from training corpora
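One part of the hallucinated-API failure mode can be screened for cheaply: references to module attributes that do not exist in the installed library. The sketch below is an illustrative check, not a method from the survey. It parses generated Python with `ast`, resolves `import x` and `import x as y` aliases, and flags `alias.attr` references that `hasattr` cannot find. The function name is hypothetical, and the check does not catch calls that exist but are used incorrectly.

```python
import ast
import importlib


def find_missing_module_attributes(source: str) -> list[str]:
    """Return 'module.attr' names referenced in source that the module lacks."""
    tree = ast.parse(source)

    # Map each local name bound by an import statement to its module path.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    # 'import a.b' binds only the top-level name 'a'.
                    top = alias.name.split(".")[0]
                    aliases[top] = top

    missing: set[str] = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(module_name)  # the module itself is unavailable
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)
```

The check imports each referenced module, so it should run in the same isolated environment as the generated code. It also ignores `from x import y` forms and local variables that shadow an import alias, which keeps the sketch short at the cost of coverage.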

