Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
54
Why It Matters For Business
Code LLMs can speed up development, automate routine coding, and support junior engineers. Open-source instruction-tuned models now match many closed APIs on standard benchmarks, which makes in-house deployment feasible. They still need to be evaluated on real repository-scale work and against safety constraints before adoption.
Summary TLDR
This paper is a systematic, practice-oriented survey of Large Language Models used to generate source code from natural-language prompts. It reviews data sources and cleaning, pre-training and instruction-tuning, synthetic instruction generation, reinforcement learning from execution feedback, retrieval-augmented generation, repository-level and long-context methods, autonomous coding agents, and evaluation practices. It compiles benchmark numbers (HumanEval, MBPP, BigCodeBench), highlights that instruction tuning and synthetic data strongly move the needle, and argues that current function-level benchmarks are saturating—pushing the field toward repository-scale tasks and better safety/eval
Problem Statement
Code generation with LLMs is booming, but there is no up-to-date, focused literature review that covers data curation, instruction tuning, evaluation gaps, repository- and retrieval-level code generation, agentic systems, and safety implications. Practitioners need a single reference to compare models, data, and benchmarks and to identify which problems remain unsolved in realistic development.
Main Contribution
A taxonomy covering the full code-LLM lifecycle: data, pre-training, instruction tuning, feedback, retrieval, agents, and evaluation
A consolidated comparison of recent models on HumanEval, MBPP, and BigCodeBench with concrete pass@1 numbers
A focused review of data curation and synthetic instruction generation methods (Self-Instruct, Evol-Instruct, OSS-Instruct)
A discussion of practical gaps: repository-level generation, evaluation blind spots, safety, license/privacy risks, and environmental costs
A public GitHub resource page for ongoing updates
Key Findings
Models improved dramatically on small-function benchmarks over recent years.
Instruction tuning and synthetic instruction data substantially boost pass@1.
Open-source code models now rival large closed models on standard benchmarks.
Function-level benchmarks are saturating and miss real development challenges.
Results
HumanEval pass@1
MBPP pass@1
BigCodeBench pass@1
Who Should Care
What To Try In 7 Days
Run an open-source instruct-tuned model (e.g., Qwen2.5-Coder-Instruct or StarCoder2-Instruct) on a handful of internal unit-testable functions
Compare Hugging Face checkpoints against an API model on your MBPP-like tasks and one repo-level task (use RepoEval or simple unit tests)
Integrate a retrieval step (docs + local repo) before generation and measure changes in failing tests and hallucinations
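The retrieval step in the last item can be prototyped without any infrastructure. The sketch below is a minimal illustration, assuming a plain token-overlap (Jaccard) score in place of a real embedding or BM25 index; the names `tokenize`, `retrieve`, and `build_prompt` and the prompt layout are hypothetical and not taken from the paper.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case identifier-like tokens; punctuation and numbers are ignored."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))


def retrieve(query: str, snippets: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank snippets by Jaccard overlap with the query and return their names."""
    q = tokenize(query)
    scored = []
    for name, body in snippets.items():
        t = tokenize(body)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken by name so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


def build_prompt(task: str, snippets: dict[str, str], top_k: int = 2) -> str:
    """Prepend the retrieved snippets to the task (assumed prompt layout)."""
    names = retrieve(task, snippets, top_k)
    context = "\n\n".join(f"# file: {n}\n{snippets[n]}" for n in names)
    return f"{context}\n\n# task: {task}\n"
```

Running the same tasks with and without `build_prompt` gives a simple before/after comparison of failing tests; a production setup would likely swap the scorer for a proper index.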
Agent Features
Memory
- episodic memory (store reflections)
- short-term context memory
Planning
- multi-step edit plans
- task decomposition into subgoals
Tool Use
- unit test executor
- shell and bash tools
- API/tool calling for build and run
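A unit test executor of the kind listed above can be approximated in a few lines. This is a minimal sketch, assuming the candidate and its tests are plain Python source strings and that a non-zero exit code means failure; `run_candidate` is a hypothetical helper, and a subprocess with a timeout is not a security sandbox, so untrusted model output should still run inside a container or VM.

```python
import os
import subprocess
import sys
import tempfile


def run_candidate(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Return True if the generated code passes its assert-based tests."""
    program = code + "\n\n" + tests + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            # Run in a separate interpreter so crashes and infinite loops
            # in the candidate cannot take down the calling process.
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0
```

An agent loop would typically feed the captured stderr back to the model as execution feedback; that part is omitted here.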
Frameworks
- MetaGPT
- AgentCoder
- AutoGen
- OpenDevin
Is Agentic
true
Architectures
- single-agent
- multi-agent
- planner-based (CodePlan)
Collaboration
- role-based agents (Product Manager, Architect, Engineer)
- agent communication and coordination
Optimization Features
Token Efficiency
- context compression
- prompt compression
Infra Optimization
- specialized hardware (TPU/NPU)
- MoE deployment that activates only a subset of parameters per token
Model Optimization
- MoE
- model distillation
- pruning
System Optimization
- LoRA
- efficient GPU/TPU utilization
Training Optimization
- instruction tuning
- RL
- data synthesis (Self-/Evol-/OSS-Instruct)
Inference Optimization
- prompt engineering (CoT, Self-Refine, PoT)
- context selection and selective retrieval
- long-context position encodings (RoPE, ALiBi)
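To make the position-encoding item concrete, the sketch below applies rotary position embeddings (RoPE) to a single vector in pure Python. It assumes the adjacent-pair layout and a base of 10000; some implementations pair dimensions differently, so treat it as an illustration rather than any specific model's code. The function name `rope` is hypothetical.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each (even, odd) pair of `vec` by an angle proportional to `pos`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, d, 2):
        # Pair j = i / 2 uses frequency base^(-2j/d), which equals base^(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out
```

The property that matters for attention is that the dot product of a rotated query and a rotated key depends only on the distance between their positions, not on the absolute positions.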
Reproducibility
Code URLs
Data URLs
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Function-level benchmarks like HumanEval are saturating and do not reflect repository- or system-level coding challenges
- Synthetic instruction datasets can introduce bias and lack coverage of rare or domain-specific cases
- Privacy and memorization risks from raw crawled code remain unresolved without careful filtering
- Repository-level generation still struggles with cross-file context, naming conventions, and long-range dependencies
- LLM-based evaluation (LLM-as-a-judge) inherits biases and reasoning limits of the judge model
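The privacy limitation above refers to filtering crawled code before training. As a rough illustration only, the sketch below redacts email addresses and IPv4-shaped strings with regular expressions; the patterns, the placeholder tokens, and the `redact` helper are assumptions, not the filtering pipeline of any dataset named here. The IPv4 pattern will also hit four-part version strings, which is one reason real pipelines use more careful detectors.

```python
import re

# Deliberately simple patterns; they trade precision for readability.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> tuple[str, int]:
    """Replace matches with placeholders and return (clean_text, match_count)."""
    text, n_email = EMAIL.subn("<EMAIL>", text)
    text, n_ip = IPV4.subn("<IP>", text)
    return text, n_email + n_ip
```

The match count can double as a file-level signal: files with many hits may be better dropped than patched.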
When Not To Use
- Generating code for security-critical applications without human review and formal verification
- Handling highly domain-specific or low-resource languages where training data is scarce
- Large repository refactors without retrieval and careful testing
- Situations demanding provable guarantees or strict license compliance
Failure Modes
- Hallucinated APIs or incorrect library calls that compile but are semantically wrong
- Leaking private or personal data memorized from training corpora
- Generating insecure or vulnerable code patterns
- Over-reliance on retrieved context that is noisy or out-of-date
- Degradation on long, multi-file tasks due to context length limits
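The first failure mode, hallucinated APIs, can be partly caught before execution. The sketch below is a minimal static check, assuming only `import x` and `import x as y` forms and ignoring local names that shadow a module; `missing_attributes` is a hypothetical helper. It flags attributes that do not exist on the imported module, but it cannot detect calls that exist and are used incorrectly, so tests remain necessary. It imports the modules it checks, which runs their import-time code.

```python
import ast
import importlib


def missing_attributes(code: str) -> list[str]:
    """List `module.attr` references in `code` that do not resolve."""
    tree = ast.parse(code)

    # Map each locally bound name to the module it refers to.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top_level = alias.name.split(".")[0]
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    aliases[top_level] = top_level

    problems: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module_name = aliases.get(node.value.id)
            if module_name is None:
                continue
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                problems.add(module_name)
                continue
            if not hasattr(module, node.attr):
                problems.add(f"{module_name}.{node.attr}")
    return sorted(problems)
```

A non-empty result is a cheap signal to regenerate or to retrieve the library's documentation before retrying.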
Core Entities
Models
- GPT-4
- ChatGPT/GPT-3.5
- Claude-3.5
- PaLM-Coder
- Codex
- StarCoder
- StarCoder2
- WizardCoder
- DeepSeek-Coder-V2
- Qwen2.5-Coder
- Code Llama
- CodeGemma
- Magicoder
- Codestral
- Phi-1
- CodeGen
- AlphaCode
Metrics
- pass@1
- Accuracy
- test-case average
- CodeBLEU
- perplexity
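The summary reports pass@1 without defining it. The sketch below uses the unbiased pass@k estimator commonly applied with HumanEval: draw n samples per problem, count the c that pass all tests, and estimate the chance that at least one of k samples is correct as 1 - C(n-c, k) / C(n, k). The function name and argument checks are illustrative choices, and whether each number in the survey was computed this way is not stated in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass; a benchmark score is the mean of this value over all problems.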
Datasets
- The Stack
- The Stack v2
- GitHub (BigQuery)
- The Pile (code subset)
- CodeParrot
- CodeSearchNet
- ROOTS
- CommitPackFT
- CodeAlpaca-20K
- Evol-Instruct-Code-80k
- Magicoder-OSS-Instruct-75k
- Self-OSS-Instruct-SC2-Exec-Filter-50k
Benchmarks
- HumanEval
- HumanEval+
- MBPP
- BigCodeBench
- APPS
- RepoEval
- ClassEval
- SWE-bench
- MBXP
- LiveCodeBench

