A practical, up-to-date survey of LLMs focused on generating code from natural language

June 1, 2024 | 9 min

Overview

Decision Snapshot: Needs Validation

The survey compiles many empirical numbers and practical case studies; evidence is strong for benchmark gains and for the effect of instruction tuning, but real-world repo-level readiness and safety require more targeted benchmarks and on-prem evaluations.

Citations: 54

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Who Should Care

Summary TLDR

This paper is a systematic, practice-oriented survey of Large Language Models used to generate source code from natural-language prompts. It reviews data sources and cleaning, pre-training and instruction-tuning, synthetic instruction generation, reinforcement learning from execution feedback, retrieval-augmented generation, repository-level and long-context methods, autonomous coding agents, and evaluation practices. It compiles benchmark numbers (HumanEval, MBPP, BigCodeBench), highlights that instruction tuning and synthetic data strongly move the needle, and argues that current function-level benchmarks are saturating—pushing the field toward repository-scale tasks and better safety/eval

Problem Statement

Code generation with LLMs is booming, but there is no up-to-date, focused literature review that covers data curation, instruction tuning, evaluation gaps, repository- and retrieval-level code generation, agentic systems, and safety implications. Practitioners need a single reference to compare models, data, and benchmarks and to identify which problems remain unsolved in realistic development.

Main Contribution

A taxonomy covering the full code-LLM lifecycle: data, pre-training, instruction tuning, feedback, retrieval, agents, and evaluation

A consolidated comparison of recent models on HumanEval, MBPP, and BigCodeBench with concrete pass@1 numbers

Key Findings

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB), as reported in the survey

Practical Use: Don't judge current LLM code ability only on simple function tests; high pass@1 on HumanEval no longer guarantees real-world readiness

Evidence Ref: Introduction paragraph and HumanEval discussion

Instruction tuning and synthetic instruction data substantially boost pass@1.

Numbers: Table 10 shows average H+M pass@1 gains of about 26% for Qwen2.5-Coder-Instruct and about 35% for StarCoder2-Instruct

Practical Use: For practical gains, fine-tune or instruction-tune models (or use instruct variants) rather than relying on parameter scaling alone

Evidence Ref: Section 5.4 and Table 10

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HumanEval pass@1 | GPT-4o: 91% | - | - | HumanEval | Table 9 reports GPT-4o-0513 pass@1 = 91% | Table 9
HumanEval pass@1 | DeepSeek-Coder-V2-Instruct: 90.2% | - | - | HumanEval | Table 9 reports 90.2% for DeepSeek-Coder-V2-Instruct | Table 9
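
The pass@1 figures above are instances of the pass@k metric. As a minimal sketch, the following computes the unbiased pass@k estimator commonly used with HumanEval: draw n samples per problem, count the c samples that pass all tests, and estimate the chance that at least one of k samples passes. The function names and the per-problem (n, c) input format are illustrative assumptions; this summary does not state the exact evaluation protocol behind each number in Table 9.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed every test, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is a hypothetical (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n per problem, so a benchmark-level pass@1 is simply the mean fraction of passing samples across problems.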

What To Try In 7 Days

Run an open-source instruct-tuned model (e.g., Qwen2.5-Coder-Instruct or StarCoder2-Instruct) on a handful of internal unit-testable functions

Compare Hugging Face checkpoints against an API model on your MBPP-like tasks and one repo-level task (use RepoEval or simple unit tests)

Integrate a retrieval step (docs + local repo) before generation and measure changes in failing tests and hallucinations
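
One way to run the first experiment above is a small harness that executes each generated function against its unit tests in a fresh interpreter. This is a sketch under stated assumptions: `passes_tests`, `pass_rate`, and the dictionaries of candidate and test sources are hypothetical names, the tests are assumed to be plain `assert` statements, and a subprocess with a timeout is not a security sandbox, so untrusted model output should still run inside a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code followed by assert-style tests in a separate process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_src + "\n\n" + test_src + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs (e.g. infinite loops) as failures.
            return False
        # A failed assert or any uncaught exception gives a non-zero exit code.
        return proc.returncode == 0


def pass_rate(candidates: dict[str, str], tests: dict[str, str]) -> float:
    """Fraction of tasks whose candidate passes; both dicts are keyed by task name."""
    passed = sum(passes_tests(candidates[name], tests[name]) for name in tests)
    return passed / len(tests)
```

Running the same harness over outputs from an open checkpoint and from an API model gives directly comparable pass rates on internal tasks.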

Agent Features

Memory
episodic memory (store reflections), short-term context memory
Planning
multi-step edit plans, task decomposition into subgoals
Tool Use
unit test executor, shell and bash tools, API/tool calling for build and run
Frameworks
MetaGPT, AgentCoder (multi-agent pipeline), AutoGen, OpenDevin
Is Agentic

Yes

Architectures
single-agent, multi-agent, planner-based (CodePlan)
Collaboration
role-based agents (Product Manager, Architect, Engineer), agent communication and coordination

Optimization Features

Token Efficiency
context compression, prompt compression
Infra Optimization
specialized hardware (TPU/NPU), activation-parameter-efficient MoE deployment
Model Optimization
MoE, model distillation, pruning
System Optimization
LoRA, efficient GPU/TPU utilization
Training Optimization
instruction tuning, RL, data synthesis (Self-/Evol-/OSS-Instruct)
Inference Optimization
prompt engineering (CoT, Self-Refine, PoT), context selection and selective retrieval, long-context position encodings (RoPE, ALiBi)
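
To make "context selection and selective retrieval" concrete, here is a deliberately simple sketch that picks repository snippets by lexical overlap with the task description and prepends only those that clear a score threshold. It uses Jaccard similarity over lowercase word tokens as a stand-in; it is not the retrieval method of any system named in the survey, and practical setups typically use embedding-based or BM25 retrievers. All function names, the threshold, and the prompt layout are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; snake_case identifiers split on underscores."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_context(query: str, snippets: dict[str, str],
                   top_k: int = 2, min_score: float = 0.05) -> list[str]:
    """Return up to top_k snippet names, best match first, skipping weak matches."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(body)), name) for name, body in snippets.items()]
    scored = [(score, name) for score, name in scored if score >= min_score]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


def build_prompt(task: str, snippets: dict[str, str], top_k: int = 2) -> str:
    """Prepend the selected snippets to the task; retrieve nothing if none qualify."""
    chosen = select_context(task, snippets, top_k)
    context = "\n\n".join(f"# file: {name}\n{snippets[name]}" for name in chosen)
    return f"{context}\n\n# task: {task}\n" if context else f"# task: {task}\n"
```

The `min_score` cutoff is what makes the retrieval selective: when nothing in the repository resembles the task, the prompt is sent without extra context rather than padded with irrelevant code.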

Risks & Boundaries

Limitations

Function-level benchmarks like HumanEval are saturating and do not reflect repository- or system-level coding challenges

Synthetic instruction datasets can introduce bias and lack coverage of rare or domain-specific cases

When Not To Use

Generating code for security-critical applications without human review and formal verification

Handling highly domain-specific or low-resource languages where training data is scarce

Failure Modes

Hallucinated APIs or incorrect library calls that compile but are semantically wrong

Leaking private or personal data memorized from training corpora
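
Part of the hallucinated-API failure mode above can be caught cheaply before any test runs. The sketch below parses generated Python, collects `import x` statements, and reports `x.attr` references where the attribute does not exist on the installed module. It is a partial check with known gaps: it ignores `from x import y`, can report false positives for submodules that have not been imported yet, imports each referenced module in order to inspect it, and cannot detect calls that exist but are used with the wrong semantics. The function name is hypothetical.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return 'module.attr' references whose attribute is absent on the module."""
    tree = ast.parse(source)

    # Map each local name bound by `import x` / `import x.y as z` to a module path.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top_level = alias.name.split(".")[0]
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    aliases[top_level] = top_level

    missing: set[str] = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(f"{module_name} (module not importable)")
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)
```

For example, generated code that calls `math.square_root(4)` is flagged as `math.square_root`, while `math.sqrt(4)` passes. Semantically wrong but existing calls still need unit tests and human review.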

Core Entities

Models

GPT-4, ChatGPT/GPT-3.5, Claude-3.5, PaLM-Coder, Codex, StarCoder, StarCoder2, WizardCoder, DeepSeek-Coder-V2, Qwen2.5-Coder, Code Llama, CodeGemma, Magicoder, Codestral, Phi-1, CodeGen, AlphaCode

Metrics

pass@1, Accuracy, test-case average, CodeBLEU, perplexity

Datasets

The Stack, The Stack v2, GitHub (BigQuery), The Pile (code subset), CodeParrot, CodeSearchNet, ROOTS, CommitPackFT, CodeAlpaca-20K, Evol-Instruct-Code-80k, Magicoder-OSS-Instruct-75k, Self-OSS-Instruct-SC2-Exec-Filter-50k

Benchmarks

HumanEval, HumanEval+, MBPP, BigCodeBench, APPS, RepoEval, ClassEval, SWE-bench, MBXP, LiveCodeBench