A practical, up-to-date survey of LLMs focused on generating code from natural language

Overview

Decision Snapshot: Needs Validation

The survey compiles many empirical numbers and practical case studies; evidence is strong for benchmark gains and for the effect of instruction tuning, but real-world repo-level readiness and safety require more targeted benchmarks and on-prem evaluations.

Citations: 54

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 14

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper is a systematic, practice-oriented survey of Large Language Models used to generate source code from natural-language prompts. It reviews data sources and cleaning, pre-training and instruction-tuning, synthetic instruction generation, reinforcement learning from execution feedback, retrieval-augmented generation, repository-level and long-context methods, autonomous coding agents, and evaluation practices. It compiles benchmark numbers (HumanEval, MBPP, BigCodeBench), highlights that instruction tuning and synthetic data strongly move the needle, and argues that current function-level benchmarks are saturating—pushing the field toward repository-scale tasks and better safety/eval

Problem Statement

Code generation with LLMs is booming, but there is no up-to-date, focused literature review that covers data curation, instruction tuning, evaluation gaps, repository- and retrieval-level code generation, agentic systems, and safety implications. Practitioners need a single reference to compare models, data, and benchmarks and to identify which problems remain unsolved in realistic development.

Main Contribution

A taxonomy covering the full code-LLM lifecycle: data, pre-training, instruction tuning, feedback, retrieval, agents, and evaluation

A consolidated comparison of recent models on HumanEval, MBPP, and BigCodeBench with concrete pass@1 numbers

Key Findings

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB), as reported in the survey

Practical Use: Don't judge current LLM code ability only on simple function tests; high pass@1 on HumanEval no longer guarantees real-world readiness

Evidence Ref: Introduction paragraph and HumanEval discussion

Instruction tuning and synthetic instruction data substantially boost pass@1.

Numbers: Table 10 shows average H+M pass@1 gains of roughly 26% for Qwen2.5-Coder-Instruct and roughly 35% for StarCoder2-Instruct

Practical Use: For practical gains, fine-tune or instruction-tune models (or use instruct variants) rather than relying on parameter scaling alone

Evidence Ref: Section 5.4 and Table 10

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HumanEval pass@1	GPT-4o: 91%	—	—	HumanEval	Table 9 reports GPT-4o-0513 pass@1 = 91%	Table 9
HumanEval pass@1	DeepSeek-Coder-V2-Instruct: 90.2%	—	—	HumanEval	Table 9 reports 90.2% for DeepSeek-Coder-V2-Instruct	Table 9
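
This summary does not say how each reported pass@1 figure was produced. A common choice is the unbiased pass@k estimator introduced alongside HumanEval, which for n samples per problem with c passing gives 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative assumptions, not names from the survey.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, each given as (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, so pass@1 is simply the fraction of sampled solutions that pass, averaged over problems.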

What To Try In 7 Days

Run an open-source instruct-tuned model (e.g., Qwen2.5-Coder-Instruct or StarCoder2-Instruct) on a handful of internal unit-testable functions

Compare HuggingFace checkpoints vs an API model on your MBPP-like tasks and one repo-level task (use RepoEval or simple unit tests)

Integrate a retrieval step (docs + local repo) before generation and measure changes in failing tests and hallucinations
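
The retrieval step in the last item could start as simply as the sketch below: rank local snippets by token overlap with the task, then prepend the best matches to the prompt. This is a deliberately naive lexical baseline and not a method from the survey; the helper names (`tokenize`, `retrieve`, `build_prompt`), the Jaccard scoring, and the prompt layout are all assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase identifier-like tokens, kept as a set for overlap scoring."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query: str, corpus: dict[str, str], top_k: int = 2) -> list[str]:
    """Return up to top_k snippet names ranked by Jaccard overlap with the query."""
    query_tokens = tokenize(query)

    def score(name: str) -> float:
        doc_tokens = tokenize(corpus[name])
        union = query_tokens | doc_tokens
        return len(query_tokens & doc_tokens) / len(union) if union else 0.0

    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted(corpus, key=lambda name: (-score(name), name))
    return [name for name in ranked[:top_k] if score(name) > 0]


def build_prompt(task: str, corpus: dict[str, str], top_k: int = 2) -> str:
    """Prepend retrieved snippets to the task description (hypothetical layout)."""
    sections = [f"# Context from {name}\n{corpus[name]}"
                for name in retrieve(task, corpus, top_k)]
    sections.append(f"# Task\n{task}")
    return "\n\n".join(sections)
```

Running the same internal tasks with and without `build_prompt` gives a direct before/after comparison of failing tests. An embedding-based retriever can replace `retrieve` later without changing the rest of the pipeline.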

Agent Features

Memory

episodic memory (store reflections); short-term context memory

Planning

multi-step edit plans; task decomposition into subgoals

Tool Use

unit test executor; shell and bash tools; API/tool calling for build and run
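
A unit test executor of the kind listed above can be sketched as a function that writes a candidate solution plus its tests to a temporary file and runs them in a separate Python process with a timeout. This is a minimal illustration under stated assumptions, not the design of any framework named in the survey: `run_candidate` is a hypothetical name, and a subprocess is not a security sandbox, so untrusted model output should additionally run inside a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_code: str, test_code: str,
                  timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run candidate code followed by assert-style tests in a fresh process.

    Returns (passed, detail), where detail is the captured stderr or "timeout".
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n",
                          encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False, "timeout"
    # A zero exit code means no assertion or exception escaped the script.
    return proc.returncode == 0, proc.stderr.strip()
```

The returned stderr text is what an agent would feed back to the model as execution feedback on the next attempt.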

Frameworks

MetaGPT; AgentCoder; AutoGen; OpenDevin; AgentCoder multi-agent pipeline

Is Agentic

Yes

Architectures

single-agent; multi-agent; planner-based (CodePlan)

Collaboration

role-based agents (Product Manager, Architect, Engineer); agent communication and coordination

Optimization Features

Token Efficiency

context compression; prompt compression

Infra Optimization

specialized hardware (TPU/NPU); activation-parameter-efficient MoE deployment

Model Optimization

MoE; model distillation; pruning

System Optimization

LoRA; efficient GPU/TPU utilization

Training Optimization

instruction tuning; RL; data synthesis (Self-/Evol-/OSS-Instruct)

Inference Optimization

prompt engineering (CoT, Self-Refine, PoT); context selection and selective retrieval; long-context position encodings (RoPE, ALiBi)

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/juyongjiang/CodeLLMSurvey
https://github.com/bigcode-project/starcoder2-self-align

Data URLs

https://huggingface.co/datasets/bigcode/the-stack
https://huggingface.co/datasets/EleutherAI/pile
https://huggingface.co/datasets/transformersbook/codeparrot

Risks & Boundaries

Limitations

Function-level benchmarks like HumanEval are saturating and do not reflect repository- or system-level coding challenges

Synthetic instruction datasets can introduce bias and lack coverage of rare or domain-specific cases

When Not To Use

Generating code for security-critical applications without human review and formal verification

Handling highly domain-specific or low-resource languages where training data is scarce

Failure Modes

Hallucinated APIs or incorrect library calls that compile but are semantically wrong

Leaking private or personal data memorized from training corpora
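
Part of the first failure mode, hallucinated APIs, can be caught cheaply before any code runs. The sketch below parses generated Python, finds `module.attribute` references on plainly imported modules, and reports attributes that do not exist. It is an illustrative assumption rather than a technique taken from the survey, `find_missing_attributes` is a hypothetical name, and it only detects names that are absent: calls that exist but are used with the wrong semantics still need tests and review. It also imports the referenced modules, so it should only be pointed at trusted libraries.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Report `module.attr` references whose attribute does not exist.

    Only handles `import x` and `import x as y` for top-level module names.
    """
    tree = ast.parse(source)

    # Map local alias -> real module name for simple imports.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if "." not in alias.name:
                    aliases[alias.asname or alias.name] = alias.name

    missing: set[str] = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                # The module itself could not be imported.
                missing.add(module_name)
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)
```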

Core Entities

Models

GPT-4, ChatGPT/GPT-3.5, Claude-3.5, PaLM-Coder, Codex, StarCoder, StarCoder2, WizardCoder, DeepSeek-Coder-V2, Qwen2.5-Coder, Code Llama, CodeGemma, Magicoder, Codestral, Phi-1, CodeGen, AlphaCode

Metrics

pass@1, Accuracy, test-case average, CodeBLEU, perplexity

Datasets

The Stack, The Stack v2, GitHub (BigQuery), The Pile (code subset), CodeParrot, CodeSearchNet, ROOTS, CommitPackFT, CodeAlpaca-20K, Evol-Instruct-Code-80k, Magicoder-OSS-Instruct-75k, Self-OSS-Instruct-SC2-Exec-Filter-50k

Benchmarks

HumanEval, HumanEval+, MBPP, BigCodeBench, APPS, RepoEval, ClassEval, SWE-bench, MBXP, LiveCodeBench
