A practical, up-to-date survey of LLMs focused on generating code from natural language

June 1, 2024 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

54

Authors

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim


Why It Matters For Business

Code LLMs can speed up development, automate routine coding, and augment junior engineers. Open-source instruct-tuned models now match many closed APIs on standard tasks, which makes in-house deployment feasible. The same results highlight the need to evaluate models on real repository-scale work and under safety constraints.

Summary TLDR

This paper is a systematic, practice-oriented survey of Large Language Models used to generate source code from natural-language prompts. It reviews data sources and cleaning, pre-training and instruction-tuning, synthetic instruction generation, reinforcement learning from execution feedback, retrieval-augmented generation, repository-level and long-context methods, autonomous coding agents, and evaluation practices. It compiles benchmark numbers (HumanEval, MBPP, BigCodeBench), highlights that instruction tuning and synthetic data strongly move the needle, and argues that current function-level benchmarks are saturating—pushing the field toward repository-scale tasks and better safety/eval

Problem Statement

Code generation with LLMs is booming, but there is no up-to-date, focused literature review that covers data curation, instruction tuning, evaluation gaps, repository- and retrieval-level code generation, agentic systems, and safety implications. Practitioners need a single reference to compare models, data, and benchmarks and to identify which problems remain unsolved in realistic development.

Main Contribution

A taxonomy covering the full code-LLM lifecycle: data, pre-training, instruction tuning, feedback, retrieval, agents, and evaluation

A consolidated comparison of recent models on HumanEval, MBPP, and BigCodeBench with concrete pass@1 numbers

A focused review of data curation and synthetic instruction generation methods (Self-Instruct, Evol-Instruct, OSS-Instruct)

A discussion of practical gaps: repository-level generation, evaluation blind spots, safety, license/privacy risks, and environmental costs

A public GitHub resource page for ongoing updates

Key Findings

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB), as reported in the survey

Instruction tuning and synthetic instruction data substantially boost pass@1.

Numbers: Table 10 shows average HumanEval and MBPP (H+M) pass@1 gains of about 26% for Qwen2.5-Coder-Instruct and about 35% for StarCoder2-Instruct

Open-source code models now rival large closed models on standard benchmarks.

Numbers: DeepSeek-Coder-V2-Instruct 90.2% and Qwen2.5-Coder-Instruct 88.4% vs Claude-3.5-Sonnet 92% on HumanEval (Table 9)

Function-level benchmarks are saturating and miss real development challenges.

Numbers: Many models score >70%–90% pass@1 on HumanEval but still underperform on BigCodeBench and repo tasks; BigCodeBench top ~

Results

HumanEval pass@1

Value: GPT-4o, 91%

HumanEval pass@1

Value: DeepSeek-Coder-V2-Instruct, 90.2%

HumanEval pass@1

Value: Qwen2.5-Coder-Instruct (7B), 88.4%

Baseline: Qwen2.5-Coder (7B), 61.6%

MBPP pass@1

Value: Qwen2.5-Coder-Instruct (7B), 83.5%

Baseline: GPT-3.5-Turbo, 52.2%

BigCodeBench pass@1

Value: DeepSeek-Coder-V2-Instruct, 59.7%

Baseline: GPT-4o-0513, 61.1%

Who Should Care

What To Try In 7 Days

Run an open-source instruct-tuned model (e.g., Qwen2.5-Coder-Instruct or StarCoder2-Instruct) on a handful of internal unit-testable functions

Compare Hugging Face checkpoints against an API model on your MBPP-like tasks and on one repo-level task (use RepoEval or simple unit tests)

Integrate a retrieval step (docs + local repo) before generation and measure changes in failing tests and hallucinations
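The first two experiments need a way to decide whether a generated function passes its unit tests. The following is a minimal sketch of such a harness, not a method taken from the survey. The names `run_candidate` and `pass_rate` and the task dictionary layout are assumptions made for illustration, and the candidate strings stand in for whatever a model returns.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate followed by its tests exits with status 0."""
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(program, encoding="utf-8")
        try:
            # A child process isolates crashes and infinite loops from the harness.
            # It is NOT a security sandbox: run untrusted code in a container or VM.
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def pass_rate(tasks: list[dict]) -> float:
    """Fraction of tasks whose single candidate passes (an empirical pass@1)."""
    if not tasks:
        return 0.0
    passed = sum(run_candidate(t["candidate"], t["tests"]) for t in tasks)
    return passed / len(tasks)
```

Running the same task list through an open-source checkpoint and an API model, then comparing the two `pass_rate` values, gives a first like-for-like number on internal code.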

Agent Features

Memory

  • episodic memory (store reflections)
  • short-term context memory

Planning

  • multi-step edit plans
  • task decomposition into subgoals

Tool Use

  • unit test executor
  • shell and bash tools
  • API/tool calling for build and run

Frameworks

  • MetaGPT
  • AgentCoder
  • AutoGen
  • OpenDevin

Is Agentic

true

Architectures

  • single-agent
  • multi-agent
  • planner-based (CodePlan)

Collaboration

  • role-based agents (Product Manager, Architect, Engineer)
  • agent communication and coordination
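The memory, planning, and tool-use features listed above combine into a generate, test, reflect loop. The sketch below is a generic single-agent version under stated assumptions, not the algorithm of MetaGPT, AgentCoder, AutoGen, or OpenDevin. The `generate` callable is a hypothetical stand-in for a model call, and `run_tests` uses in-process `exec`, which is only acceptable for trusted code.

```python
from typing import Callable, Optional


def run_tests(code: str, tests: str) -> Optional[str]:
    """Tool use: run code and tests in a fresh namespace; return an error or None."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function(s)
        exec(tests, namespace)  # assertions raise on failure
    except Exception as exc:    # covers AssertionError and SyntaxError
        return f"{type(exc).__name__}: {exc}"
    return None


def refine_loop(
    task: str,
    tests: str,
    generate: Callable[[str, list], str],
    max_attempts: int = 3,
) -> tuple:
    """Retry generation, feeding earlier failures back as stored reflections."""
    reflections: list = []  # episodic memory carried across attempts
    for attempt in range(1, max_attempts + 1):
        code = generate(task, reflections)
        error = run_tests(code, tests)
        if error is None:
            return code, reflections
        reflections.append(f"attempt {attempt} failed with {error}")
    return None, reflections
```

A multi-agent design would split `generate`, `run_tests`, and the reflection step across separate roles, but the feedback signal (test execution results) stays the same.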

Optimization Features

Token Efficiency

  • context compression
  • prompt compression

Infra Optimization

  • specialized hardware (TPU/NPU)
  • MoE deployment that activates only a subset of parameters per token

Model Optimization

  • MoE
  • model distillation
  • pruning

System Optimization

  • LoRA
  • efficient GPU/TPU utilization

Training Optimization

  • instruction tuning
  • RL
  • data synthesis (Self-/Evol-/OSS-Instruct)

Inference Optimization

  • prompt engineering (CoT, Self-Refine, PoT)
  • context selection and selective retrieval
  • long-context position encodings (RoPE, ALiBi)
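Context selection and selective retrieval can be illustrated with a small lexical ranker that picks repository snippets for the prompt and skips retrieval when nothing overlaps. This is a sketch under simplifying assumptions: real systems typically use embeddings or BM25, and the names `tokenize`, `select_context`, and `build_prompt`, as well as the prompt layout, are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens; snake_case names split on underscores."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def select_context(query: str, snippets: dict, k: int = 2) -> list:
    """Return up to k snippet names ranked by Jaccard overlap with the query."""
    query_tokens = tokenize(query)
    scored = []
    for name, body in snippets.items():
        snippet_tokens = tokenize(name + " " + body)
        union = query_tokens | snippet_tokens
        score = len(query_tokens & snippet_tokens) / len(union) if union else 0.0
        if score > 0:  # selective: snippets with no overlap are never retrieved
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def build_prompt(query: str, snippets: dict, k: int = 2) -> str:
    """Prepend the selected snippets to the task description."""
    chosen = select_context(query, snippets, k)
    context = "\n\n".join(f"# file: {name}\n{snippets[name]}" for name in chosen)
    task = f"# task: {query}\n"
    return f"{context}\n\n{task}" if context else task
```

Keeping `k` small is one way to limit the token cost of retrieved context; whether it helps should be checked against failing tests rather than assumed.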

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Function-level benchmarks like HumanEval are saturating and do not reflect repository- or system-level coding challenges
  • Synthetic instruction datasets can introduce bias and lack coverage of rare or domain-specific cases
  • Privacy and memorization risks from raw crawled code remain unresolved without careful filtering
  • Repository-level generation still struggles with cross-file context, naming conventions, and long-range dependencies
  • LLM-based evaluation (LLM-as-a-judge) inherits biases and reasoning limits of the judge model

When Not To Use

  • Generating code for security-critical applications without human review and formal verification
  • Handling highly domain-specific or low-resource languages where training data is scarce
  • Large repository refactors without retrieval and careful testing
  • Situations demanding provable guarantees or strict license compliance

Failure Modes

  • Hallucinated APIs or incorrect library calls that compile but are semantically wrong
  • Leaking private or personal data memorized from training corpora
  • Generating insecure or vulnerable code patterns
  • Over-reliance on retrieved context that is noisy or out-of-date
  • Degradation on long, multi-file tasks due to context length limits
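One narrow slice of the hallucinated-API failure mode can be caught cheaply: references to attributes that do not exist on an imported module. The sketch below is an illustrative check, not a technique described in the survey, and `find_missing_attributes` is a hypothetical name. It only detects missing names; calls that exist but are used incorrectly still need tests and review. It also imports the referenced modules, so it should only be pointed at trusted, installed packages.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list:
    """List `module.attr` references whose attr is absent from the imported module."""
    tree = ast.parse(source)

    # Map each local name bound by a plain `import` to the module it refers to.
    aliases: dict = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    top_level = alias.name.split(".")[0]
                    aliases[top_level] = top_level

    missing = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.append(module_name)  # the module itself does not resolve
                continue
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return sorted(set(missing))
```

The check ignores `from x import y` forms and local variables that shadow a module name, so a clean result is weak evidence at best.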

Core Entities

Models

  • GPT-4
  • ChatGPT/GPT-3.5
  • Claude-3.5
  • PaLM-Coder
  • Codex
  • StarCoder
  • StarCoder2
  • WizardCoder
  • DeepSeek-Coder-V2
  • Qwen2.5-Coder
  • Code Llama
  • CodeGemma
  • Magicoder
  • Codestral
  • Phi-1
  • CodeGen
  • AlphaCode

Metrics

  • pass@1
  • Accuracy
  • test-case average
  • CodeBLEU
  • perplexity
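pass@1 is the k = 1 case of pass@k. A widely used way to compute pass@k is the unbiased estimator introduced alongside HumanEval: sample n completions per problem, count the c that pass all tests, and estimate 1 − C(n−c, k) / C(n, k). The sketch below implements that formula; the function names are illustrative, and the numbers reported in this summary may have been produced with different sampling settings (for example, a single greedy sample per problem).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the plain fraction of samples that pass.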

Datasets

  • The Stack
  • The Stack v2
  • GitHub (BigQuery)
  • The Pile (code subset)
  • CodeParrot
  • CodeSearchNet
  • ROOTS
  • CommitPackFT
  • CodeAlpaca-20K
  • Evol-Instruct-Code-80k
  • Magicoder-OSS-Instruct-75k
  • Self-OSS-Instruct-SC2-Exec-Filter-50k

Benchmarks

  • HumanEval
  • HumanEval+
  • MBPP
  • BigCodeBench
  • APPS
  • RepoEval
  • ClassEval
  • SWE-bench
  • MBXP
  • LiveCodeBench