Coding Benchmark Papers: Parsed and Scored for Practitioners

EvalPlus: auto-generated tests reveal up to ~29% lower pass rates and flaws in 11% of HumanEval's 'ground-truth' solutions

0.70

0.60

171

Small test suites give falsely high confidence in AI-generated code; automated, larger testing exposes real failure rates and helps select safer models for production.

Key finding

Automated augmentation increases the number of tests per task from single digits to hundreds.

Numbers: HumanEval avg tests 9.6 → HumanEval+ avg 764.1
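
To see why a larger suite lowers pass rates, here is a minimal sketch, not EvalPlus's actual pipeline: a buggy candidate is compared against a trusted reference, first on a few hand-written inputs and then on many generated ones. The functions `reference` and `candidate` and the exhaustive small-input generator are hypothetical, chosen only for illustration.

```python
from itertools import product


def reference(xs):
    """Trusted solution: largest element, or None for an empty list."""
    return max(xs) if xs else None


def candidate(xs):
    """Hypothetical model output with a subtle bug: it assumes values are non-negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best if xs else None


def passes(tests):
    """True if the candidate agrees with the reference on every input."""
    return all(candidate(t) == reference(t) for t in tests)


# A small hand-written suite that happens to miss the bug.
small_suite = [[1, 2, 3], [5], [], [2, 9, 4]]

# A larger generated suite: every list of length 0..3 over a few small values.
values = [-2, -1, 0, 1, 2]
large_suite = [list(p) for n in range(4) for p in product(values, repeat=n)]

small_ok = passes(small_suite)   # the bug goes unnoticed
large_ok = passes(large_suite)   # all-negative inputs such as [-1] expose it
```

The candidate looks correct on the four hand-written cases and fails once the generated inputs include lists with only negative numbers, which is the kind of gap a larger suite is meant to surface.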

LACA: use GPT-3.5 to speed deductive qualitative coding while checking reliability

0.60

0.50

0.70

61

LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.

Key finding

GPT-3.5 often matches human agreement on many coding tasks.

Numbers: Human-model Gwet's AC1 frequently ≥0.76; examples MAGA 0.98, MEDI 0.96
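
For readers who want to run the same validation on their own sample, the following is a minimal sketch of Gwet's AC1 for two raters, using the standard definition AC1 = (pa - pe) / (1 - pe). The label lists are invented, and treating the set of observed labels as the full category set is an assumption; pass `categories` explicitly if the coding scheme has labels that do not appear in the sample.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 agreement coefficient for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Assumption: if no category list is given, use the labels seen in the data.
    cats = list(categories) if categories is not None else sorted(set(rater_a) | set(rater_b))
    q = len(cats)
    if q < 2:
        return 1.0  # only one category in play, so the raters cannot disagree

    # Observed agreement: share of items given the same label by both raters.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: based on each category's average prevalence across raters.
    counts = Counter(rater_a) + Counter(rater_b)
    pe = sum((counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in cats) / (q - 1)

    return (pa - pe) / (1 - pe)


# Hypothetical sample: 1 = code applies, 0 = code does not apply.
human = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
model = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
ac1 = gwet_ac1(human, model)
```

On this invented sample the raters agree on 9 of 10 items, and AC1 comes out at roughly 0.87.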

A practical, up-to-date survey of LLMs focused on generating code from natural language

0.70

0.60

0.80

54

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Key finding

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB) as reported in the survey

Taxonomy and lightweight mitigation of hallucinations in LLM-generated code

0.50

0.60

30

Hallucinations in LLM-generated code often break functionality and raise debugging, maintenance, and security costs; detecting and reducing them yields outsized gains in correctness without retraining models.

Key finding

Code hallucinations occur often: 1,134 of 3,120 samples contained hallucinations.

Numbers: 1,134/3,120 samples; 1,212 hallucinatory snippets

Live, contamination-aware benchmark for code LLMs that tests generation, repair, execution, and test-output prediction

0.70

0.65

0.55

22

LiveCodeBench reveals real gaps between closed and open models and the presence of training-set leakage; use it to benchmark models on realistic, recent contest problems and avoid inflated performance claims from contaminated or small benchmarks.

Key finding

Some models show clear signs of contamination: DeepSeek and GPT-4o performance drops on problems released after their stated cutoff dates.

Numbers: DS-Base-33B: Pass@1 ~60 (May) → ~0 (Sep) on LeetCode
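
The contamination check described above can be approximated on any dated benchmark by splitting results at the model's stated cutoff. This is a minimal sketch with invented data; the `(release_date, passed)` record layout and the cutoff date are assumptions, not LiveCodeBench's own format.

```python
from datetime import date


def pass_rate_split(results, cutoff):
    """Return (pass rate on or before cutoff, pass rate after cutoff).

    `results` is a list of (release_date, passed) pairs; a side with no
    problems yields None instead of a rate.
    """
    before = [passed for released, passed in results if released <= cutoff]
    after = [passed for released, passed in results if released > cutoff]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(before), rate(after)


# Hypothetical per-problem outcomes for one model.
results = [
    (date(2023, 5, 2), True),
    (date(2023, 5, 20), True),
    (date(2023, 6, 11), True),
    (date(2023, 7, 3), False),
    (date(2023, 9, 5), False),
    (date(2023, 9, 18), True),
    (date(2023, 10, 1), False),
    (date(2023, 10, 9), False),
]
rate_before, rate_after = pass_rate_split(results, cutoff=date(2023, 8, 31))
```

A large gap between the two rates is a warning sign worth investigating rather than proof of leakage, since newer problems can also simply be harder.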

Make LLMs think in program structures to improve code generation

0.70

0.60

0.50

17

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Key finding

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64, i.e. +7.35 points absolute)
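
The +13.79% figure is a relative gain over the CoT baseline, not a difference in percentage points. A short check of the arithmetic, using only the two scores quoted above (the function name is illustrative):

```python
def relative_change(old, new):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


cot, scot = 53.29, 60.64
absolute_gain = round(scot - cot, 2)                   # gain in percentage points
relative_gain = round(relative_change(cot, scot), 2)   # gain relative to the CoT score
```

The absolute gain is 7.35 points, and dividing by the CoT score gives the 13.79% quoted in the entry.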

Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

0.60

0.50

16

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Key finding

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)
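
The idea of putting retrieved similar programs and test cases into the prompt can be sketched as follows. This is an illustration under stated assumptions, not AceCoder's implementation: the token-overlap retriever, the prompt layout, the tiny corpus, and all function names are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two short descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_prompt(requirement, corpus, tests, top_k=1):
    """Assemble a prompt from the most similar stored programs plus target tests.

    `corpus` is a list of (description, code) pairs.
    """
    ranked = sorted(corpus, key=lambda item: jaccard(requirement, item[0]), reverse=True)
    parts = []
    for description, code in ranked[:top_k]:
        parts.append(f"# Similar task: {description}\n{code}")
    parts.append(f"# Task: {requirement}")
    parts.append("# Tests the solution should pass:")
    parts.extend(f"# {t}" for t in tests)
    return "\n".join(parts)


# Hypothetical index of existing code.
corpus = [
    ("reverse a string", "def rev(s):\n    return s[::-1]"),
    ("sum a list of numbers", "def total(xs):\n    return sum(xs)"),
]
prompt = build_prompt(
    "sum the even numbers in a list",
    corpus,
    tests=["assert sum_even([1, 2, 4]) == 6"],
)
```

In practice the retriever would run over an indexed codebase and the resulting prompt would be sent to the model; both steps are outside this sketch.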

VerilogEval: an automated sandbox and 156-problem benchmark to test LLMs on Verilog code and to study synthetic fine-tuning

0.40

0.60

16

Automated functional tests let teams measure whether LLMs actually produce correct Verilog behavior; synthetic SFT can raise single-shot correctness and make open models competitive with closed APIs for small HDL tasks.

Key finding

Synthetic SFT boosts single-sample correctness (pass@1) on Verilog tasks.

Numbers: codegen-2B-verilog: pass@1 20.1% → 35.9% after SFT

CYBERSECEVAL: a broad benchmark measuring insecure code and malicious compliance in code-capable LLMs

0.70

0.80

0.70

15

Code-capable LLMs frequently suggest insecure code and may comply with malicious requests, so firms should test models automatically and add safety controls before deployment.

Key finding

Models produced vulnerable code a substantial fraction of the time.

Numbers: 30% of completions were vulnerable on CYBERSECEVAL tests

BigCodeBench: a 1,140-task Python benchmark testing multi-tool function calls and complex instructions

0.60

0.50

15

BigCodeBench reveals real gaps in LLMs for practical coding: models commonly misuse APIs, omit setup, and perform worse on concise human instructions, so production systems should include execution tests, human review, and domain-specialized models.

Key finding

Top model solves roughly 60% of tasks on structured docstrings.

Numbers: Pass@1 = 0.602 (GPT-4o, Complete)

Survey: how LLMs and LLM-based agents reshape software engineering workflows

0.60

0.45

0.60

15

LLM-based agents enable higher automation for multi-step engineering tasks (planning, tool use, testing) and often raise real pass rates and reduce human iteration; single LLMs remain cheaper for isolated code generation or simple analysis.

Key finding

Survey corpus and venue split: the review covers 139 papers, many of which are preprints.

Numbers: 139 papers; arXiv accounts for 40.3% of papers

Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slash…

0.70

0.60

0.80

13

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Key finding

AGENTLESS resolves 96 of 300 SWE-bench Lite problems

Numbers: 96/300 = 32.00%

RRTF trains a 15B code LLM by ranking test-and-teacher outputs; PanGu-Coder2 hits ~62% pass@1 on HumanEval

0.70

0.60

0.70

13

RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.

Key finding

PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.

Numbers: pass@1=61.64% (n=200 sampling); greedy pass@1=62.20%
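
Sampled pass@k figures like the n=200 number above are commonly computed with the unbiased estimator introduced alongside HumanEval, 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that estimator applies here; the per-problem counts are invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem, k):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical results: (samples drawn, samples that passed all tests).
per_problem = [(200, 200), (200, 100), (200, 0)]
score_at_1 = mean_pass_at_k(per_problem, k=1)
score_at_10 = mean_pass_at_k(per_problem, k=10)
```

For k=1 the estimator reduces to c/n, so sampled pass@1 is the average fraction of correct samples. Greedy pass@1 instead scores a single deterministic output per problem, which is why the two numbers can differ slightly.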

BinSum: a 557K-function benchmark showing when LLMs can (and cannot) summarize binary code

0.60

0.70

0.65

12

Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.

Key finding

Stripping debugging symbols dramatically reduces decompiled-code semantics.

Numbers: 55.0% drop in semantic similarity (0.449 → 0.202)

RTLLM: a 30-design benchmark for generating and evaluating RTL from natural-language, plus a 'self-planning' prompt trick

0.60

0.50

0.70

12

RTLLM gives a standardized, automated way to test LLMs for RTL generation including practical PPA outcomes; a cheap prompt tweak (self-planning) can cut error rates and save engineering time compared with manual fixes.

Key finding

GPT-4 produced syntactically valid RTL in 81% of trials and passed functionality on 15 of 30 designs.

Numbers: syntax 81%; functionality 15/30

RepoBench: an evaluation suite for retrieval, next-line completion, and full pipelines on multi-file code

0.60

0.50

0.60

11

RepoBench measures retrieval plus completion across multiple files, which reflects real engineering workflows and helps teams pick retrievers, prompt formats, and models that actually improve developer productivity.

Key finding

Semantic retriever UniXcoder substantially improves retrieval accuracy over random and lexical baselines.

Numbers: UniXcoder acc@1 27.02 vs random 15.72 (Easy XF-F, Python)
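
The acc@1 metric above counts how often the retriever's top-ranked snippet is the gold one. A minimal sketch of accuracy@k follows; the snippet identifiers and rankings are invented, and the exact candidate format used by RepoBench is not assumed.

```python
def accuracy_at_k(ranked_candidates, gold, k=1):
    """Percentage of queries whose gold snippet appears in the top-k ranking."""
    if len(ranked_candidates) != len(gold) or not gold:
        raise ValueError("need one ranking per gold label")
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_candidates, gold))
    return 100.0 * hits / len(gold)


# Hypothetical rankings: each inner list orders candidate snippet ids, best first.
rankings = [
    ["utils.py:12", "models.py:40", "io.py:7"],
    ["io.py:7", "utils.py:12", "models.py:40"],
    ["models.py:40", "io.py:7", "utils.py:12"],
    ["utils.py:12", "io.py:7", "models.py:40"],
]
gold = ["utils.py:12", "utils.py:12", "io.py:7", "models.py:40"]

acc1 = accuracy_at_k(rankings, gold, k=1)
acc2 = accuracy_at_k(rankings, gold, k=2)
```

A random baseline's acc@1 depends on how many candidates each query has, so it is only meaningful when compared against retrievers evaluated on the same split.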

xCodeEval: a 7-task, execution-first benchmark with millions of runnable, multilingual code examples

0.60

0.70

10

xCodeEval measures real functional correctness across many languages. Use it to benchmark developer-facing code assistants, choose runtimes for evaluation, and avoid over-relying on lexical metrics that miss runtime failures.

Key finding

xCodeEval is large and multilingual.

Numbers: 25M samples; 16.5B tokens; ~7.5K problems; up to 17 languages (Table 8, Abstract)

CrossCodeEval: 10k multilingual examples that force models to read other files to complete code

0.70

0.60

10

Real-world code completions often need other files; adding a retrieval step can roughly double correct completions and should be part of any practical code-assist product pipeline.

Key finding

Off-the-shelf models fail on cross-file examples when only given the current file.

Numbers: StarCoder-15.5B Python EM 8.82% (in-file only)
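
EM here is exact match: a completion counts only if it equals the reference line. The sketch below shows one simple way to score it. Stripping surrounding whitespace before comparing is an assumption and may differ from the benchmark's own normalization; the example lines are invented.

```python
def exact_match_rate(predictions, references):
    """Percentage of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need one prediction per reference")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)


# Hypothetical next-line completions and their ground-truth lines.
predictions = ["return cfg.load(path)", "x = helper(a)", "print(x) "]
references = ["return cfg.load(path)", "x = helper(a, b)", "print(x)"]
em = exact_match_rate(predictions, references)
```

The second prediction misses an argument that would only be visible in another file, which is the kind of failure an in-file-only setting tends to produce.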

Find code security bugs while the developer types using transformer models

0.75

0.45

0.70

7

Catching vulnerabilities while code is being written shortens fix time and cost; the paper shows large reductions in vulnerable completions from code LMs and near-90% reduction in production JS edits when integrated into an editor.

Key finding

Fine-tuned CodeBERT (DeepDevVuln) has the best F1 balance on their GitHub PR test set.

Numbers: Precision 58.87%, Recall 63.00%, F1 60.87% (Table 3)
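
The reported F1 follows from the precision and recall as their harmonic mean, F1 = 2PR / (P + R). A quick check using only the numbers quoted above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(58.87, 63.00)
f1_rounded = round(f1, 2)
```

The result rounds to 60.87, matching the reported F1.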

Sallm: an automated benchmark, dataset, and metrics for measuring code-security of LLMs

0.60

0.65

0.45

6

Generated code can be functionally correct but insecure; automated security benchmarks reveal trade-offs between correctness and security so teams can pick models and pipelines that match risk tolerance.

Key finding

Repair component greatly increases executability of model outputs.

Numbers: compilation rate from 15% to 75% (avg); GPT-4 from <1% to 89%

LLMs often produce executable but unsafe Java code: GPT-4 had ~62% API misuse on StackOverflow-style questions

0.30

0.50

0.60

6

LLM snippets can run but still be unsafe: unchecked API misuse can cause crashes, leaks, or data loss if pushed to production, so teams must verify LLM code before deployment.

Key finding

Most compilable LLM answers contain API misuses.

Numbers: 57–70% misuse among compilable answers (evaluated models, zero/one-shot)

CodeS: open-source 1B–15B models that match or beat much larger LLMs on text-to-SQL benchmarks

0.80

0.60

0.75

5

CodeS offers near-SOTA text-to-SQL accuracy with far smaller, open models that cut inference cost and preserve data privacy; use a 7B model for fast local deployment.

Key finding

Incremental SQL-centric pre-training substantially improves SQL generation compared to base StarCoder.

Numbers: CodeS-15B 5-shot Spider TS 73.4% vs StarCoder-15B 70.0% (Table 4)

A realistic, evolving benchmark for repository-level code generation drawn from recent GitHub projects

0.30

0.60

0.50

5

EvoCodeBench reveals that state-of-the-art LLMs often fail on real repository tasks; test on repo-aligned data and include local contexts to avoid bad deployment surprises.

Key finding

EvoCodeBench-2403 size and distribution match recent repositories.

Numbers: 275 samples, 25 repos; standalone 27% / non-standalone 73%; avg dependencies 3.46

EFFIBENCH: a 1,000-problem benchmark that measures runtime and memory of LLM-generated Python solutions

0.80

0.65

0.75

4

Model-generated code can be functionally correct yet significantly slower and more memory-hungry than optimized human solutions; in production this means higher cloud bills, worse latency in time-sensitive services, and a larger energy footprint.

Key finding

Model-generated code is usually slower than optimized human solutions.

Numbers: GPT-4 average NET = 3.12x (generated time / canonical time)
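
NET, as glossed above, is the generated solution's runtime divided by the canonical solution's runtime. The sketch below averages that ratio across problems. The per-problem averaging and the timing values are assumptions for illustration, not EFFIBENCH's measurement harness.

```python
def normalized_execution_time(generated_seconds, canonical_seconds):
    """Mean of per-problem ratios generated_time / canonical_time."""
    if len(generated_seconds) != len(canonical_seconds) or not canonical_seconds:
        raise ValueError("need one timing pair per problem")
    ratios = [g / c for g, c in zip(generated_seconds, canonical_seconds)]
    return sum(ratios) / len(ratios)


# Hypothetical wall-clock timings in seconds for three problems.
generated = [3.0, 6.0, 1.0]
canonical = [1.0, 2.0, 0.5]
net = normalized_execution_time(generated, canonical)
```

A value above 1 means the generated code is slower than the canonical solution on average; the invented timings here give a ratio of about 2.67.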