Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

Overview

Decision Snapshot: Ready For Pilot

Experiments use three open LLMs and three public benchmarks, with execution-based metrics and human evaluation. Results are strong on those datasets but depend on having a relevant retrieval corpus and on the chosen sampling settings and hyperparameters.

Citations: 16

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, Zhi Jin

Why It Matters For Business

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

AceCoder is a prompting technique for code-generation LLMs that (1) retrieves similar code examples and selects non-redundant ones, and (2) asks the model to emit an intermediate software artifact (by default: test cases) before the final code. On three open benchmarks (Python, Java, JavaScript) and three open LLMs, AceCoder raises execution-based accuracy (Pass@k) substantially versus few-shot and retrieval baselines. Human reviewers also rate its outputs as more correct and maintainable. The method is lightweight (no fine-tuning) but depends on having a retrieval corpus with relevant examples.
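
For readers unfamiliar with Pass@k, the sketch below shows the unbiased estimator commonly used for execution-based evaluation: draw n samples per task, count the c samples that pass every test, and estimate the chance that at least one of k samples is correct. This summary does not state which estimator the authors used, so treat the choice of estimator and the function name `pass_at_k` as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task (assumed estimator, not confirmed by the summary).

    n: number of programs sampled for the task
    c: number of those programs that pass all tests
    k: number of samples the user is allowed to try
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, any k-subset must contain a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all tasks. With k = 1 the estimate reduces to c / n, the fraction of samples that pass.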

Problem Statement

Current prompting methods were built for natural language and underperform on code. Code generation needs two things: clear requirement understanding (what to write) and useful implementation examples (how to write). Off-the-shelf few-shot and chain-of-thought prompts miss one or both.

Main Contribution

A prompting pipeline (AceCoder) combining example retrieval, an example selector, and a prompt analyzer that injects an intermediate preliminary (e.g., test cases) into each example.

Guided code generation: force the model to first emit a preliminary (test cases) to clarify inputs/outputs and edge cases before producing code.
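
The guided-generation idea can be sketched as a prompt builder: each retrieved example is rendered as a (requirement, test cases, code) triple, and the target requirement is left open at the test-case slot so the model writes tests before code. The `Example` class, the `build_prompt` function, and the `# Requirement:` / `# Test cases:` / `# Code:` markers are hypothetical; the summary does not give the authors' exact template.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One retrieved example triple (hypothetical structure)."""
    requirement: str
    test_cases: list[str]
    code: str


def build_prompt(examples: list[Example], requirement: str) -> str:
    """Render example triples, then the new requirement with an open test-case slot."""
    parts = []
    for ex in examples:
        tests = "\n".join(ex.test_cases)
        parts.append(
            f"# Requirement:\n{ex.requirement}\n"
            f"# Test cases:\n{tests}\n"
            f"# Code:\n{ex.code}\n"
        )
    # End at the test-case marker so the model emits the preliminary first,
    # then (following the pattern of the examples) the code.
    parts.append(f"# Requirement:\n{requirement}\n# Test cases:\n")
    return "\n".join(parts)
```

Ending the prompt at the test-case marker is one simple way to steer the model toward the preliminary-then-code order; an instruction-tuned model might instead be told this explicitly.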

Key Findings

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 relative gains of up to +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)

Practical Use: If you can supply a retrieval corpus, add AceCoder-style retrieved examples + generated test-case preliminaries to prompts to substantially raise single-output correctness on these benchmarks.

Evidence Ref: Table 2, RQ1

AceCoder improves over retrieval-based prompt baselines (which just insert similar code).

Numbers: Pass@1 up to +13.1% (MBPP); +23.44% (MBJP); +15.8% (MBJSP) vs Jigsaw

Practical Use: Use a selector and analyzer (not just raw retrieval) to filter redundancy and add test-case guidance for better results.

Evidence Ref: Table 3, RQ2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pass@1 (AceCoder vs few-shot)	MBPP: 26.74% vs 20.40% (CodeGeeX-13B)	few-shot prompting	+31.1% relative	MBPP (Python)	Table 2 (CodeGeeX-13B row)	Table 2
Pass@1 (AceCoder vs few-shot)	MBJP: 28.38% vs 16.63% (CodeGeeX-13B)	few-shot prompting	+70.7% relative	MBJP (Java)	Table 2 (CodeGeeX-13B row)	Table 2
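
The Delta column reports relative improvement, not percentage-point difference. As a quick check of the two rows above, the relative gain is (new - baseline) / baseline; the helper name `relative_gain` is only illustrative.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# CodeGeeX-13B rows from the table above (Pass@1, in percent).
print(round(relative_gain(26.74, 20.40), 1))  # MBPP -> 31.1
print(round(relative_gain(28.38, 16.63), 1))  # MBJP -> 70.7
```

In absolute terms these correspond to gains of 6.34 and 11.75 percentage points of Pass@1, respectively.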

What To Try In 7 Days

Index your codebase or public snippets with BM25/Lucene and use problem text as queries.

Implement a simple selector (n-gram overlap + decay) to pick 2–3 non-redundant examples.

Add example triples with test-case preliminaries to your prompt template and compare Pass@1 on a small set of tasks.
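
The first two steps above can be prototyped without any search infrastructure. The sketch below is a minimal, dependency-free stand-in: `bm25_rank` scores documents with the standard Okapi BM25 formula (Lucene-style IDF), and `select_examples` is one plausible reading of "n-gram overlap + decay", using unigrams and a greedy loop that down-weights query tokens already covered by a chosen example. The function names, whitespace tokenizer, and the `k1`, `b`, and `decay` defaults are all assumptions, not the authors' settings.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; a real index would use a proper analyzer.
    return text.lower().split()


def bm25_rank(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[int]:
    """Return document indices sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))

    def score(tokens: list[str]) -> float:
        tf = Counter(tokens)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    scores = [score(t) for t in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def select_examples(query: str, candidates: list[str], k: int = 3, decay: float = 0.5) -> list[int]:
    """Greedily pick up to k candidates, decaying query tokens once they are covered."""
    weights = {tok: 1.0 for tok in set(tokenize(query))}
    chosen: list[int] = []
    remaining = list(range(len(candidates)))

    def gain(i: int) -> float:
        # Weighted count of query tokens that candidate i contains.
        return sum(weights.get(tok, 0.0) for tok in set(tokenize(candidates[i])))

    while remaining and len(chosen) < k:
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
        # Tokens covered by the chosen example contribute less from now on.
        for tok in set(tokenize(candidates[best])):
            if tok in weights:
                weights[tok] *= decay
    return chosen
```

A typical flow would take the top 20 or so indices from `bm25_rank`, pass those documents to `select_examples`, and feed the 2–3 survivors into the prompt template. The decay step is what keeps a near-duplicate of an already chosen example from being picked ahead of an example that covers a different part of the requirement.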

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

MBPP, MBJP, MBJSP

Risks & Boundaries

Limitations

Requires a retrieval corpus with relevant examples; performance falls back to few-shot when retrieval fails.

Relies on datasets that include test cases; extracting preliminaries from arbitrary codebases may be noisy.

When Not To Use

You lack a searchable corpus of relevant code or cannot extract test cases/preliminaries.

Prompt length or inference budget prevents adding multiple example triples.

Failure Modes

Retrieved examples are irrelevant or misleading, producing wrong guidance.

Selector criteria could omit rare but critical examples if n-gram overlap misses semantic similarity.

Core Entities

Models

CodeGeeX-13B, CodeGen-6B, InCoder-6B, Codex

Metrics

Pass@1, Pass@3, Pass@5

Datasets

MBPP, MBJP, MBJSP

Benchmarks

Pass@k

Context Entities

Models

REDCODER, Jigsaw, PLBART

Metrics

ROUGE-N, BLEU, BM25 score

Datasets

HumanEval, APPS, CodeContest

Benchmarks

execution-based evaluation
