Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

March 31, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

Experiments use three open LLMs and three public benchmarks with execution-based metrics and human evaluation; results are strong on those datasets but depend on a relevant retrieval corpus and on the chosen sampling settings and hyperparameters.

Citations: 16

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, Zhi Jin

Links

Abstract / PDF / Data

Why It Matters For Business

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Who Should Care

Summary TLDR

AceCoder is a prompting technique for code-generation LLMs that (1) retrieves similar code examples and selects non-redundant ones, and (2) asks the model to emit an intermediate software artifact (by default: test cases) before the final code. On three open benchmarks (Python, Java, JavaScript) and three open LLMs, AceCoder raises execution-based accuracy (Pass@k) substantially versus few-shot and retrieval baselines. Human reviewers also rate its outputs as more correct and maintainable. The method is lightweight (no fine-tuning) but depends on having a retrieval corpus with relevant examples.

Problem Statement

Current prompting methods were built for natural language and underperform on code. Code generation needs two things: clear requirement understanding (what to write) and useful implementation examples (how to write). Off-the-shelf few-shot and chain-of-thought prompts miss one or both.

Main Contribution

A prompting pipeline (AceCoder) combining example retrieval, an example selector, and a prompt analyzer that injects an intermediate preliminary (e.g., test cases) into each example.

Guided code generation: force the model to first emit a preliminary (test cases) to clarify inputs/outputs and edge cases before producing code.
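The guided-generation idea can be sketched as a prompt builder. This is a minimal illustration, not the authors' template: the `Example` class, the `build_prompt` function, and the `# Requirement:` / `# Test cases:` / `# Code:` section labels are all hypothetical names chosen here. The only point taken from the text is that each example is a triple (requirement, test cases, code) and that the model is steered to write test cases before code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One retrieved example as a triple (hypothetical container)."""
    requirement: str
    tests: List[str]
    code: str


def build_prompt(examples: List[Example], new_requirement: str) -> str:
    """Concatenate example triples, then leave the new task open at the tests."""
    parts = []
    for ex in examples:
        parts.append(
            "# Requirement:\n" + ex.requirement + "\n"
            "# Test cases:\n" + "\n".join(ex.tests) + "\n"
            "# Code:\n" + ex.code + "\n"
        )
    # The prompt stops at the test-case header, so the model is nudged to
    # emit test cases first and the final code only afterwards.
    parts.append("# Requirement:\n" + new_requirement + "\n# Test cases:\n")
    return "\n".join(parts)


demo = Example(
    requirement="Write a function that returns the square of n.",
    tests=["assert square(2) == 4", "assert square(-3) == 9"],
    code="def square(n):\n    return n * n",
)
prompt = build_prompt([demo], "Write a function that returns the cube of n.")
print(prompt)
```

Ending the prompt on the test-case header is one simple way to force the ordering; a real setup would also need a stop sequence or post-processing to separate the generated tests from the generated code.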

Key Findings

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 up to +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP), relative to few-shot prompting

Practical Use: If you can supply a retrieval corpus, add AceCoder-style retrieved examples plus generated test-case preliminaries to prompts to substantially raise single-output correctness on these benchmarks.

Evidence Ref: Table 2, RQ1
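For readers unfamiliar with Pass@k, the sketch below shows one widely used way to compute it: the unbiased estimator popularized by the HumanEval evaluation, 1 - C(n-c, k) / C(n, k), where n programs are sampled per task and c of them pass all tests. Whether the paper uses exactly this estimator is an assumption here; the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task from n samples, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average per-task Pass@k; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three tasks, 10 samples each.
counts = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(counts, k=1))
print(benchmark_pass_at_k(counts, k=5))
```

With k = 1 the estimator reduces to the fraction of passing samples per task, which is why Pass@1 is the strictest of the reported metrics.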

AceCoder improves over retrieval-based prompt baselines (which just insert similar code).

Numbers: Pass@1 up to +13.1% (MBPP); +23.44% (MBJP); +15.8% (MBJSP) vs Jigsaw

Practical Use: Use a selector and analyzer (not just raw retrieval) to filter redundancy and add test-case guidance for better results.

Evidence Ref: Table 3, RQ2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Pass@1 (AceCoder vs few-shot) | MBPP: 26.74% vs 20.40% (CodeGeeX-13B) | few-shot prompting | +31.1% relative | MBPP (Python) | Table 2 (CodeGeeX-13B row) | Table 2
Pass@1 (AceCoder vs few-shot) | MBJP: 28.38% vs 16.63% (CodeGeeX-13B) | few-shot prompting | +70.7% relative | MBJP (Java) | Table 2 (CodeGeeX-13B row) | Table 2
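The deltas in these rows are relative gains over the baseline, not percentage-point differences. A quick arithmetic check using only the Pass@1 values quoted above (the `relative_gain` helper is an illustrative name):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0


mbpp = relative_gain(26.74, 20.40)  # CodeGeeX-13B on MBPP
mbjp = relative_gain(28.38, 16.63)  # CodeGeeX-13B on MBJP
print(f"MBPP: +{mbpp:.1f}%  MBJP: +{mbjp:.1f}%")
# prints "MBPP: +31.1%  MBJP: +70.7%"
```

In absolute terms the same rows correspond to gains of 6.34 and 11.75 percentage points, so it matters which convention a comparison uses.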

What To Try In 7 Days

Index your codebase or public snippets with BM25/Lucene and use problem text as queries.

Implement a simple selector (n-gram overlap + decay) to pick 2–3 non-redundant examples.

Add example triples with test-case preliminaries to your prompt template and compare Pass@1 on a small set of tasks.
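The first two steps can be prototyped in a few dozen lines before committing to Lucene. The sketch below is an assumption-laden stand-in, not the paper's implementation: it uses a small in-memory BM25 (Lucene-style IDF, with `k1` and `b` set to common defaults), whitespace tokenization, bigram overlap, and a hypothetical decay factor of 0.1. All function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return doc indices sorted by BM25 score against the query, best first."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))  # document frequency
    q_terms = sorted(set(tokenize(query)))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def ngrams(tokens, n=2):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_examples(query, candidates, k=2, n=2, decay=0.1):
    """Greedily pick k candidates by weighted n-gram overlap with the query.

    After each pick, the weights of the query n-grams it covered are
    multiplied by `decay`, so near-duplicates of it score lower next round.
    """
    weights = {g: 1.0 for g in ngrams(tokenize(query), n)}
    cand_grams = [ngrams(tokenize(c), n) for c in candidates]
    remaining = list(range(len(candidates)))
    chosen = []

    def gain(i):
        return sum(weights[g] for g in cand_grams[i] if g in weights)

    while remaining and len(chosen) < k:
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
        for g in cand_grams[best]:
            if g in weights:
                weights[g] *= decay
    return chosen


corpus = [
    "sort a list of numbers",
    "sort a list of numbers in place",
    "print the result of a query",
    "reverse a string",
]
query = "sort a list of numbers then print the result"
ranked = bm25_rank(query, corpus)
top = [corpus[i] for i in ranked[:3]]  # retrieve 3 candidates
picked = [top[i] for i in select_examples(query, top, k=2)]
print(picked)
```

On this toy corpus the selector keeps "sort a list of numbers" and then prefers "print the result of a query" over the near-duplicate "sort a list of numbers in place", because the bigrams the duplicate shares with the query have already been decayed. As the failure modes below note, surface n-gram overlap can still miss examples that are semantically close but worded differently.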

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

MBPP, MBJP, MBJSP

Risks & Boundaries

Limitations

Requires a retrieval corpus with relevant examples; performance falls back to few-shot when retrieval fails.

Relies on datasets that include test cases; extracting preliminaries from arbitrary codebases may be noisy.

When Not To Use

You lack a searchable corpus of relevant code or cannot extract test cases or other preliminaries from it.

Prompt length or inference budget prevents adding multiple example triples.

Failure Modes

Retrieved examples are irrelevant or misleading, producing wrong guidance.

Selector criteria could omit rare but critical examples if n-gram overlap misses semantic similarity.

Core Entities

Models

CodeGeeX-13B, CodeGen-6B, InCoder-6B, Codex

Metrics

Pass@1, Pass@3, Pass@5

Datasets

MBPP, MBJP, MBJSP

Benchmarks

Pass@k

Context Entities

Models

REDCODER, Jigsaw, PLBART

Metrics

ROUGE-N, BLEU, BM25 score

Datasets

HumanEval, APPS, CodeContest

Benchmarks

execution-based evaluation