Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

March 31, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

16

Authors

Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, Zhi Jin

Links

Abstract / PDF

Why It Matters For Business

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Summary TLDR

AceCoder is a prompting technique for code-generation LLMs that (1) retrieves similar code examples and selects non-redundant ones, and (2) asks the model to emit an intermediate software artifact (by default: test cases) before the final code. On three open benchmarks (Python, Java, JavaScript) and three open LLMs, AceCoder raises execution-based accuracy (Pass@k) substantially versus few-shot and retrieval baselines. Human reviewers also rate its outputs as more correct and maintainable. The method is lightweight (no fine-tuning) but depends on having a retrieval corpus with relevant examples.

Problem Statement

Current prompting methods were built for natural language and underperform on code. Code generation needs two things: clear requirement understanding (what to write) and useful implementation examples (how to write). Off-the-shelf few-shot and chain-of-thought prompts miss one or both.

Main Contribution

A prompting pipeline (AceCoder) combining example retrieval, an example selector, and a prompt analyzer that injects an intermediate preliminary (e.g., test cases) into each example.

Guided code generation: force the model to first emit a preliminary (test cases) to clarify inputs/outputs and edge cases before producing code.

An n-gram-based selector that de-duplicates retrieved examples so prompts contain informative, non-redundant programs.

Extensive evaluation on three open LLMs and three public benchmarks showing large gains in Pass@k and favorable human ratings.
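The guided-generation idea above can be sketched as a prompt builder. This is a minimal illustration, not the authors' template: the `Example` class, the `build_prompt` function, and the `# Requirement:` / `# Test cases:` / `# Code:` section labels are all hypothetical names chosen for this sketch. Each retrieved example is rendered as a (requirement, test cases, code) triple, and the prompt ends on an open test-case section so the model writes the preliminary before the code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One retrieved example, stored as a triple (hypothetical structure)."""
    requirement: str
    tests: str  # the preliminary, here test cases
    code: str


def build_prompt(examples, new_requirement):
    """Render example triples, then leave the test-case section open."""
    parts = []
    for ex in examples:
        parts.append(
            f"# Requirement:\n{ex.requirement}\n"
            f"# Test cases:\n{ex.tests}\n"
            f"# Code:\n{ex.code}\n"
        )
    # Ending on the test-case header nudges the model to emit tests first.
    parts.append(f"# Requirement:\n{new_requirement}\n# Test cases:\n")
    return "\n".join(parts)


demo = [
    Example(
        requirement="Return the larger of two numbers.",
        tests="assert larger(1, 2) == 2",
        code="def larger(a, b):\n    return a if a > b else b",
    )
]
prompt = build_prompt(demo, "Return the smaller of two numbers.")
```

The model's completion would then be expected to contain test cases followed by a code section, which a caller can split on the `# Code:` label (again, a label assumed only for this sketch).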

Key Findings

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)

AceCoder improves over retrieval-based prompt baselines (which just insert similar code).

Numbers: Pass@1 up to +13.1% (MBPP); +23.44% (MBJP); +15.8% (MBJSP) vs Jigsaw

Human developers prefer AceCoder outputs across correctness, code smell, and maintainability.

Numbers: Correctness +61.8%; Code smell +33.7%; Maintainability +13.8%

Each module (retriever, selector, analyzer) contributes; removing any one of them in the ablation reduces performance.

Numbers: Ablation: retriever adds +17.6% Pass@1 (MBPP); full pipeline adds +31.1% (MBPP)

Results

Pass@1 (AceCoder vs few-shot)

Value: MBPP 26.74% vs 20.40% (CodeGeeX-13B)

Baseline: few-shot prompting

Pass@1 (AceCoder vs few-shot)

Value: MBJP 28.38% vs 16.63% (CodeGeeX-13B)

Baseline: few-shot prompting

Pass@1 (AceCoder vs few-shot)

Value: MBJSP 21.03% vs 11.16% (CodeGeeX-13B)

Baseline: few-shot prompting
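The percentage gains quoted under Key Findings appear to be relative improvements rather than percentage-point differences. The short check below recomputes them from the CodeGeeX-13B rows above; `relative_gain` is a helper name introduced only for this sketch.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# (AceCoder Pass@1, few-shot Pass@1) for CodeGeeX-13B, from the rows above.
results = {
    "MBPP": (26.74, 20.40),
    "MBJP": (28.38, 16.63),
    "MBJSP": (21.03, 11.16),
}

gains = {name: round(relative_gain(a, b), 1) for name, (a, b) in results.items()}
# gains == {"MBPP": 31.1, "MBJP": 70.7, "MBJSP": 88.4}
```

The MBJP and MBJSP values match the +70.7% and +88.4% figures, and the MBPP value matches the +31.1% full-pipeline ablation figure. The +56.4% MBPP figure does not follow from the CodeGeeX-13B row, so it presumably comes from one of the other evaluated models; this summary does not say which.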

Who Should Care

What To Try In 7 Days

Index your codebase or public snippets with BM25/Lucene and use problem text as queries.

Implement a simple selector (n-gram overlap + decay) to pick 2–3 non-redundant examples.

Add example triples with test-case preliminaries to your prompt template and compare Pass@1 on a small set of tasks.
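The first two steps can be prototyped without any external service. The sketch below is a simplification under stated assumptions: it uses a small pure-Python BM25 (with the Lucene-style IDF) in place of a real Lucene index, whitespace tokenization, and a greedy selector whose decay rule is this sketch's reading of "n-gram overlap + decay", not the paper's exact algorithm. The function names, the `decay=0.1` default, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def ngrams(tokens, n_max=2):
    """All n-grams of length 1..n_max, as a set of tuples."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_examples(query, candidates, k=3, decay=0.1, n_max=2):
    """Greedily pick k candidates by weighted n-gram overlap with the query.

    After each pick, the weights of the query n-grams it covered are
    multiplied by `decay`, so near-duplicates of it score low afterwards.
    """
    weights = {g: 1.0 for g in ngrams(tokenize(query), n_max)}
    cand_grams = [ngrams(tokenize(c), n_max) for c in candidates]
    remaining = list(range(len(candidates)))
    chosen = []
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda i: sum(w for g, w in weights.items() if g in cand_grams[i]),
        )
        chosen.append(best)
        remaining.remove(best)
        for g in cand_grams[best]:
            if g in weights:
                weights[g] *= decay
    return chosen


# Toy corpus of requirement texts (hypothetical); entries 0 and 1 are near-duplicates.
corpus = [
    "write a function to sort a list of numbers",
    "write a function to sort a list of integers",
    "write a function to reverse a string",
    "find the maximum value in a dictionary",
]
query = "write a function to sort a list of strings and reverse it"

top = bm25_rank(query, corpus)[:3]  # retrieve candidates
picked = select_examples(query, [corpus[i] for i in top], k=2)
examples = [corpus[top[i]] for i in picked]  # keeps one sort example plus the reverse one
```

In this toy run the selector keeps one of the two near-identical sorting examples and then prefers the string-reversal example, because the n-grams shared by the sorting examples have already been discounted. In practice the index would hold (requirement, test cases, code) triples keyed by requirement text, so each selected entry can be dropped straight into the prompt template.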

Reproducibility

Data Urls

  • MBPP
  • MBJP
  • MBJSP

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires a retrieval corpus with relevant examples; performance falls back to few-shot when retrieval fails.
  • Relies on datasets that include test cases; extracting preliminaries from arbitrary codebases may be noisy.
  • Evaluations use open LLMs (6B–13B); closed models (e.g., current API services) were not tested here.

When Not To Use

  • You lack a searchable corpus of relevant code or cannot extract test cases or other preliminaries.
  • Prompt length or inference budget prevents adding multiple example triples.
  • You need guarantees beyond benchmark-style test cases (e.g., full formal verification).

Failure Modes

  • Retrieved examples are irrelevant or misleading, producing wrong guidance.
  • Selector criteria could omit rare but critical examples if n-gram overlap misses semantic similarity.
  • Generated preliminaries (test cases) can be incorrect or incomplete, leading the model astray.

Core Entities

Models

  • CodeGeeX-13B
  • CodeGen-6B
  • InCoder-6B
  • Codex

Metrics

  • Pass@1
  • Pass@3
  • Pass@5
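For reference, Pass@k is commonly computed with the unbiased estimator popularized by the Codex evaluation: generate n samples per task, count the c that pass all tests, and estimate the probability that at least one of k drawn samples passes. This summary does not state which estimator the paper used, so treat the sketch below as the standard formulation rather than a description of the authors' evaluation script.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a task, c: samples that pass, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# One task with 5 samples, 1 of which passes:
p1 = pass_at_k(5, 1, 1)  # 0.2
p3 = pass_at_k(5, 1, 3)  # 0.6
p5 = pass_at_k(5, 1, 5)  # 1.0
```

A benchmark-level score is the mean of this per-task value over all tasks.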

Datasets

  • MBPP
  • MBJP
  • MBJSP

Benchmarks

  • Pass@k

Context Entities

Models

  • REDCODER
  • Jigsaw
  • PLBART

Metrics

  • ROUGE-N
  • BLEU
  • BM25 score

Datasets

  • HumanEval
  • APPS
  • CodeContests

Benchmarks

  • execution-based evaluation