Overview
Experiments use three open LLMs and three public benchmarks with execution-based metrics and human evaluation; results are strong on those datasets but depend on a relevant retrieval corpus and on the chosen sampling settings and hyperparameters.
Citations: 16
Evidence Strength: 0.85
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.
Summary TLDR
AceCoder is a prompting technique for code-generation LLMs that (1) retrieves similar code examples and selects non-redundant ones, and (2) asks the model to emit an intermediate software artifact (by default: test cases) before the final code. On three open benchmarks (Python, Java, JavaScript) and three open LLMs, AceCoder raises execution-based accuracy (Pass@k) substantially versus few-shot and retrieval baselines. Human reviewers also rate its outputs as more correct and maintainable. The method is lightweight (no fine-tuning) but depends on having a retrieval corpus with relevant examples.
Problem Statement
Current prompting methods were built for natural language and underperform on code. Code generation needs two things: clear requirement understanding (what to write) and useful implementation examples (how to write). Off-the-shelf few-shot and chain-of-thought prompts miss one or both.
Main Contribution
A prompting pipeline (AceCoder) combining example retrieval, an example selector, and a prompt analyzer that inserts a preliminary artifact (e.g., test cases) into each example.
Guided code generation: force the model to first emit a preliminary (test cases) to clarify inputs/outputs and edge cases before producing code.
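The guided-generation idea can be illustrated with a minimal prompt-building sketch. It assumes each example is a triple of requirement, test cases, and code; the `Example` class, the `build_prompt` function, and the `# Requirement:` / `# Test cases:` / `# Code:` headers are hypothetical names chosen for illustration, not the paper's actual template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    requirement: str  # natural-language task description
    tests: str        # preliminary artifact (here: test cases)
    code: str         # reference solution


def build_prompt(examples: List[Example], requirement: str) -> str:
    """Concatenate <requirement, tests, code> triples, then the new requirement.

    The prompt ends right after the test-case header, so the model is nudged
    to write test cases first and the final code second.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"# Requirement:\n{ex.requirement}\n"
            f"# Test cases:\n{ex.tests}\n"
            f"# Code:\n{ex.code}\n"
        )
    parts.append(f"# Requirement:\n{requirement}\n# Test cases:\n")
    return "\n".join(parts)
```

Ending the prompt on the test-case header is one simple way to request the preliminary before the code; other layouts may work as well.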
Key Findings
AceCoder increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks, e.g., 26.74% vs 20.40% on MBPP and 28.38% vs 16.63% on MBJP with CodeGeeX-13B.
AceCoder improves over retrieval-based prompt baselines (which just insert similar code).
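For readers who want to reproduce the metric, Pass@k is commonly estimated by sampling n programs per task, counting the c that pass all tests, and computing 1 - C(n-c, k) / C(n, k). This summary does not state which estimator the paper uses, so the sketch below is an assumption; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task.

    n: number of sampled programs, c: number that pass all tests,
    k: budget of programs the user is allowed to try.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving Pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the fraction of sampled programs that pass. A benchmark score is the mean of this value over all tasks.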
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Pass@1 (AceCoder vs few-shot) | MBPP: 26.74% vs 20.40% (CodeGeeX-13B) | few-shot prompting | +31.1% relative | MBPP (Python) | Table 2 (CodeGeeX-13B row) | Table 2 |
| Pass@1 (AceCoder vs few-shot) | MBJP: 28.38% vs 16.63% (CodeGeeX-13B) | few-shot prompting | +70.7% relative | MBJP (Java) | Table 2 (CodeGeeX-13B row) | Table 2 |
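The Delta column reports relative gains, not percentage-point differences. The short check below recomputes both rows from the Pass@1 values in the table; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Pass@1 values for CodeGeeX-13B, taken from the table above.
mbpp = relative_gain(26.74, 20.40)
mbjp = relative_gain(28.38, 16.63)

print(f"MBPP: +{mbpp:.1f}% relative ({26.74 - 20.40:.2f} points)")
# MBPP: +31.1% relative (6.34 points)
print(f"MBJP: +{mbjp:.1f}% relative ({28.38 - 16.63:.2f} points)")
# MBJP: +70.7% relative (11.75 points)
```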
What To Try In 7 Days
Index your codebase or public snippets with BM25/Lucene and use problem text as queries.
Implement a simple selector (n-gram overlap + decay) to pick 2–3 non-redundant examples.
Add example triples with test-case preliminaries to your prompt template and compare Pass@1 on a small set of tasks.
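The first two steps above can be prototyped without external dependencies. The sketch below is a rough illustration under stated assumptions, not the paper's implementation: `bm25_rank` is a plain Okapi BM25 scorer with the common defaults `k1=1.5` and `b=0.75`, and `select_examples` is a hypothetical greedy selector that scores candidates by bigram overlap with the query and multiplies the weight of already-covered bigrams by `decay`. A production setup would more likely use Lucene or another search engine for the retrieval step.

```python
import math
import re
from collections import Counter
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices sorted by Okapi BM25 score, best first."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def ngrams(tokens: List[str], n: int = 2) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select_examples(query: str, candidates: List[str], k: int = 3,
                    decay: float = 0.5, n: int = 2) -> List[int]:
    """Greedily pick up to k candidate indices by weighted n-gram overlap.

    Each query n-gram starts with weight 1.0; once a chosen candidate covers
    it, its weight is multiplied by `decay`, so near-duplicates score lower.
    """
    weights = {g: 1.0 for g in ngrams(tokenize(query), n)}
    cand_grams = [ngrams(tokenize(c), n) for c in candidates]
    remaining = set(range(len(candidates)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        # sorted() makes ties resolve to the lowest index.
        best = max(sorted(remaining),
                   key=lambda i: sum(weights.get(g, 0.0) for g in cand_grams[i]))
        chosen.append(best)
        remaining.remove(best)
        for g in cand_grams[best]:
            if g in weights:
                weights[g] *= decay
    return chosen
```

A typical flow would call `bm25_rank` with the problem text to get a shortlist, then pass the shortlisted texts to `select_examples` to pick the 2–3 examples that go into the prompt. The decay value and the bigram size are tuning knobs to adjust on your own tasks.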
Risks & Boundaries
Limitations
Requires a retrieval corpus with relevant examples; performance falls back to few-shot when retrieval fails.
Relies on datasets that include test cases; extracting preliminaries from arbitrary codebases may be noisy.
When Not To Use
You lack a searchable corpus of relevant code or cannot extract test cases or other preliminaries.
Prompt length or inference budget prevents adding multiple example triples.
Failure Modes
Retrieved examples are irrelevant or misleading, producing wrong guidance.
Selector criteria could omit rare but critical examples if n-gram overlap misses semantic similarity.

