Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
16
Why It Matters For Business
AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.
Summary TLDR
AceCoder is a prompting technique for code-generation LLMs that (1) retrieves similar code examples and selects non-redundant ones, and (2) asks the model to emit an intermediate software artifact (by default: test cases) before the final code. On three open benchmarks (Python, Java, JavaScript) and three open LLMs, AceCoder raises execution-based accuracy (Pass@k) substantially versus few-shot and retrieval baselines. Human reviewers also rate its outputs as more correct and maintainable. The method is lightweight (no fine-tuning) but depends on having a retrieval corpus with relevant examples.
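The guided-generation idea can be illustrated with a small prompt builder. This is a minimal sketch, not the authors' template: the `Example` class, the `build_prompt` function, and the `# Requirement:` / `# Test cases:` / `# Code:` markers are all hypothetical names chosen for illustration. It only shows the ordering the summary describes, where each example presents test cases before code and the prompt ends at the point where the model should start writing test cases.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One retrieved example (hypothetical structure, not from the paper)."""
    requirement: str  # natural-language task description
    tests: str        # preliminary artifact: test cases for the task
    code: str         # reference implementation


def build_prompt(examples: List[Example], new_requirement: str) -> str:
    """Assemble a prompt in which every example reads
    requirement -> test cases -> code, then end on the new requirement
    so the model emits test cases first and code afterwards."""
    parts = []
    for ex in examples:
        parts.append(
            f"# Requirement:\n{ex.requirement}\n"
            f"# Test cases:\n{ex.tests}\n"
            f"# Code:\n{ex.code}\n"
        )
    # The section markers are an assumed format; any consistent markers work.
    parts.append(f"# Requirement:\n{new_requirement}\n# Test cases:\n")
    return "\n".join(parts)
```

Because the prompt stops right after the final `# Test cases:` marker, a completion model that follows the pattern of the examples will write tests and then code, which is the behavior the technique relies on.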
Problem Statement
Current prompting methods were built for natural language and underperform on code. Code generation needs two things: clear requirement understanding (what to write) and useful implementation examples (how to write). Off-the-shelf few-shot and chain-of-thought prompts miss one or both.
Main Contribution
A prompting pipeline (AceCoder) combining example retrieval, an example selector, and a prompt analyzer that injects an intermediate preliminary (e.g., test cases) into each example.
Guided code generation: force the model to first emit a preliminary (test cases) to clarify inputs/outputs and edge cases before producing code.
An n-gram-based selector that de-duplicates retrieved examples so prompts contain informative, non-redundant programs.
Extensive evaluation on three open LLMs and three public benchmarks showing large gains in Pass@k and favorable human ratings.
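One way to picture the n-gram-based selector is as a greedy loop over the retrieved candidates. The sketch below is an assumption-laden illustration rather than the paper's algorithm: the function names, the whitespace tokenizer, the use of unigrams and bigrams, and the `decay` value of 0.5 are all choices made here. Each round picks the candidate that covers the most still-valuable n-grams of the query, then down-weights the n-grams it covered so that near-duplicates score lower in later rounds.

```python
from typing import Dict, List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n_max: int = 2) -> Set[Gram]:
    """Return all n-grams of length 1..n_max as a set of tuples."""
    grams: Set[Gram] = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def select_examples(query: str, candidates: List[str],
                    k: int = 3, decay: float = 0.5) -> List[int]:
    """Greedily pick up to k candidate indices, discounting query
    n-grams that already-selected candidates have covered."""
    query_grams = ngrams(query.lower().split())
    weights: Dict[Gram, float] = {g: 1.0 for g in query_grams}
    # Only n-grams shared with the query count toward a candidate's score.
    covered = [ngrams(c.lower().split()) & query_grams for c in candidates]

    chosen: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda i: sum(weights[g] for g in covered[i]))
        chosen.append(best)
        remaining.remove(best)
        for g in covered[best]:
            weights[g] *= decay  # assumed decay rule: multiplicative discount
    return chosen
```

With `decay=1.0` the loop reduces to plain overlap ranking, which makes it easy to check on a small task set whether the de-duplication step is changing which examples reach the prompt.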
Key Findings
AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.
AceCoder improves over retrieval-based prompt baselines (which just insert similar code).
Human developers prefer AceCoder outputs across correctness, code smell, and maintainability.
Each module (retriever, selector, analyzer) contributes: removing any one of them in ablation lowers performance.
Results
Pass@1 (AceCoder vs few-shot)
Who Should Care
What To Try In 7 Days
Index your codebase or public snippets with BM25/Lucene and use problem text as queries.
Implement a simple selector (n-gram overlap + decay) to pick 2–3 non-redundant examples.
Add example triples with test-case preliminaries to your prompt template and compare Pass@1 on a small set of tasks.
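For a quick trial of the retrieval step before setting up Lucene, BM25 scoring can be prototyped in the standard library. The sketch below is a stand-in, not a Lucene client: the `BM25Index` class is a hypothetical name, tokenization is plain lowercase whitespace splitting, and `k1=1.2`, `b=0.75` are common defaults assumed here. It uses the non-negative IDF form `ln(1 + (N - df + 0.5) / (df + 0.5))`.

```python
import math
from collections import Counter
from typing import List


class BM25Index:
    """Tiny in-memory BM25 index over requirement texts (illustrative only)."""

    def __init__(self, docs: List[str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [d.lower().split() for d in docs]
        self.tfs = [Counter(tokens) for tokens in self.doc_tokens]
        total_len = sum(len(tokens) for tokens in self.doc_tokens)
        # Fall back to 1.0 so empty corpora do not divide by zero.
        self.avgdl = (total_len / len(docs)) if total_len else 1.0
        df: Counter = Counter()
        for tf in self.tfs:
            df.update(tf.keys())  # document frequency: one count per doc
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5))
                    for t, c in df.items()}

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query text."""
        tf, dl = self.tfs[i], len(self.doc_tokens[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in set(query.lower().split()) if t in tf)

    def top_k(self, query: str, k: int = 5) -> List[int]:
        """Indices of the k highest-scoring documents, best first."""
        ranked = sorted(range(len(self.tfs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

Indexing the requirement text of each stored example and querying with the new problem text mirrors the first step above; the returned indices can then be passed to whatever selector and prompt template are under test.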
Reproducibility
Data Urls
- MBPP
- MBJP
- MBJSP
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires a retrieval corpus with relevant examples; performance falls back to few-shot when retrieval fails.
- Relies on datasets that include test cases; extracting preliminaries from arbitrary codebases may be noisy.
- Evaluations use open LLMs (6B–13B); closed models (e.g., current API services) were not tested here.
When Not To Use
- You lack a searchable corpus of relevant code or cannot extract test cases or other preliminaries.
- Prompt length or inference budget prevents adding multiple example triples.
- You need guarantees beyond benchmark-style test cases (e.g., full formal verification).
Failure Modes
- Retrieved examples are irrelevant or misleading, producing wrong guidance.
- Selector criteria could omit rare but critical examples if n-gram overlap misses semantic similarity.
- Generated preliminaries (test cases) can be incorrect or incomplete, leading the model astray.
Core Entities
Models
- CodeGeeX-13B
- CodeGen-6B
- InCoder-6B
- Codex
Metrics
- Pass@1
- Pass@3
- Pass@5
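Pass@k is usually computed with the unbiased estimator popularized alongside the HumanEval benchmark: generate n samples per task, count the c samples that pass all tests, and estimate the probability that at least one of k randomly drawn samples passes as 1 - C(n-c, k) / C(n, k). The summary does not state which estimator the AceCoder evaluation used, so the sketch below is an assumption about the standard practice, and `pass_at_k` is a name chosen here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all tasks. For k = 1 the estimate reduces to c / n, the fraction of samples that pass.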
Datasets
- MBPP
- MBJP
- MBJSP
Benchmarks
- Pass@k
Context Entities
Models
- REDCODER
- Jigsaw
- PLBART
Metrics
- ROUGE-N
- BLEU
- BM25 score
Datasets
- HumanEval
- APPS
- CodeContests
Benchmarks
- execution-based evaluation

