Overview
The method is practical at scale: ranking-based fine-tuning avoids the complexity of PPO, yields strong benchmark gains, and pairs well with standard quantization for deployment. However, full reproduction depends on the code and data being released.
Citations: 13
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.
Summary TLDR
The authors propose RRTF (Rank Responses to align Test & Teacher Feedback): sample multiple code outputs, score them by unit tests and teacher preferences, then train with a ranking loss plus supervised loss. Applied to StarCoder 15B, this produces PanGu-Coder2. On HumanEval it achieves ~61.6% pass@1 (n=200 sampling) and 62.2% greedy pass@1, outperforming prior open-source code models on HumanEval, CoderEval and a curated LeetCode set. The paper also reports inference optimizations (FlashAttention, quantization) and emphasizes dataset size and 3–4 training epochs as key factors.
Problem Statement
RL approaches that directly use unit-test results are slow, unstable, and costly for large code models. The paper proposes a simpler, data-efficient alternative that ranks candidate programs (by tests and stronger 'teacher' outputs) and fine-tunes a code model using ranking and cross-entropy losses to boost functional correctness.
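The combination of a ranking loss and a cross-entropy loss described above can be sketched in plain Python. This is a generic pairwise formulation, not necessarily the exact loss used in the paper: the function names, the length normalization, and the `ce_weight` parameter are all assumptions made for illustration.

```python
def sequence_log_prob(token_log_probs):
    """Length-normalized log-probability of one sampled response (assumed form)."""
    return sum(token_log_probs) / len(token_log_probs)


def ranking_loss(scores, seq_log_probs):
    """Pairwise ranking loss over all candidate responses to one prompt.

    A pair contributes only when the lower-scored response is MORE likely
    under the model than the higher-scored one; the penalty is weighted
    by the score gap.
    """
    loss = 0.0
    for r_i, p_i in zip(scores, seq_log_probs):
        for r_j, p_j in zip(scores, seq_log_probs):
            if r_i > r_j:
                loss += (r_i - r_j) * max(0.0, p_j - p_i)
    return loss


def supervised_loss(token_log_probs):
    """Mean negative log-likelihood (cross-entropy) of a single response."""
    return -sum(token_log_probs) / len(token_log_probs)


def rrtf_style_loss(scores, token_log_probs_per_response, ce_weight=1.0):
    """Ranking loss plus cross-entropy on the best-scored response.

    `ce_weight` is a hypothetical knob; the summary does not state how
    the two terms are balanced.
    """
    seq = [sequence_log_prob(t) for t in token_log_probs_per_response]
    best = max(range(len(scores)), key=scores.__getitem__)
    return ranking_loss(scores, seq) + ce_weight * supervised_loss(
        token_log_probs_per_response[best]
    )
```

In a real training loop the per-token log-probabilities would come from the student model and the loss would be computed on tensors so that gradients flow; the pure-Python version only shows how the two terms are put together.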
Main Contribution
Propose RRTF, a ranking-feedback fine-tuning pipeline that uses unit tests and teacher-model outputs as ranked supervision.
Produce PanGu-Coder2 (15B) by applying RRTF to StarCoder 15B, improving pass@1 on HumanEval to ~61.6% (n=200) and 62.2% (greedy).
Key Findings
PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.
PanGu-Coder2 improves over the previous best open model (WizardCoder) on HumanEval.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| HumanEval pass@1 (n=200 sampling) | PanGu-Coder2 61.64% | WizardCoder 57.30% | +4.34 pp (absolute) | HumanEval | Table 2 reports pass@k with n=200 sampling and temperatures | Table 2 |
| HumanEval greedy pass@1 | PanGu-Coder2 62.20% | StarCoder 32.93% | +29.27 pp (absolute) | HumanEval (greedy) | Table 3 compares greedy decoding pass@1 across models | Table 3 |
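For readers unfamiliar with the metric, pass@k from n samples per problem is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of samples that pass all tests. The sketch below assumes that estimator; this summary does not confirm that the paper used exactly this computation, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples passing all tests, k: attempt budget.
    """
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem pass counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For k=1 the estimator reduces to c/n, so a problem with 50 passing samples out of 200 contributes 0.25. Greedy pass@1, by contrast, is simply the fraction of problems solved by a single deterministic decode.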
What To Try In 7 Days
Replicate a small RRTF loop: sample student+teacher outputs on ~100 problems, rank by unit tests, fine-tune with ranking loss.
Build a parallel unit-test runner to label outputs as compile/runtime/partial/all-pass and use that ranking for supervised selection.
Quantize a finalist model with CTranslate2 int8 and measure memory and latency vs float16 on representative prompts.
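The first two items can be prototyped with a small labeler that runs each candidate against assert-style tests and sorts candidates by outcome. This is a minimal sketch under stated assumptions: the label names, the numeric scores, and the choice to treat "no test passed" as a runtime error are illustrative, not taken from the paper. It also runs code in-process with `exec`, which is unsafe for untrusted model output; a real runner would use sandboxed subprocesses with timeouts.

```python
# Hypothetical ordering of outcomes, worst to best.
LABEL_SCORE = {
    "compile_error": 0,
    "runtime_error": 1,
    "partial_pass": 2,
    "all_pass": 3,
}


def label_candidate(source, tests):
    """Classify one candidate program.

    source: Python source code as a string.
    tests:  list of assert statements (strings) that call into the candidate.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "compile_error"

    namespace = {}
    try:
        exec(code, namespace)  # defines the candidate's functions
    except Exception:
        return "runtime_error"

    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # copy so tests cannot interfere
            passed += 1
        except Exception:
            pass

    if passed == len(tests):
        return "all_pass"
    # Assumption: zero passing tests is grouped with runtime errors.
    return "partial_pass" if passed > 0 else "runtime_error"


def rank_candidates(candidates, tests):
    """Return candidates sorted from best to worst test outcome."""
    return sorted(
        candidates,
        key=lambda src: LABEL_SCORE[label_candidate(src, tests)],
        reverse=True,
    )
```

The resulting order can feed a ranking loss directly, or be used to keep only the top candidate per problem for supervised selection.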
Risks & Boundaries
Limitations
Ranking depends on unit tests and teacher quality; weak tests or biased teachers limit gains.
The reported dataset (68k) and model checkpoints have not been publicly released, which hinders exact reproduction.
When Not To Use
If you lack reliable unit tests to rank outputs.
When you need instruction-following with verbose human-style comments rather than strict functional correctness.
Failure Modes
Teacher outputs can bias the student toward teacher mistakes when teacher score filtering is imperfect.
Ranking ties resolved in favor of teachers may reduce diversity and miss novel correct solutions.

