Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
13
Why It Matters For Business
RRTF offers a lower-cost, more scalable alternative to RL for improving code-generation correctness: it uses unit-test results and stronger-model outputs as ranked supervision. Combined with quantization, the resulting models are also faster and cheaper to serve.
Summary TLDR
The authors propose RRTF (Rank Responses to align Test & Teacher Feedback): sample multiple code outputs, score them by unit tests and teacher preferences, then train with a ranking loss plus a supervised loss. Applied to StarCoder 15B, this produces PanGu-Coder2. On HumanEval it reaches ~61.6% pass@1 when estimated from n=200 samples and 62.2% pass@1 with greedy decoding, outperforming prior open-source code models on HumanEval, CoderEval and a curated LeetCode set. The paper also reports inference optimizations (FlashAttention, quantization) and identifies dataset size and the number of training epochs as key factors, with performance levelling off after roughly 3–4 epochs.
Problem Statement
RL approaches that directly use unit-test results are slow, unstable, and costly for large code models. The paper proposes a simpler, data-efficient alternative that ranks candidate programs (by tests and stronger 'teacher' outputs) and fine-tunes a code model using ranking and cross-entropy losses to boost functional correctness.
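As a rough illustration of the ranking step described above, the sketch below turns scored candidates into (better, worse) preference pairs. The `Candidate` type, the `OUTCOME_SCORE` scale, and the rule that a teacher output wins ties against a student output are assumptions made for this example, not the paper's exact procedure.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical outcome scale: a higher value means a better test result.
OUTCOME_SCORE = {
    "compile_error": 0,
    "runtime_error": 1,
    "partial_pass": 2,
    "all_pass": 3,
}


@dataclass(frozen=True)
class Candidate:
    code: str
    outcome: str        # one of the keys of OUTCOME_SCORE
    from_teacher: bool  # True if produced by the stronger (teacher) model


def rank_key(c: Candidate) -> tuple[int, int]:
    # Test outcome comes first; on equal outcomes the teacher output
    # is preferred (an assumption of this sketch).
    return (OUTCOME_SCORE[c.outcome], int(c.from_teacher))


def preference_pairs(cands: list[Candidate]) -> list[tuple[Candidate, Candidate]]:
    """Return (better, worse) pairs for every strictly ordered pair."""
    pairs = []
    for a, b in combinations(cands, 2):
        ka, kb = rank_key(a), rank_key(b)
        if ka > kb:
            pairs.append((a, b))
        elif kb > ka:
            pairs.append((b, a))
        # Equal keys carry no preference signal, so the pair is skipped.
    return pairs
```

Pairs with identical keys are dropped rather than ordered arbitrarily, so two student samples with the same test outcome contribute nothing to the ranking signal.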
Main Contribution
Propose RRTF, a ranking-feedback fine-tuning pipeline that uses unit tests and teacher-model outputs as ranked supervision.
Produce PanGu-Coder2 (15B) by applying RRTF to StarCoder 15B, improving pass@1 on HumanEval to ~61.6% (n=200) and 62.2% (greedy).
Show practical deployment steps: dataset construction via Evol-Instruct (68k pairs after filtering), training details (global batch size 512, 6 epochs), and inference optimizations including quantization.
Key Findings
PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.
PanGu-Coder2 improves over the previous best open model (WizardCoder) on HumanEval.
Quantization and engine optimizations cut GPU memory and speed up inference substantially.
Training benefits from larger, varied instruction-solution data and converges in ~3–4 epochs.
Results
HumanEval pass@1 (n=200 sampling): ~61.6%
HumanEval greedy pass@1: 62.2%
CoderEval greedy pass@1
LeetCode greedy (easy/medium/hard solved counts)
Inference memory & speed (float16 -> int8 CTranslate2)
Who Should Care
What To Try In 7 Days
Replicate a small RRTF loop: sample student and teacher outputs on ~100 problems, rank them by unit-test results, and fine-tune with a ranking loss.
Build a parallel unit-test runner that labels each output as compile error, runtime error, partial pass, or all-pass, and use that ranking for supervised selection.
Quantize a finalist model with CTranslate2 int8 and measure memory and latency vs float16 on representative prompts.
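A minimal sketch of the unit-test labeling step for Python candidates, assuming each test is a standalone assertion string. The function names `label_candidate` and `label_all` are hypothetical, and folding "no test passed" into `runtime_error` is a simplification of this example. It runs candidates in-process with `exec`, which is unsafe for untrusted model output; a real runner would need sandboxed subprocesses with timeouts.

```python
from concurrent.futures import ThreadPoolExecutor


def label_candidate(source: str, tests: list[str]) -> str:
    """Classify one candidate as compile_error, runtime_error,
    partial_pass or all_pass."""
    if not tests:
        raise ValueError("at least one test is required")
    try:
        compiled = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "compile_error"

    namespace: dict = {}
    try:
        exec(compiled, namespace)  # define the candidate's functions
    except Exception:
        return "runtime_error"

    passed = 0
    for test in tests:
        try:
            # Each test gets a copy so it cannot pollute later tests.
            exec(test, dict(namespace))
            passed += 1
        except Exception:
            pass  # failed assertion or crash: the test does not count

    if passed == len(tests):
        return "all_pass"
    if passed > 0:
        return "partial_pass"
    return "runtime_error"  # simplification: nothing passed


def label_all(sources: list[str], tests: list[str], workers: int = 8) -> list[str]:
    """Label many candidates concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: label_candidate(s, tests), sources))
```

Threads are used here only to keep the example short; CPU-bound test execution at scale would more likely be spread across processes or machines.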
Optimization Features
Token Efficiency
- greedy decoding recommended for practical evaluation
Infra Optimization
- use of efficient attention and quantization reduces GPU memory footprint
Model Optimization
- decoder-only Transformer with Multi-Query-Attention
- learned absolute positional embeddings
- max context length 8192
System Optimization
- parallel offline sampling and large-scale parallel test execution
Training Optimization
- RRTF ranking loss plus cross-entropy on teacher output
- Evol-Instruct to expand instruction-solution pairs
- global batch size 512; trained for 6 epochs on the 15B model
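The combination of a ranking loss with cross-entropy on the teacher output can be sketched in plain Python as follows. This uses a pairwise hinge over length-normalized log-probabilities in the style of RRHF; the exact formulation in the paper may differ (for example, by weighting pairs by their score gap), and all function names here are hypothetical.

```python
def sequence_score(token_logprobs: list[float]) -> float:
    """Length-normalized log-probability the model assigns to a response."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores: list[float], ranks: list[int]) -> float:
    """Pairwise hinge: penalize a lower-ranked response whenever the model
    scores it above a higher-ranked one. A larger rank value means better."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ranks[i] < ranks[j]:
                loss += max(0.0, scores[i] - scores[j])
    return loss


def cross_entropy(teacher_token_logprobs: list[float]) -> float:
    """Negative log-likelihood of the teacher response."""
    return -sum(teacher_token_logprobs)


def rrtf_style_loss(all_logprobs: list[list[float]], ranks: list[int],
                    teacher_index: int) -> float:
    """Ranking term over all responses plus supervised term on the teacher."""
    scores = [sequence_score(lp) for lp in all_logprobs]
    return ranking_loss(scores, ranks) + cross_entropy(all_logprobs[teacher_index])
```

When the model already orders responses consistently with their ranks, the ranking term is zero and only the supervised term remains, so the loss reduces to ordinary fine-tuning on the teacher output.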
Inference Optimization
- FlashAttention to reduce compute and memory
- CTranslate2 int8 quantization for faster, smaller inference
- GPTQ-based quantization experiments
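To compare a float16 baseline with a quantized build on the ms/token metric, a small backend-agnostic timing harness is enough. The sketch below assumes a hypothetical `generate` callable that takes a prompt and returns the sequence of generated tokens; wiring it to a specific engine (CTranslate2 or otherwise) is left out, and GPU memory would have to be measured separately.

```python
import time
from typing import Callable, Sequence


def ms_per_token(generate: Callable[[str], Sequence],
                 prompts: Sequence[str],
                 warmup: int = 1) -> float:
    """Average wall-clock milliseconds per generated token."""
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for prompt in prompts[:warmup]:
        generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = time.perf_counter() - start

    if total_tokens == 0:
        raise ValueError("generate() produced no tokens")
    return 1000.0 * elapsed / total_tokens
```

Running the same prompt set through both builds and comparing the two numbers gives a like-for-like latency figure; for GPU backends, `generate` should block until the output is actually available so that asynchronous execution does not hide latency.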
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Ranking depends on unit tests and teacher quality; weak tests or biased teachers limit gains.
- The 68k dataset and the model checkpoints are not publicly released, which hinders exact reproduction.
- Improvements shown on standard benchmarks; real-world codebases may differ in distribution and safety needs.
When Not To Use
- If you lack reliable unit tests to rank outputs.
- When you need instruction-following with verbose human-style comments rather than strict functional correctness.
- On very small models where sampling and ranking costs outweigh fine-tuning benefits.
Failure Modes
- Teacher outputs can bias the student toward teacher mistakes when teacher score filtering is imperfect.
- Ranking ties resolved in favor of teachers may reduce diversity and miss novel correct solutions.
- GPTQ quantization sometimes degraded quality; verify on your own test suite.
Core Entities
Models
- PanGu-Coder2 (15B)
- StarCoder (15B)
- WizardCoder (15B)
- CodeGen-mono (16B)
- CodeT5+ (16B)
- CodeGeeX (13B)
- GPT-3.5
- GPT-4
Metrics
- pass@1
- pass@10
- pass@100
- greedy pass@1
- ms/token
- GPU memory (GB)
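For reference, pass@k from n samples is commonly computed with the unbiased estimator introduced alongside HumanEval: with c correct samples out of n, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes this standard estimator is the one behind the n=200 figures; the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that pass all tests, k: budget evaluated.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k=1 the estimator reduces to c/n, the fraction of passing samples. The benchmark score is the mean of this value over all problems, whereas greedy pass@1 uses a single deterministic sample per problem.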
Datasets
- HumanEval
- CoderEval
- LeetCode (post-2022.7 subset)
- CodeAlpaca-20k (base for Evol-Instruct)
- Evol-Instruct generated dataset (68k after filtering)
Benchmarks
- HumanEval
- CoderEval
- LeetCode

