RRTF trains a 15B code LLM by ranking test-and-teacher outputs; PanGu-Coder2 hits ~62% pass@1 on HumanEval

July 27, 2023 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The method is practical and reproducible at scale: ranking avoids PPO complexity, yields strong benchmark gains, and pairs well with standard quantization for deployment. However, full reproduction depends on the code and data being released.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang

Why It Matters For Business

RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.

Summary TLDR

The authors propose RRTF (Rank Responses to align Test & Teacher Feedback): sample multiple code outputs, score them by unit tests and teacher preferences, then train with a ranking loss plus supervised loss. Applied to StarCoder 15B, this produces PanGu-Coder2. On HumanEval it achieves ~61.6% pass@1 (n=200 sampling) and 62.2% greedy pass@1, outperforming prior open-source code models on HumanEval, CoderEval and a curated LeetCode set. The paper also reports inference optimizations (FlashAttention, quantization) and emphasizes dataset size and 3–4 training epochs as key factors.

Problem Statement

RL approaches that directly use unit-test results are slow, unstable, and costly for large code models. The paper proposes a simpler, data-efficient alternative that ranks candidate programs (by tests and stronger 'teacher' outputs) and fine-tunes a code model using ranking and cross-entropy losses to boost functional correctness.
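
The combination of a ranking loss and a cross-entropy loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the pairwise hinge-style penalty, the reward-gap weighting, the function names (`rank_loss`, `sft_loss`, `rrtf_style_loss`) and the toy numbers are all hypothetical.

```python
from typing import List


def rank_loss(seq_logprobs: List[float], rewards: List[float]) -> float:
    """Pairwise ranking penalty (assumed form, for illustration only).

    seq_logprobs[i]: length-normalised log-probability the student assigns
                     to candidate response i.
    rewards[i]:      score derived from unit tests / teacher preference;
                     higher means better.
    For every pair where j is ranked above i, the student is penalised if it
    assigns the worse response i a higher log-probability than response j.
    """
    loss = 0.0
    n = len(rewards)
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                gap = rewards[j] - rewards[i]  # larger gap, larger weight
                violation = max(0.0, seq_logprobs[i] - seq_logprobs[j])
                loss += gap * violation
    return loss


def sft_loss(token_logprobs: List[float]) -> float:
    """Cross-entropy on the top-ranked response: negative sum of token log-probs."""
    return -sum(token_logprobs)


def rrtf_style_loss(seq_logprobs: List[float],
                    rewards: List[float],
                    best_token_logprobs: List[float]) -> float:
    """Total objective: ranking term plus supervised term."""
    return rank_loss(seq_logprobs, rewards) + sft_loss(best_token_logprobs)


if __name__ == "__main__":
    # Toy example: three candidates; the student currently prefers the worst one.
    print(rrtf_style_loss([-0.9, -0.4, -0.6], [1.0, 0.0, 0.5], [-0.2, -0.1]))
```

The ranking term is zero when the student's log-probabilities already agree with the reward ordering, so only mis-ordered pairs contribute gradient in this sketch.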

Main Contribution

Propose RRTF, a ranking-feedback fine-tuning pipeline that uses unit tests and teacher-model outputs as ranked supervision.

Produce PanGu-Coder2 (15B) by applying RRTF to StarCoder 15B, improving pass@1 on HumanEval to ~61.6% (n=200) and 62.2% (greedy).

Key Findings

PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.

Numbers: pass@1=61.64% (n=200 sampling); greedy pass@1=62.20%

Practical Use: If you need a top open-source code model for functional correctness, try RRTF-finetuned 15B models and evaluate with greedy decoding first.

Evidence Ref: Table 2 and Table 3
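
Sampling-based pass@k numbers such as the n=200 figure are commonly computed with the unbiased estimator introduced alongside HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems, where c is the number of correct samples out of n. The sketch below assumes that standard estimator applies here; the per-problem counts in the example are made up.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, one correct-count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical counts for three problems with n=200 samples each.
    print(mean_pass_at_k([200, 0, 100], n=200, k=1))
```

For k=1 the estimator reduces to c/n per problem, which is why pass@1 under sampling can differ slightly from greedy pass@1, where each problem is simply solved or not.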

PanGu-Coder2 improves over the previous best open model (WizardCoder) on HumanEval.

Numbers: 61.64% vs WizardCoder 57.30% pass@1 (+4.34 percentage points absolute)

Practical Use: Ranking-based fine-tuning can yield measurable gains over pure instruction-tuning on code benchmarks.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HumanEval pass@1 (n=200 sampling) | PanGu-Coder2 61.64% | WizardCoder 57.30% | +4.34% abs | HumanEval | Table 2 reports pass@k with n=200 sampling and temperatures | Table 2
HumanEval greedy pass@1 | PanGu-Coder2 62.20% | StarCoder 32.93% | +29.27% abs | HumanEval (greedy) | Table 3 compares greedy decoding pass@1 across models | Table 3
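
The deltas in these rows are absolute differences in percentage points. A short check of the arithmetic, which also derives the corresponding relative improvements, is shown below; the helper name `deltas` is illustrative and the input values are the ones reported in the rows above.

```python
from typing import Tuple


def deltas(value: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute delta in points, relative delta in percent of baseline)."""
    absolute = value - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    print(deltas(61.64, 57.30))  # PanGu-Coder2 vs WizardCoder, n=200 sampling
    print(deltas(62.20, 32.93))  # PanGu-Coder2 vs StarCoder, greedy decoding
```

Keeping the two notions separate matters when comparing against other reports, since a gain of a few absolute points over a strong baseline and a large relative gain over a weak one are not directly comparable.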

What To Try In 7 Days

Replicate a small RRTF loop: sample student+teacher outputs on ~100 problems, rank by unit tests, fine-tune with ranking loss.

Build a parallel unit-test runner to label outputs as compile/runtime/partial/all-pass and use that ranking for supervised selection.

Quantize a finalist model with CTranslate2 int8 and measure memory and latency vs float16 on representative prompts.
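
The test-runner step can be prototyped as below. This is a rough sketch, not the authors' pipeline: it runs candidates in-process with `exec`, which is unsafe for untrusted model output, so a real runner would add sandboxing, separate processes and timeouts. The label names, the rule that a candidate passing no tests is grouped under `runtime_error`, and the helpers `label_candidate` and `label_many` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def label_candidate(source: str, tests: List[str]) -> str:
    """Classify one candidate program against a list of assert-style tests."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "compile_error"

    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate's functions
    except Exception:
        return "runtime_error"

    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # copy so tests cannot affect each other
            passed += 1
        except Exception:
            pass  # assertion failures and crashes both count as a failed test

    if passed == len(tests):
        return "all_pass"
    if passed > 0:
        return "partial_pass"
    return "runtime_error"  # assumption: zero passing tests is grouped here


def label_many(candidates: List[str], tests: List[str], workers: int = 4) -> List[str]:
    """Label many candidates concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda src: label_candidate(src, tests), candidates))


if __name__ == "__main__":
    tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
    candidates = [
        "def add(a, b): return a + b",
        "def add(a, b): return a * b",
        "def add(a, b) return a + b",
    ]
    print(label_many(candidates, tests))
```

Threads keep the example short; CPU-bound or potentially hanging candidates are better served by a process pool with per-task timeouts.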

Optimization Features

Token Efficiency
greedy decoding recommended for practical evaluation
Infra Optimization
use of efficient attention and quantization reduces GPU memory footprint
Model Optimization
decoder-only Transformer with Multi-Query Attention
learned absolute positional embeddings
max context length 8192
System Optimization
parallel offline sampling and large-scale parallel test execution
Training Optimization
RRTF ranking loss plus cross-entropy on teacher output
Evol-Instruct to expand instruction-solution pairs
global batch size 512; trained 6 epochs on 15B model
Inference Optimization
FlashAttention to reduce compute and memory
CTranslate2 int8 quantization for faster, smaller inference
GPTQ-based quantization experiments
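
To compare a quantized model against a float16 baseline, two quantities are usually tracked: weight memory and latency per generated token. The sketch below is backend-agnostic and makes several assumptions: `generate` is a hypothetical callable standing in for whatever inference engine is being measured, the 15e9 parameter count is nominal, and the memory estimate covers weights only (no activations, KV cache, or quantization scale overhead).

```python
import time
from typing import Callable, List, Sequence


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


def ms_per_token(generate: Callable[[str], Sequence[int]], prompts: List[str]) -> float:
    """Average wall-clock milliseconds per generated token over a prompt set."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt))  # generate returns output token ids
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / max(total_tokens, 1)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(bits, "bit:", weight_memory_gb(15e9, bits), "GB")

    # Stand-in generator that "produces" 10 tokens per prompt.
    fake_generate = lambda prompt: list(range(10))
    print(ms_per_token(fake_generate, ["def add(a, b):", "def fib(n):"]))
```

Running the same harness with the float16 and int8 backends on identical prompts and decoding settings gives a like-for-like ms/token comparison.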

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Ranking depends on unit tests and teacher quality; weak tests or biased teachers limit gains.

Reported dataset (68k) and model checkpoints are not publicly released here, hindering exact reproduction.

When Not To Use

If you lack reliable unit tests to rank outputs.

When you need instruction-following with verbose human-style comments rather than strict functional correctness.

Failure Modes

Teacher outputs can bias the student toward teacher mistakes when teacher score filtering is imperfect.

Ranking ties resolved in favor of teachers may reduce diversity and miss novel correct solutions.
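
The tie-breaking failure mode is easy to see in a toy ranking function. In the sketch below, the numeric scores per test label, the `Candidate` type and the `teacher_wins_ties` switch are all hypothetical; the point is only that a student solution that passes every test still lands below a teacher solution with the same outcome when ties favor the teacher.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    source: str  # "teacher" or "student"
    label: str   # test outcome, e.g. "all_pass"
    text: str    # the generated program


# Assumed ordering of test outcomes, worst to best.
LABEL_SCORE = {"compile_error": 0, "runtime_error": 1, "partial_pass": 2, "all_pass": 3}


def rank(cands: List[Candidate], teacher_wins_ties: bool = True) -> List[Candidate]:
    """Sort best-first by test outcome; optionally prefer teacher output on ties."""
    def key(c: Candidate):
        tie_break = 1 if (teacher_wins_ties and c.source == "teacher") else 0
        return (LABEL_SCORE[c.label], tie_break)

    # sorted() is stable, so remaining ties keep their input order.
    return sorted(cands, key=key, reverse=True)


if __name__ == "__main__":
    cands = [
        Candidate("student", "all_pass", "s1"),
        Candidate("teacher", "all_pass", "t1"),
        Candidate("student", "partial_pass", "s2"),
    ]
    print([c.text for c in rank(cands)])
    print([c.text for c in rank(cands, teacher_wins_ties=False)])
```

Comparing the two orderings on real samples is one way to estimate how often the tie rule displaces an equally correct student solution.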

Core Entities

Models

PanGu-Coder2 (15B), StarCoder (15B), WizardCoder (15B), CodeGen-mono (16B), CodeT5+ (16B), CodeGeeX (13B), GPT-3.5, GPT-4

Metrics

pass@1, pass@10, pass@100, greedy pass@1, ms/token, GPU memory (GB)

Datasets

HumanEval, CoderEval, LeetCode (post-2022.7 subset), CodeAlpaca-20k (base for Evol-Instruct), Evol-Instruct generated dataset (68k after filtering)

Benchmarks

HumanEval, CoderEval, LeetCode