RRTF trains a 15B code LLM by ranking test-and-teacher outputs; PanGu-Coder2 hits ~62% pass@1 on HumanEval

Overview

Decision Snapshot: Ready For Pilot

The method is practical and reproducible at scale: ranking avoids PPO complexity, yields strong benchmark gains, and pairs well with standard quantization for deployment. However, full reproduction requires the code and data to be released.

Citations: 13

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang

Links

Abstract / PDF

Why It Matters For Business

RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

The authors propose RRTF (Rank Responses to align Test & Teacher Feedback): sample multiple code outputs, score them by unit tests and teacher preferences, then train with a ranking loss plus supervised loss. Applied to StarCoder 15B, this produces PanGu-Coder2. On HumanEval it achieves ~61.6% pass@1 (n=200 sampling) and 62.2% greedy pass@1, outperforming prior open-source code models on HumanEval, CoderEval and a curated LeetCode set. The paper also reports inference optimizations (FlashAttention, quantization) and emphasizes dataset size and 3–4 training epochs as key factors.

Problem Statement

RL approaches that directly use unit-test results are slow, unstable, and costly for large code models. The paper proposes a simpler, data-efficient alternative that ranks candidate programs (by tests and stronger 'teacher' outputs) and fine-tunes a code model using ranking and cross-entropy losses to boost functional correctness.
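The combination of a ranking loss and a cross-entropy loss described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise form of the ranking term, the use of length-normalized log-probabilities as response scores, and the names `sequence_score`, `rank_loss`, and `rrtf_loss` are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response (assumed scoring rule)."""
    return sum(token_logprobs) / len(token_logprobs)


def rank_loss(scores: List[float], rewards: List[float]) -> float:
    """Pairwise ranking penalty (assumed form).

    For every pair where response i has a higher reward than response j,
    add a penalty only if the model scores j above i, weighted by the
    reward gap.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] > rewards[j]:
                loss += (rewards[i] - rewards[j]) * max(0.0, scores[j] - scores[i])
    return loss


def rrtf_loss(
    token_logprobs: List[Sequence[float]],
    rewards: List[float],
    teacher_index: int,
) -> float:
    """Ranking term plus cross-entropy (mean negative log-likelihood) on the teacher response."""
    scores = [sequence_score(lp) for lp in token_logprobs]
    teacher = token_logprobs[teacher_index]
    cross_entropy = -sum(teacher) / len(teacher)
    return rank_loss(scores, rewards) + cross_entropy
```

When the model already scores responses in the same order as their rewards, the ranking term is zero and only the cross-entropy term on the teacher response remains. In a real training loop the log-probabilities would come from the model being fine-tuned and the loss would be differentiated by the training framework; plain floats are used here only to keep the sketch self-contained.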

Main Contribution

Propose RRTF, a ranking-feedback fine-tuning pipeline that uses unit tests and teacher-model outputs as ranked supervision.

Produce PanGu-Coder2 (15B) by applying RRTF to StarCoder 15B, improving pass@1 on HumanEval to ~61.6% (n=200) and 62.2% (greedy).

Key Findings

PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.

Numbers: pass@1 = 61.64% (n=200 sampling); greedy pass@1 = 62.20%

Practical Use: If you need a top open-source code model for functional correctness, try RRTF-finetuned 15B models and evaluate with greedy decoding first.

Evidence Ref: Table 2 and Table 3
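For readers reproducing the sampled pass@1 figure, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval, where n samples are drawn per problem and c of them pass all tests. This summary does not state which estimator the paper used, so treat its use here as an assumption; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For k=1 the estimator reduces to c/n per problem, so sampled pass@1 is the mean fraction of passing samples, whereas greedy pass@1 scores a single deterministic completion per problem.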

PanGu-Coder2 improves over the previous best open model (WizardCoder) on HumanEval.

Numbers: 61.64% vs WizardCoder 57.30% pass@1 (+4.34 percentage points absolute)

Practical Use: Ranking-based fine-tuning can yield measurable gains over pure instruction-tuning on code benchmarks.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HumanEval pass@1 (n=200 sampling)	PanGu-Coder2 61.64%	WizardCoder 57.30%	+4.34% abs	HumanEval	Table 2 reports pass@k with n=200 sampling and temperatures	Table 2
HumanEval greedy pass@1	PanGu-Coder2 62.20%	StarCoder 32.93%	+29.27% abs	HumanEval (greedy)	Table 3 compares greedy decoding pass@1 across models	Table 3

What To Try In 7 Days

Replicate a small RRTF loop: sample student+teacher outputs on ~100 problems, rank by unit tests, fine-tune with ranking loss.

Build a parallel unit-test runner that labels each output as compile error, runtime error, partial pass, or all pass, and use the resulting ranking for supervised selection.

Quantize a finalist model with CTranslate2 int8 and measure memory and latency vs float16 on representative prompts.
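The test-labeling step in the list above could start from something like the following. It is a minimal single-process sketch, assuming candidates are Python functions checked against input/output pairs; the `Outcome` enum, `label_candidate`, and `rank_candidates` are hypothetical names, and a real runner would execute untrusted code in a sandboxed subprocess with timeouts and run candidates in parallel.

```python
from enum import IntEnum
from typing import Any, List, Tuple


class Outcome(IntEnum):
    """Test outcomes ordered from worst to best (assumed ordering)."""
    COMPILE_ERROR = 0
    RUNTIME_ERROR = 1
    PARTIAL_PASS = 2
    ALL_PASS = 3


def label_candidate(
    source: str, entry_point: str, tests: List[Tuple[tuple, Any]]
) -> Outcome:
    """Label one candidate program by compiling it and running the tests."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return Outcome.COMPILE_ERROR
    namespace: dict = {}
    try:
        exec(code, namespace)  # Unsafe for untrusted code; sandbox in practice.
        func = namespace[entry_point]
    except Exception:
        return Outcome.RUNTIME_ERROR
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            return Outcome.RUNTIME_ERROR
    # Simplification: any crash-free run that fails a test counts as partial.
    return Outcome.ALL_PASS if passed == len(tests) else Outcome.PARTIAL_PASS


def rank_candidates(
    candidates: List[str], entry_point: str, tests: List[Tuple[tuple, Any]]
) -> List[Tuple[Outcome, str]]:
    """Return (outcome, source) pairs sorted from best to worst outcome."""
    labeled = [(label_candidate(src, entry_point, tests), src) for src in candidates]
    return sorted(labeled, key=lambda pair: pair[0], reverse=True)
```

Because the sort is stable, candidates with the same outcome keep their input order, so placing teacher outputs first in `candidates` is one simple way to break ties in their favor.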

Optimization Features

Token Efficiency

greedy decoding recommended for practical evaluation

Infra Optimization

use of efficient attention and quantization reduces GPU memory footprint

Model Optimization

decoder-only Transformer with Multi-Query Attention

learned absolute positional embeddings

max context length 8192

System Optimization

parallel offline sampling and large-scale parallel test execution

Training Optimization

RRTF ranking loss plus cross-entropy on teacher output

Evol-Instruct to expand instruction-solution pairs

global batch size 512; trained 6 epochs on 15B model

Inference Optimization

FlashAttention to reduce compute and memory

CTranslate2 int8 quantization for faster, smaller inference

GPTQ-based quantization experiments
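As a rough back-of-the-envelope check on why int8 quantization shrinks the footprint, the arithmetic below estimates weight storage alone for a nominal 15B-parameter model. It is an illustrative estimate, not a figure from the paper: it ignores activations, the KV cache, and any quantization scale overhead, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 15e9  # Nominal parameter count (assumption for the estimate).

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # about 27.9 GiB
int8_gib = weight_memory_gib(NUM_PARAMS, 8)   # about 14.0 GiB
```

Halving the bits per weight halves weight storage, which is the main reason an int8 model can fit on a smaller GPU; measured memory and latency will differ, so the 7-day quantization experiment above remains the authoritative check.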

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Ranking depends on unit tests and teacher quality; weak tests or biased teachers limit gains.

The reported 68k-example dataset and the model checkpoints are not publicly released, which hinders exact reproduction.

When Not To Use

If you lack reliable unit tests to rank outputs.

When you need instruction-following with verbose human-style comments rather than strict functional correctness.

Failure Modes

Teacher outputs can bias the student toward teacher mistakes when teacher score filtering is imperfect.

Ranking ties resolved in favor of teachers may reduce diversity and miss novel correct solutions.

Core Entities

Models

PanGu-Coder2 (15B), StarCoder (15B), WizardCoder (15B), CodeGen-mono (16B), CodeT5+ (16B), CodeGeeX (13B), GPT-3.5, GPT-4

Metrics

pass@1, pass@10, pass@100, greedy pass@1, ms/token, GPU memory (GB)

Datasets

HumanEval, CoderEval, LeetCode (post-2022.7 subset), CodeAlpaca-20k (base for Evol-Instruct), Evol-Instruct generated dataset (68k after filtering)

Benchmarks

HumanEval, CoderEval, LeetCode
