RRTF trains a 15B code LLM by ranking test-and-teacher outputs; PanGu-Coder2 hits ~62% pass@1 on HumanEval

July 27, 2023 | 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

13

Authors

Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, Qianxiang Wang

Why It Matters For Business

RRTF provides a lower-cost, scalable way to improve code-generation correctness by using unit tests and stronger-model outputs as ranked supervision; this delivers higher-quality code models that are faster and cheaper to run after quantization.

Summary TLDR

The authors propose RRTF (Rank Responses to align Test & Teacher Feedback): sample multiple code outputs, score them by unit tests and teacher preferences, then train with a ranking loss plus supervised loss. Applied to StarCoder 15B, this produces PanGu-Coder2. On HumanEval it achieves ~61.6% pass@1 (n=200 sampling) and 62.2% greedy pass@1, outperforming prior open-source code models on HumanEval, CoderEval and a curated LeetCode set. The paper also reports inference optimizations (FlashAttention, quantization) and emphasizes dataset size and 3–4 training epochs as key factors.
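
The sample, score, and rank step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Candidate` type, the four-level outcome scale, and the rule that teacher outputs win ties against student outputs with the same test outcome are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical outcome scale, worst to best. The labels follow the
# compile/runtime/partial/all-pass buckets mentioned in this summary.
OUTCOME_SCORE = {
    "compile_error": 0,
    "runtime_error": 1,
    "partial_pass": 2,
    "all_pass": 3,
}

@dataclass
class Candidate:
    source: str   # "teacher" or "student" (assumed labels)
    code: str
    outcome: str  # one of the OUTCOME_SCORE keys

def rank_candidates(cands):
    """Return candidates sorted best-first.

    Primary key: unit-test outcome.
    Secondary key (assumption): teacher outputs win ties.
    """
    return sorted(
        cands,
        key=lambda c: (OUTCOME_SCORE[c.outcome], c.source == "teacher"),
        reverse=True,
    )

cands = [
    Candidate("student", "def f(x): return x +", "compile_error"),
    Candidate("student", "def f(x): return x + 1", "all_pass"),
    Candidate("teacher", "def f(x): return 1 + x", "all_pass"),
    Candidate("student", "def f(x): return x", "partial_pass"),
]
ranked = rank_candidates(cands)
print([(c.source, c.outcome) for c in ranked])
```

The ranked list is what a ranking loss would consume; only the relative order matters, so the integer scores can be replaced by any monotone scale.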

Problem Statement

RL approaches that directly use unit-test results are slow, unstable, and costly for large code models. The paper proposes a simpler, data-efficient alternative that ranks candidate programs (by tests and stronger 'teacher' outputs) and fine-tunes a code model using ranking and cross-entropy losses to boost functional correctness.
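
To make the combination of a ranking loss and a cross-entropy loss concrete, here is a small pure-Python sketch. The hinge-style pairwise form and the length-normalized sequence score are assumptions chosen for illustration; this summary does not give the paper's exact formula, and the log-probabilities below are made-up numbers.

```python
import math

def sequence_score(token_logprobs):
    """Length-normalized log-probability of one response."""
    return sum(token_logprobs) / len(token_logprobs)

def pairwise_rank_loss(scores, rewards):
    """Hinge-style ranking loss (an assumed form, not taken from the paper).

    For every pair where response i has a lower reward than response j,
    add the margin by which the model scores i above j (zero otherwise).
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                loss += max(0.0, scores[i] - scores[j])
    return loss

def cross_entropy(token_logprobs):
    """Mean negative log-likelihood of the target (teacher) tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

# Toy per-token log-probs under the student model (made-up numbers).
teacher_lp = [math.log(0.5), math.log(0.25)]  # teacher response, reward 3
student_lp = [math.log(0.5), math.log(0.5)]   # student response, reward 1

scores = [sequence_score(teacher_lp), sequence_score(student_lp)]
rewards = [3, 1]

rank_term = pairwise_rank_loss(scores, rewards)
ce_term = cross_entropy(teacher_lp)
total = rank_term + ce_term
print(rank_term, ce_term, total)
```

In this toy case the student model assigns a higher normalized score to its own lower-reward response, so the ranking term is positive; the cross-entropy term separately pulls probability mass toward the teacher's tokens.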

Main Contribution

Propose RRTF, a ranking-feedback fine-tuning pipeline that uses unit tests and teacher-model outputs as ranked supervision.

Produce PanGu-Coder2 (15B) by applying RRTF to StarCoder 15B, improving pass@1 on HumanEval to ~61.6% (n=200) and 62.2% (greedy).

Show practical training and deployment steps: dataset construction via Evol-Instruct (68k instruction-solution pairs after filtering), training details (global batch size 512, 6 epochs), and inference optimizations including quantization.

Key Findings

PanGu-Coder2 achieves state-of-the-art pass@1 among open-source models on HumanEval.

Numbers: pass@1 = 61.64% (n=200 sampling); greedy pass@1 = 62.20%

PanGu-Coder2 improves over the previous best open model (WizardCoder) on HumanEval.

Numbers: 61.64% vs. WizardCoder 57.30% pass@1 (+4.34 percentage points absolute)

Quantization and engine optimizations (CTranslate2 int8) roughly halve GPU memory and more than double decoding speed relative to float16.

Numbers: float16 32.36 GB and 75 ms/token -> int8 (CTranslate2) 16.29 GB and 33 ms/token; reported HumanEval score 64.63% (CTranslate2)

Training benefits from larger, varied instruction-solution data and converges in ~3–4 epochs.

Numbers: stable best performance after ~3 epochs; dataset sizes tested: 18k, 38k, 68k
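
For readers unfamiliar with the "pass@1 (n=200 sampling)" notation used in the findings above, the sketch below shows the unbiased pass@k estimator commonly used with HumanEval: draw n samples per problem, count the c samples that pass all tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). That the paper uses exactly this estimator is an assumption here, and the per-problem counts are invented for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass all tests."""
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up per-problem pass counts out of n = 200 samples each.
n = 200
correct_counts = [200, 120, 0, 50]

per_problem = [pass_at_k(n, c, 1) for c in correct_counts]
benchmark_pass_at_1 = sum(per_problem) / len(per_problem)
print(per_problem, benchmark_pass_at_1)
```

For k = 1 the estimator reduces to c / n per problem, averaged over problems. Greedy pass@1, by contrast, uses a single deterministic sample per problem.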

Results

HumanEval pass@1 (n=200 sampling)

Value: PanGu-Coder2 61.64%

Baseline: WizardCoder 57.30%

HumanEval greedy pass@1

Value: PanGu-Coder2 62.20%

Baseline: StarCoder 32.93%

CoderEval greedy pass@1

Value: PanGu-Coder2 38.26%

Baseline: WizardCoder 33.48%

LeetCode greedy (easy/medium/hard solved counts)

Value: PanGu-Coder2 32 / 30 / 10

Baseline: WizardCoder 29 / 22 / 7

Inference memory & speed (float16 -> int8 CTranslate2)

Value: 32.36 GB and 75 ms/token -> 16.29 GB and 33 ms/token

Baseline: PanGu-Coder2 float16
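
The memory and latency figures above translate into ratios as follows. The snippet is plain arithmetic on the reported numbers; the variable names are illustrative only.

```python
# Reported figures: float16 baseline vs. int8 (CTranslate2).
fp16_mem_gb, fp16_ms_per_token = 32.36, 75
int8_mem_gb, int8_ms_per_token = 16.29, 33

mem_reduction = 1 - int8_mem_gb / fp16_mem_gb    # fraction of memory saved
speedup = fp16_ms_per_token / int8_ms_per_token  # latency ratio
fp16_tokens_per_s = 1000 / fp16_ms_per_token
int8_tokens_per_s = 1000 / int8_ms_per_token

print(f"memory reduction: {mem_reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
print(f"throughput: {fp16_tokens_per_s:.1f} -> {int8_tokens_per_s:.1f} tokens/s")
```

This works out to a memory saving just under 50% and a per-token speedup of about 2.3x for single-stream decoding.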

Who Should Care

What To Try In 7 Days

Replicate a small RRTF loop: sample student+teacher outputs on ~100 problems, rank by unit tests, fine-tune with ranking loss.

Build a parallel unit-test runner to label outputs as compile/runtime/partial/all-pass and use that ranking for supervised selection.

Quantize a finalist model with CTranslate2 int8 and measure memory and latency vs float16 on representative prompts.
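
For the latency comparison in the last step, a small backend-agnostic timing harness may be enough to start. The `generate(prompt, max_new_tokens)` callable is hypothetical: wrap whichever runtime you are testing (float16 or int8) so that it returns the list of generated tokens. `fake_generate` below only simulates a model so the sketch runs on its own.

```python
import time
from statistics import mean

def ms_per_token(generate, prompts, max_new_tokens=64):
    """Average wall-clock milliseconds per generated token.

    `generate(prompt, max_new_tokens)` is a hypothetical callable that
    returns the list of tokens generated for one prompt.
    """
    per_prompt = []
    for prompt in prompts:
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if tokens:  # skip empty generations to avoid division by zero
            per_prompt.append(elapsed_ms / len(tokens))
    if not per_prompt:
        raise ValueError("no tokens were generated for any prompt")
    return mean(per_prompt)

def fake_generate(prompt, max_new_tokens):
    """Stand-in for a real model: sleeps ~4 ms and returns 4 tokens."""
    time.sleep(0.004)
    return ["tok"] * 4

latency = ms_per_token(fake_generate, ["def add(a, b):", "def fib(n):"])
print(f"{latency:.2f} ms/token")
```

Run the same prompts through both model variants, discard a few warm-up calls, and record peak GPU memory separately with your runtime's own tooling.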

Optimization Features

Token Efficiency

  • greedy decoding recommended for practical evaluation

Infra Optimization

  • use of efficient attention and quantization reduces GPU memory footprint

Model Optimization

  • decoder-only Transformer with Multi-Query-Attention
  • learned absolute positional embeddings
  • max context length 8192

System Optimization

  • parallel offline sampling and large-scale parallel test execution
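
A toy version of a parallel test runner that assigns the compile/runtime/partial/all-pass labels might look like this. It is a sketch under assumptions: tests are given as `assert` statement strings, any assertion failure without a crash is labeled `partial_pass`, and candidates run in-process with `exec`, which is unsafe for untrusted model output. A real pipeline would isolate each run in a sandboxed subprocess with a timeout.

```python
from concurrent.futures import ThreadPoolExecutor

def label_candidate(code, tests):
    """Classify one candidate program.

    Returns "compile_error", "runtime_error", "partial_pass", or "all_pass".
    `tests` is a list of assert statements given as strings (assumption).
    """
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return "compile_error"
    namespace = {}
    try:
        exec(compiled, namespace)  # WARNING: runs untrusted code in-process
    except Exception:
        return "runtime_error"
    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))
            passed += 1
        except AssertionError:
            pass  # a failed assertion counts as a failed test, not a crash
        except Exception:
            return "runtime_error"
    return "all_pass" if passed == len(tests) else "partial_pass"

tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
candidates = [
    "def add(a, b): return a + b",
    "def add(a, b): return abs(a) + abs(b)",
    "def add(a, b) return a + b",
    "def add(a, b): return a + c",
]

# Threads keep the sketch simple; they do not give CPU parallelism in
# CPython, so large-scale runs would use processes or separate machines.
with ThreadPoolExecutor(max_workers=4) as pool:
    labels = list(pool.map(lambda c: label_candidate(c, tests), candidates))
print(labels)
```

The resulting labels can be mapped to ordinal scores and used directly as the ranking signal.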

Training Optimization

  • RRTF ranking loss plus cross-entropy on teacher output
  • Evol-Instruct to expand instruction-solution pairs
  • global batch size 512; trained for 6 epochs on the 15B model (best performance reached after ~3 epochs)

Inference Optimization

  • FlashAttention to reduce compute and memory
  • CTranslate2 int8 quantization for faster, smaller inference
  • GPTQ-based quantization experiments

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Ranking depends on unit tests and teacher quality; weak tests or biased teachers limit gains.
  • Reported dataset (68k) and model checkpoints are not publicly released here, hindering exact reproduction.
  • Improvements shown on standard benchmarks; real-world codebases may differ in distribution and safety needs.

When Not To Use

  • If you lack reliable unit tests to rank outputs.
  • When you need instruction-following with verbose human-style comments rather than strict functional correctness.
  • On very small models where sampling and ranking costs outweigh fine-tuning benefits.

Failure Modes

  • Teacher outputs can bias the student toward teacher mistakes when teacher score filtering is imperfect.
  • Ranking ties resolved in favor of teachers may reduce diversity and miss novel correct solutions.
  • GPTQ quantization sometimes degraded quality; verify on your own test suite.

Core Entities

Models

  • PanGu-Coder2 (15B)
  • StarCoder (15B)
  • WizardCoder (15B)
  • CodeGen-mono (16B)
  • CodeT5+ (16B)
  • CodeGeeX (13B)
  • GPT-3.5
  • GPT-4

Metrics

  • pass@1
  • pass@10
  • pass@100
  • greedy pass@1
  • ms/token
  • GPU memory (GB)

Datasets

  • HumanEval
  • CoderEval
  • LeetCode (post-2022.7 subset)
  • CodeAlpaca-20k (base for Evol-Instruct)
  • Evol-Instruct generated dataset (68k after filtering)

Benchmarks

  • HumanEval
  • CoderEval
  • LeetCode