Overview
The editorial-centric protocol and dataset are straightforward to adopt; the evidence is moderate (83 problems, 19 models) and validated by expert annotations and an LLM judge.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
If you build or benchmark code-generation products, separate reasoning (the algorithm) from implementation: measuring both reduces misdiagnosis and surfaces whether to invest in better planners or in more robust code generation.
Who Should Care
Summary TLDR
The paper argues that competitive-programming (CP) evaluation should separate algorithmic problem solving (the idea) from implementation (the code). The authors build an editorial-centric pipeline and a dataset of 83 ICPC-style problems with expert gold editorials and judge test suites. Across 19 LLMs, they show gold editorials boost pass@1 substantially (overall +14.5 percentage points) while self-generated editorials give small or unreliable gains. Even with correct editorials, many models still fail to implement solutions correctly or efficiently, exposing an implementation bottleneck. An LLM-as-a-judge protocol reliably flags editorial correctness and scales expert-style diagnosis.
Problem Statement
Competitive programming is often evaluated end-to-end (problem → code), which mixes two skills: deriving a correct algorithm (problem solving) and turning that plan into working code (implementation). This conflation hides whether failures are due to bad reasoning or buggy/inefficient code. The authors propose making natural-language editorials an explicit intermediate artifact to separate and measure these two capabilities.
Main Contribution
Introduce an editorial-centric pipeline (problem → editorial → code) that isolates problem solving from implementation.
Curate a dataset of 83 ICPC-style problems (2017–2025) with gold editorials and full official test suites.
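The problem → editorial → code pipeline can be sketched as a small harness that runs one coder under different editorial conditions. This is a minimal illustration, not the authors' code: `Problem`, `run_conditions`, and the `writer`, `coder`, and `run_tests` callables are hypothetical names, and the `w/GoldEd` label is assumed by analogy with the `w/oEd` and `w/GenEd` labels used in the results table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Problem:
    statement: str
    gold_editorial: str  # expert-written plan for the intended solution


def run_conditions(
    problem: Problem,
    writer: Callable[[str], str],                # statement -> generated editorial (hypothetical)
    coder: Callable[[str, Optional[str]], str],  # (statement, editorial or None) -> code (hypothetical)
    run_tests: Callable[[str], bool],            # code -> True if it passes the test suite (hypothetical)
) -> Dict[str, bool]:
    """Run the same coder with no editorial, a generated one, and the gold one."""
    editorials = {
        "w/oEd": None,                           # end-to-end baseline: problem -> code
        "w/GenEd": writer(problem.statement),    # model writes its own plan first
        "w/GoldEd": problem.gold_editorial,      # assumed label for the gold-editorial condition
    }
    return {
        name: run_tests(coder(problem.statement, editorial))
        for name, editorial in editorials.items()
    }
```

Holding the coder fixed and varying only the editorial is what lets a change in outcome be attributed to the plan rather than to the implementation step.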
Key Findings
Providing gold editorials substantially raises correctness across models.
Model-generated editorials rarely match the benefit of gold editorials and can be unreliable.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| pass@1 (overall, w/oEd) | 23.2% | — | — | 83 problems | Table 1 overall average w/oEd | Table 1 |
| pass@1 (overall, w/GenEd) | 23.1% | w/oEd 23.2% | −0.1pp | 83 problems | Table 1 overall average w/GenEd | Table 1 |
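The Delta column is a difference in percentage points, not a relative change. A minimal sketch of that arithmetic, assuming one attempt per problem so that pass@1 is simply the share of problems solved on the first try; `pass_at_1` and `delta_pp` are hypothetical helper names.

```python
from typing import Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Percentage of problems whose first submission passed all tests."""
    if not first_attempt_passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(first_attempt_passed) / len(first_attempt_passed)


def delta_pp(value: float, baseline: float) -> float:
    """Difference between two percentages, in percentage points (1 decimal)."""
    return round(value - baseline, 1)


# Using the two overall averages reported in the table above:
print(delta_pp(23.1, 23.2))  # -0.1
```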
What To Try In 7 Days
Add an editorial (text plan) stage to your internal evaluation: generate or load canonical editorials and compare pass@1 with/without them.
Use an LLM judge (e.g., Gemini-3-Pro) to flag bad editorials before conditioning code generation.
Test writer-coder pairing: generate editorials with your best reasoning model and feed them to your standard code model to measure gains.
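The judge and writer-coder steps above can be combined into a simple gate: only condition the coder on an editorial the judge accepts. This is a sketch under stated assumptions. `writer` and `judge` are hypothetical callables wrapping whatever models you use, and the retry-then-fall-back policy is an illustrative design choice, not something the paper prescribes.

```python
from typing import Callable, Optional


def gated_editorial(
    statement: str,
    writer: Callable[[str], str],       # statement -> editorial (hypothetical wrapper around a reasoning model)
    judge: Callable[[str, str], bool],  # (statement, editorial) -> True if judged correct (hypothetical)
    max_attempts: int = 3,              # assumed retry budget
) -> Optional[str]:
    """Return the first editorial the judge accepts, or None if none is accepted."""
    for _ in range(max_attempts):
        editorial = writer(statement)
        if judge(statement, editorial):
            return editorial
    # No accepted editorial: the caller can fall back to generating code without one.
    return None
```

Returning `None` rather than the last rejected draft keeps a flagged editorial from steering the coder toward a wrong plan.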
Risks & Boundaries
Limitations
Dataset is small (83 problems) and drawn from seven contests; results may not generalize to other problem pools.
Evaluation primarily uses C++; the authors note that Python performs worse, so results are sensitive to the choice of language.
When Not To Use
When evaluating general software engineering tasks rather than contest-style algorithmic problems.
When you need large-scale, diverse benchmark coverage beyond curated ICPC-style problems.
Failure Modes
Model hallucinates constraints in its editorial, which leads to a wrong algorithm; code generation conditioned on that editorial is then locked into the bad plan.
Models implement correct algorithms poorly: subtle bugs such as off-by-one errors lead to Wrong Answer (WA) verdicts.
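The two failure modes map onto a simple attribution rule once each attempt has an editorial verdict and a test verdict. A minimal sketch, assuming both verdicts are available as booleans; `Diagnosis` and `diagnose` are hypothetical names, and the rule is a simplification: when the editorial is wrong, the failure is charged to problem solving even though the code may have additional bugs.

```python
from enum import Enum


class Diagnosis(Enum):
    SOLVED = "solved"
    PROBLEM_SOLVING = "problem-solving failure"   # the plan itself was wrong
    IMPLEMENTATION = "implementation failure"     # the plan was right, the code was not


def diagnose(editorial_correct: bool, code_passed: bool) -> Diagnosis:
    """Attribute a failed attempt to the plan or to the code."""
    if code_passed:
        return Diagnosis.SOLVED
    if editorial_correct:
        # Correct plan but failing code: the implementation bottleneck.
        return Diagnosis.IMPLEMENTATION
    return Diagnosis.PROBLEM_SOLVING
```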

