Separate the algorithm idea from code: use editorials to measure reasoning vs implementation

January 16, 2026 | 7 min

Overview

Decision Snapshot: Needs Validation

The editorial-centric protocol and dataset are straightforward to adopt; the evidence is moderate (83 problems, 19 models) and validated by expert annotations and an LLM judge.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 50%

Authors

Sama Hadhoud, Alaa Elsetohy, Frederikus Hudi, Jan Christian Blaise Cruz, Steven Halim, Alham Fikri Aji

Why It Matters For Business

If you build or benchmark code-generation products, separate reasoning (the algorithm) from implementation: measuring both reduces misdiagnosis and surfaces whether to invest in better planners or in more robust code generation.

Summary TLDR

The paper argues that competitive-programming (CP) evaluation should separate algorithmic problem solving (the idea) from implementation (the code). The authors build an editorial-centric pipeline and a dataset of 83 ICPC-style problems with expert gold editorials and judge test suites. Across 19 LLMs, they show gold editorials boost pass@1 substantially (overall +14.5 percentage points) while self-generated editorials give small or unreliable gains. Even with correct editorials, many models still fail to implement solutions correctly or efficiently, exposing an implementation bottleneck. An LLM-as-a-judge protocol reliably flags editorial correctness and scales expert-style diagnosis.

Problem Statement

Competitive programming is often evaluated end-to-end (problem → code), which mixes two skills: deriving a correct algorithm (problem solving) and turning that plan into working code (implementation). This conflation hides whether failures are due to bad reasoning or buggy/inefficient code. The authors propose making natural-language editorials an explicit intermediate artifact to separate and measure these two capabilities.

Main Contribution

Introduce an editorial-centric pipeline (problem → editorial → code) that isolates problem solving from implementation.

Curate a dataset of 83 ICPC-style problems (2017–2025) with gold editorials and full official test suites.
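The problem → editorial → code pipeline and its three evaluation conditions (w/oEd, w/GenEd, w/GoldEd) can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the `Writer` and `Coder` callables, the prompt wording, and the function names are all assumptions made for this example.

```python
from typing import Callable, Optional

# Hypothetical interfaces (not from the paper): a writer maps a problem
# statement to an editorial, a coder maps a prompt to source code.
Writer = Callable[[str], str]
Coder = Callable[[str], str]


def build_prompt(problem: str, editorial: Optional[str]) -> str:
    """Assemble the coder prompt, optionally conditioned on an editorial."""
    if editorial is None:
        return f"Problem:\n{problem}\n\nWrite a solution."
    return (
        f"Problem:\n{problem}\n\nEditorial:\n{editorial}\n\n"
        "Implement the editorial."
    )


def solve(
    problem: str,
    coder: Coder,
    condition: str,
    writer: Optional[Writer] = None,
    gold_editorial: Optional[str] = None,
) -> str:
    """Generate code under one condition: w/oEd, w/GenEd, or w/GoldEd."""
    if condition == "w/oEd":
        editorial = None  # end-to-end: problem -> code
    elif condition == "w/GenEd":
        if writer is None:
            raise ValueError("w/GenEd needs a writer model")
        editorial = writer(problem)  # model-written plan
    elif condition == "w/GoldEd":
        if gold_editorial is None:
            raise ValueError("w/GoldEd needs a gold editorial")
        editorial = gold_editorial  # expert-written plan
    else:
        raise ValueError(f"unknown condition: {condition}")
    return coder(build_prompt(problem, editorial))
```

Running the same coder under all three conditions on the same problems is what lets a drop or gain be attributed to the plan rather than to the implementation step.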

Key Findings

Providing gold editorials substantially raises correctness across models.

Numbers: Overall pass@1: 23.2% → 37.7% (+14.5pp)

Practical Use: When evaluating models, measure implementation separately by supplying correct plans; many failures are solved by giving a correct editorial.

Evidence Ref: Table 1, Overall Avg

Model-generated editorials rarely match the benefit of gold editorials and can be unreliable.

Numbers: Overall pass@1: 23.2% → 23.1% (−0.1pp)

Practical Use: Do not assume a model-written reasoning step will improve final code; validate editorial quality before conditioning code generation on it.

Evidence Ref: Table 1, Overall Avg

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
pass@1 (overall, w/oEd) | 23.2% | n/a | n/a | 83 problems | Table 1 overall average w/oEd | Table 1
pass@1 (overall, w/GenEd) | 23.1% (−0.1pp) | w/oEd 23.2% | −0.1pp | 83 problems | Table 1 overall average w/GenEd | Table 1
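The deltas here are percentage-point differences, not relative changes, and the two are easy to confuse. The small sketch below shows both calculations. The helper names are made up for this example, and the relative change is derived arithmetic rather than a figure reported in this summary.

```python
def delta_pp(baseline_pct: float, value_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


def relative_change_pct(baseline_pct: float, value_pct: float) -> float:
    """Change relative to the baseline, in percent, rounded to one decimal."""
    return round(100 * (value_pct - baseline_pct) / baseline_pct, 1)


baseline = 23.2   # overall pass@1 without an editorial (w/oEd)
gold = 37.7       # overall pass@1 with gold editorials (w/GoldEd)
generated = 23.1  # overall pass@1 with self-generated editorials (w/GenEd)

print(delta_pp(baseline, gold))
print(delta_pp(baseline, generated))
print(relative_change_pct(baseline, gold))
```

With the overall averages above, the gold-editorial gain is 14.5 percentage points, which corresponds to a relative improvement of 62.5% over the no-editorial baseline.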

What To Try In 7 Days

Add an editorial (text plan) stage to your internal evaluation: generate or load canonical editorials and compare pass@1 with/without them.

Use an LLM judge (e.g., Gemini-3-Pro) to flag bad editorials before conditioning code generation.

Test writer-coder pairing: generate editorials with your best reasoning model and feed them to your standard code model to measure gains.
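One way to combine the judge and writer-coder ideas above is to gate the editorial: keep it only if the judge accepts it, and otherwise fall back to plain problem → code generation. The sketch below assumes three hypothetical callables (`writer`, `judge`, `coder`) that wrap whatever models you use. None of them is an API from the paper or from a real library, and the fallback policy is one reasonable choice, not the authors' protocol.

```python
from typing import Callable, Tuple

# Hypothetical interfaces, introduced only for this sketch.
Writer = Callable[[str], str]        # problem -> editorial
Judge = Callable[[str, str], bool]   # (problem, editorial) -> accept?
Coder = Callable[[str], str]         # prompt -> source code


def solve_with_gate(
    problem: str, writer: Writer, judge: Judge, coder: Coder
) -> Tuple[str, bool]:
    """Write an editorial, keep it only if the judge accepts it, then code.

    Returns (code, editorial_used) so callers can report pass@1 separately
    for gated-in and gated-out problems.
    """
    editorial = writer(problem)
    if judge(problem, editorial):
        prompt = (
            f"Problem:\n{problem}\n\nEditorial:\n{editorial}\n\n"
            "Implement the editorial."
        )
        return coder(prompt), True
    # Judge rejected the plan: avoid conditioning the coder on it.
    prompt = f"Problem:\n{problem}\n\nWrite a solution."
    return coder(prompt), False
```

To test writer-coder pairing, pass your strongest reasoning model as `writer` and your standard code model as `coder`, then compare pass@1 against the ungated run on the same problems.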

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Dataset is small (83 problems) and drawn from seven contests; results may not generalize to other problem pools.

Evaluation primarily uses C++; the authors note that models perform worse in Python, so results are sensitive to the target language.

When Not To Use

When evaluating general software engineering tasks rather than contest-style algorithmic problems.

When you need large-scale, diverse benchmark coverage beyond curated ICPC-style problems.

Failure Modes

The model hallucinates constraints in its editorial, producing a wrong algorithm; code generation conditioned on that editorial is then locked into the bad plan.

Models implement correct algorithms poorly: subtle bugs such as off-by-one errors lead to Wrong Answer (WA) verdicts.

Core Entities

Models

GPT-5, O3, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Opus 4, Claude Sonnet 4, GPT-4.1, GPT-4o, GPT-OSS-120B, GPT-OSS-20B, DeepSeek-R1, DeepSeek-V3, Qwen3-8B, Qwen3-Coder-480B-A35B, Kimi-K2, OlympicCoder-7B, Llama-3.1-405B, Llama-3.3-70B, Gemma-3-27B

Metrics

pass@1, virtual rank percentile (contest-relative), failure breakdown (WA/TLE/RTE/CE/MLE), Accuracy
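As a rough illustration of how pass@1 and the failure breakdown relate, the sketch below tallies one judge verdict per problem. It assumes a single submission per problem and an "AC" (accepted) label for passing runs; the "AC" label and the `summarize` helper are assumptions for this example, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List

# Failure verdicts named in the metrics list, plus an assumed "AC" for accepted.
FAILURES = ("WA", "TLE", "RTE", "CE", "MLE")
VERDICTS = ("AC",) + FAILURES


def summarize(verdicts: List[str]) -> Dict[str, object]:
    """Compute pass@1 and per-verdict failure counts from one verdict per problem."""
    counts = Counter(verdicts)
    unknown = set(counts) - set(VERDICTS)
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = len(verdicts)
    # With one attempt per problem, pass@1 is the fraction of accepted problems.
    pass_at_1 = counts["AC"] / total if total else 0.0
    failures = {v: counts[v] for v in FAILURES}
    return {"pass@1": pass_at_1, "failures": failures}


report = summarize(["AC", "WA", "TLE", "AC"])
print(report["pass@1"], report["failures"])
```

Comparing the failure counts between the w/oEd and w/GoldEd runs shows which verdict types persist even when the plan is correct, which is where the implementation bottleneck shows up.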

Datasets

83 ICPC-style problems (2017–2025) with gold editorials and full official test suites

Benchmarks

Editorial-centric CP evaluation (w/oEd, w/GenEd, w/GoldEd)