Make LLMs think in program structures to improve code generation

May 11, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

SCoT is an easy prompt-level change with consistent benchmark and human-evaluation gains. The evidence covers two major LLMs and three benchmarks, but other model families have not been tested.

Citations: 17

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Jia Li, Ge Li, Yongmin Li, Zhi Jin

Why It Matters For Business

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Summary TLDR

The paper introduces Structured Chain-of-Thought (SCoT): ask an LLM to first produce a short, structured plan using program constructs (sequence, branch, loop, plus an input/output signature) and then generate code from that plan. On three benchmarks (HumanEval, MBPP, MBCPP) and two LLMs (ChatGPT, Codex), SCoT prompting raises Pass@1, the share of tasks solved by a single generated sample, by up to 13.79% (relative) over standard Chain-of-Thought (CoT) prompting. Human reviewers also prefer SCoT outputs for correctness, code smell, and maintainability. SCoT is simple to apply (prompt templates plus examples) and robust to different example seeds and writing styles.

Problem Statement

Chain-of-Thought (CoT) prompting asks models to write natural-language reasoning before answers, but CoT was designed for text reasoning and yields only small gains for code. Code is inherently structured; forcing LLMs to reason in program structures may produce clearer plans and better code. The paper asks: does a structure-guided intermediate step improve automated code generation accuracy and quality?

Main Contribution

Define Structured Chain-of-Thought (SCoT): intermediate plans built from program structures (sequence, branch, loop) plus an input-output signature.

Design SCoT prompting: two-step prompts (generate SCoT, then generate code from SCoT) and provide prompt templates and example triples.
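The two-step flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's templates: the instruction wording, the `Requirement:/Plan:/Code:` layout, the sample triple in `EXAMPLES`, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the sketch.

```python
from typing import Callable, List, Tuple

# One few-shot example: (requirement, SCoT plan, code).
Example = Tuple[str, str, str]

# Hypothetical instruction wording; the paper's actual templates may differ.
SCOT_INSTRUCTION = (
    "Write a structured plan for the requirement. Start with an Input/Output "
    "signature, then use only sequence, branch (if/else), and loop steps."
)
CODE_INSTRUCTION = "Implement the requirement, following the plan step by step."

# Hand-written illustrative triple, not taken from the paper.
EXAMPLES: List[Example] = [(
    "Return the largest sum among a list of lists of numbers.",
    "Input: lists: list[list[int]]\n"
    "Output: best: int\n"
    "1: Initialize best with the sum of the first list\n"
    "2: for each list in lists:\n"
    "3:     if its sum is greater than best:\n"
    "4:         update best\n"
    "5: return best",
    "def max_sum(lists):\n"
    "    best = sum(lists[0])\n"
    "    for xs in lists:\n"
    "        if sum(xs) > best:\n"
    "            best = sum(xs)\n"
    "    return best",
)]


def build_scot_prompt(requirement: str, examples: List[Example]) -> str:
    """Step 1 prompt: show (requirement, plan) pairs, then ask for a new plan."""
    parts = [SCOT_INSTRUCTION]
    for req, plan, _code in examples:
        parts.append(f"Requirement:\n{req}\nPlan:\n{plan}")
    parts.append(f"Requirement:\n{requirement}\nPlan:")
    return "\n\n".join(parts)


def build_code_prompt(requirement: str, plan: str, examples: List[Example]) -> str:
    """Step 2 prompt: show full triples, then ask for code that follows the plan."""
    parts = [CODE_INSTRUCTION]
    for req, ex_plan, code in examples:
        parts.append(f"Requirement:\n{req}\nPlan:\n{ex_plan}\nCode:\n{code}")
    parts.append(f"Requirement:\n{requirement}\nPlan:\n{plan}\nCode:")
    return "\n\n".join(parts)


def generate_with_scot(
    requirement: str, examples: List[Example], llm: Callable[[str], str]
) -> Tuple[str, str]:
    """Run both steps: generate the plan first, then code conditioned on it."""
    plan = llm(build_scot_prompt(requirement, examples)).strip()
    code = llm(build_code_prompt(requirement, plan, examples)).strip()
    return plan, code
```

Keeping `llm` as a plain callable means the same two prompts can be sent to any chat or completion endpoint, and the intermediate plan stays available for logging or review before the code step runs.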

Key Findings

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29% → SCoT 60.64%)

Practical Use: If you use ChatGPT-style models, add SCoT prompts to raise single-sample (Pass@1) correctness on HumanEval-like tasks; implement SCoT as a short program-structured plan that the model writes before it generates code.

Evidence Ref: Table 2 (ChatGPT block)

SCoT improves Pass@1 on MBPP and MBCPP too.

Numbers: MBPP Pass@1 +12.31% relative (41.83% → 46.98%); MBCPP Pass@1 +6.63% relative (53.51% → 57.06%)

Practical Use: SCoT helps across Python and C++ benchmarks; apply to multi-language code suggestions to get measurable accuracy gains.

Evidence Ref: Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HumanEval Pass@1 (ChatGPT) | SCoT 60.64% vs CoT 53.29% | CoT prompting | +7.35 pp (+13.79% relative) | HumanEval | Table 2 reports ChatGPT Pass@1 | Table 2
MBPP Pass@1 (ChatGPT) | SCoT 46.98% vs CoT 41.83% | CoT prompting | +5.15 pp (+12.31% relative) | MBPP | Table 2 reports ChatGPT Pass@1 | Table 2
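The results mix two kinds of delta: an absolute gain in percentage points (pp) and a relative gain in percent of the baseline. The short check below recomputes both from the ChatGPT Pass@1 values quoted in this summary; the helper name `deltas` is only illustrative.

```python
from typing import Tuple


def deltas(baseline: float, treatment: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = treatment - baseline
    relative_pct = 100.0 * absolute_pp / baseline
    return round(absolute_pp, 2), round(relative_pct, 2)


# ChatGPT Pass@1 values quoted in this summary, as (CoT, SCoT).
reported = {
    "HumanEval": (53.29, 60.64),
    "MBPP": (41.83, 46.98),
    "MBCPP": (53.51, 57.06),
}

for name, (cot, scot) in reported.items():
    pp, rel = deltas(cot, scot)
    print(f"{name}: +{pp:.2f} pp, +{rel:.2f}% relative")
# HumanEval: +7.35 pp, +13.79% relative
# MBPP: +5.15 pp, +12.31% relative
# MBCPP: +3.55 pp, +6.63% relative
```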

What To Try In 7 Days

Add a short SCoT stage to your prompt flow: include Input/Output, then sequence/branch/loop bullets.

Run quick A/B on a few internal coding tasks to compare Pass@1 and developer preference.

Combine SCoT with your existing reranking or test-based selection pipeline to compound gains.
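A quick A/B run and a test-based selection step can share one small harness. The sketch below is an assumption about how such a harness might look, not something the paper provides: `passes`, `first_passing`, and `first_sample_pass_rate` are hypothetical helpers, tests are plain `assert` statements, and the candidate strings are hand-written toys rather than model outputs.

```python
from typing import Dict, List, Optional


def passes(code: str, test_code: str) -> bool:
    """Return True if the candidate runs and every assertion in test_code holds.

    Uses exec with no timeout, so run untrusted model output only inside a
    sandbox with resource limits.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)       # define the candidate's functions
        exec(test_code, namespace)  # run the assertions against them
    except Exception:
        return False
    return True


def first_passing(candidates: List[str], test_code: str) -> Optional[str]:
    """Test-based selection: keep the first candidate that passes, else None."""
    for code in candidates:
        if passes(code, test_code):
            return code
    return None


def first_sample_pass_rate(samples: Dict[str, str], tests: Dict[str, str]) -> float:
    """Fraction of tasks whose single sample passes: a plain Pass@1 for one arm."""
    solved = sum(passes(samples[task], tests[task]) for task in tests)
    return solved / len(tests)


# Toy data: one task, one hand-written sample per prompt arm.
tests = {"add": "assert add(2, 3) == 5"}
arm_a = {"add": "def add(a, b):\n    return a - b"}
arm_b = {"add": "def add(a, b):\n    return a + b"}

print(first_sample_pass_rate(arm_a, tests), first_sample_pass_rate(arm_b, tests))
# 0.0 1.0
```

In a real comparison each arm would hold the first sample produced by one prompting strategy on the same task set, and `first_passing` would sit downstream of whichever arm wins.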

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Requires manually written SCoT examples for prompt seeds.

Evaluations use ChatGPT and Codex; results may vary on other models.

When Not To Use

When you can execute and validate every candidate and would rather rely on a heavy test-based reranking pipeline.

For non-code tasks where program structures are irrelevant.

Failure Modes

Generated SCoT contains errors or misses steps, which can produce wrong code.

Ambiguous SCoTs (e.g., unclear scope of loops) lead to implementation bugs.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0301), Codex (code-davinci-002)

Metrics

Pass@k (unbiased)
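This summary does not spell out the estimator. Assuming "unbiased" refers to the estimator commonly used with HumanEval, it draws n samples per task, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). A minimal sketch under that assumption:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task.

    n: samples generated, c: samples that pass all tests, k: sample budget.
    Equals the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of passing samples, c / n.
print(pass_at_k(n=4, c=1, k=2))  # 0.5
```

A benchmark-level score is the mean of this per-task value over all tasks.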

Datasets

HumanEval, MBPP, MBCPP

Benchmarks

HumanEval, MBPP, MBCPP

Context Entities

Models

CodeGen, StarCoder