Make LLMs think in program structures to improve code generation

Overview

Decision Snapshot: Needs Validation

SCoT is an easy prompt-level change with consistent benchmark and human-evaluation gains. The evidence covers two LLMs (ChatGPT and Codex) and three benchmarks, but other model families remain untested.

Citations: 17

Evidence Strength: 0.70

Confidence: 0.88

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Jia Li, Ge Li, Yongmin Li, Zhi Jin

Links

Abstract / PDF

Why It Matters For Business

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Who Should Care

Product Manager, CTO, ML Engineer, Engineering Lead, Data Scientist, Founder

Summary TLDR

The paper introduces Structured Chain-of-Thought (SCoT): ask an LLM to first produce a short, structured plan using program constructs (sequence, branch, loop, plus an input/output signature) and then generate code from that plan. On three benchmarks (HumanEval, MBPP, MBCPP) and two LLMs (ChatGPT, Codex), SCoT prompting raises Pass@1 (the share of tasks solved by a single generated sample) over standard Chain-of-Thought (CoT) prompting by up to 13.79% relative. Human reviewers also prefer SCoT outputs on correctness, code smell, and maintainability. SCoT is simple to apply (prompt templates plus examples) and robust to different example seeds and writing styles.

Problem Statement

Chain-of-Thought (CoT) prompting asks models to write natural-language reasoning before answers, but CoT was designed for text reasoning and yields only small gains for code. Code is inherently structured; forcing LLMs to reason in program structures may produce clearer plans and better code. The paper asks: does a structure-guided intermediate step improve automated code generation accuracy and quality?

Main Contribution

Define Structured Chain-of-Thought (SCoT): intermediate plans built from program structures (sequence, branch, loop) plus an input-output signature.

Design SCoT prompting: two-step prompts (generate SCoT, then generate code from SCoT) and provide prompt templates and example triples.
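The two-step flow described above can be sketched as follows. This is a minimal illustration, not the paper's actual templates: the demonstration pair, the prompt wording, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Hypothetical demonstration pair; the wording is illustrative, not the paper's.
EXAMPLE_REQUIREMENT = "Return the largest sum of any inner list in a list of lists."
EXAMPLE_SCOT = """\
Input: lists: list[list[int]]
Output: best: int
1: Initialize best with the sum of the first inner list
2: for each inner list in lists:
3:     Compute the sum of the inner list
4:     if the sum is larger than best:
5:         Update best with the sum
6: return best"""


def build_scot_prompt(requirement: str) -> str:
    """Step 1 prompt: ask for a structured plan, not code."""
    return (
        "Write a structured plan using an Input/Output signature and "
        "sequence, branch, and loop steps. Do not write code yet.\n\n"
        f"Requirement: {EXAMPLE_REQUIREMENT}\nPlan:\n{EXAMPLE_SCOT}\n\n"
        f"Requirement: {requirement}\nPlan:\n"
    )


def build_code_prompt(requirement: str, scot: str) -> str:
    """Step 2 prompt: ask for code that implements the plan."""
    return (
        "Implement the requirement by following the plan.\n\n"
        f"Requirement: {requirement}\nPlan:\n{scot}\n\nCode:\n"
    )


def generate_with_scot(requirement: str, llm: Callable[[str], str]) -> Tuple[str, str]:
    """Run both steps and return (plan, code)."""
    scot = llm(build_scot_prompt(requirement))           # step 1: structured plan
    code = llm(build_code_prompt(requirement, scot))     # step 2: code from plan
    return scot, code
```

Keeping the model call behind a plain callable means the same flow can be pointed at whichever client library a team already uses, and the intermediate plan can be logged for review.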

Key Findings

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64)

Practical Use: If you use ChatGPT-style models, add SCoT prompts to raise single-sample correctness on HumanEval-like tasks; implement SCoT as a short program-structured plan before code generation.

Evidence Ref: Table 2 (ChatGPT block)

SCoT improves Pass@1 on MBPP and MBCPP too.

Numbers: MBPP Pass@1 +12.31% relative (41.83 → 46.98); MBCPP Pass@1 +6.63% relative (53.51 → 57.06)

Practical Use: SCoT helps across Python and C++ benchmarks; apply it to multi-language code suggestions to get measurable accuracy gains.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HumanEval Pass@1 (ChatGPT)	SCoT 60.64% vs CoT 53.29%	CoT prompting	+7.35 pp (+13.79% relative)	HumanEval	Table 2 reports ChatGPT Pass@1	Table 2
MBPP Pass@1 (ChatGPT)	SCoT 46.98% vs CoT 41.83%	CoT prompting	+5.15 pp (+12.31% relative)	MBPP	Table 2 reports ChatGPT Pass@1	Table 2
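The table reports each gain two ways: as percentage points (pp) and as a percentage relative to the CoT baseline. The short check below recomputes both from the Pass@1 scores quoted on this page; the helper name `deltas` is made up for the illustration.

```python
def deltas(baseline: float, new: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    pp = new - baseline
    relative = 100.0 * pp / baseline
    return round(pp, 2), round(relative, 2)


# Pass@1 scores (CoT, SCoT) for ChatGPT as quoted above.
scores = {
    "HumanEval": (53.29, 60.64),
    "MBPP": (41.83, 46.98),
    "MBCPP": (53.51, 57.06),
}

for name, (cot, scot) in scores.items():
    pp, relative = deltas(cot, scot)
    print(f"{name}: +{pp} pp, +{relative}% relative")
```

The relative figure is always larger than the pp figure here because every baseline is below 100, so it helps to state which one is meant when quoting the improvement.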

What To Try In 7 Days

Add a short SCoT stage to your prompt flow: include Input/Output, then sequence/branch/loop bullets.

Run quick A/B on a few internal coding tasks to compare Pass@1 and developer preference.

Combine SCoT with your existing reranking or test-based selection pipeline to compound gains.
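A quick A/B run along these lines could be scored with a small harness like the one below. It is a sketch under assumptions: each internal task is taken to have assert-based tests written as Python source, one candidate is generated per task and prompt variant, and the helper names are hypothetical. `exec` runs code with no isolation or timeout, so generated code should only be run this way inside a sandbox.

```python
from typing import List


def passes(candidate_src: str, test_src: str) -> bool:
    """Run one candidate and its assert-based tests in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # defines the candidate function(s)
        exec(test_src, namespace)       # raises AssertionError on a failed test
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


def pass_at_1(candidates: List[str], tests: List[str]) -> float:
    """Percentage of tasks whose single candidate passes all of its tests."""
    if len(candidates) != len(tests):
        raise ValueError("need exactly one candidate per task")
    results = [passes(c, t) for c, t in zip(candidates, tests)]
    return 100.0 * sum(results) / len(results)
```

Scoring the CoT and SCoT candidates against the same `tests` list gives two Pass@1 values that can be compared directly, alongside a separate developer-preference survey.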

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Requires manually written SCoT examples for prompt seeds.

Evaluations use ChatGPT and Codex; results may vary on other models.

When Not To Use

When you can already execute and validate every candidate and would rather rely on a heavy reranking pipeline.

For non-code tasks where program structures are irrelevant.

Failure Modes

Generated SCoT contains errors or misses steps, which can produce wrong code.

Ambiguous SCoTs (e.g., unclear scope of loops) lead to implementation bugs.

Core Entities

Models

ChatGPT (gpt-3.5-turbo-0301), Codex (code-davinci-002)

Metrics

Pass@k (unbiased)
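Assuming "unbiased" refers to the estimator commonly used with HumanEval, Pass@k is computed per task from n generated samples of which c pass the tests, as 1 - C(n-c, k) / C(n, k), and then averaged over tasks. A minimal sketch (the function name is chosen for the example):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate that at least one of k samples passes,
    given n generated samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to c / n, so `pass_at_k(20, 5, 1)` is 0.25.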

Datasets

HumanEval, MBPP, MBCPP

Benchmarks

HumanEval, MBPP, MBCPP

Context Entities

Models

CodeGen, StarCoder

