Overview
Production Readiness: 0.7
Novelty Score: 0.6
Cost Impact Score: 0.5
Citation Count: 17
Why It Matters For Business
SCoT is a prompt-only change that improved benchmark Pass@1 and human-rated code quality in the paper's experiments, so product teams may get better auto-generated code without retraining a model.
Summary TLDR
The paper introduces Structured Chain-of-Thought (SCoT) prompting: the LLM first writes a short, structured plan built from program constructs (sequence, branch, loop, plus an input/output signature) and then generates code from that plan. On three benchmarks (HumanEval, MBPP, MBCPP) and two LLMs (ChatGPT, Codex), SCoT prompting improves Pass@1 over standard Chain-of-Thought (CoT) prompting by up to 13.79% (relative). Human reviewers also prefer SCoT outputs on correctness, code smell, and maintainability. SCoT is simple to apply (prompt templates plus examples) and robust to different example seeds and writing styles.
Problem Statement
Chain-of-Thought (CoT) prompting asks models to write natural-language reasoning before answers, but CoT was designed for text reasoning and yields only small gains for code. Code is inherently structured; forcing LLMs to reason in program structures may produce clearer plans and better code. The paper asks: does a structure-guided intermediate step improve automated code generation accuracy and quality?
Main Contribution
Define Structured Chain-of-Thought (SCoT): intermediate plans built from program structures (sequence, branch, loop) plus an input-output signature.
Design SCoT prompting: two-step prompts (generate SCoT, then generate code from SCoT) and provide prompt templates and example triples.
Run a large-scale evaluation on HumanEval, MBPP, and MBCPP with ChatGPT and Codex, showing consistent Pass@k improvements over CoT and few-shot baselines.
Run ablations that attribute the gains to the program-structure constraints and show robustness to different example seeds and writing styles, plus a human study of output quality.
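The two-step flow in the contributions above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual templates: the prompt wording, the `Example` fields, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    """One hand-written demonstration triple: requirement, SCoT plan, code."""
    requirement: str
    scot: str
    code: str


def build_scot_prompt(requirement: str, examples: List[Example]) -> str:
    """Step 1 prompt: ask for a structured plan only (wording is hypothetical)."""
    parts = [
        "Write a plan for each requirement. Start with the Input/Output "
        "signature, then use only sequence, branch (if/else), and loop "
        "(for/while) steps."
    ]
    for ex in examples:
        parts.append(f"Requirement:\n{ex.requirement}\nPlan:\n{ex.scot}")
    # The final block is left open so the model completes the plan.
    parts.append(f"Requirement:\n{requirement}\nPlan:")
    return "\n\n".join(parts)


def build_code_prompt(requirement: str, scot: str, examples: List[Example]) -> str:
    """Step 2 prompt: ask for code that implements the given plan."""
    parts = ["Implement each requirement by following its plan."]
    for ex in examples:
        parts.append(
            f"Requirement:\n{ex.requirement}\nPlan:\n{ex.scot}\nCode:\n{ex.code}"
        )
    # The final block is left open so the model completes the code.
    parts.append(f"Requirement:\n{requirement}\nPlan:\n{scot}\nCode:")
    return "\n\n".join(parts)


def generate_with_scot(
    requirement: str,
    examples: List[Example],
    llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Run both steps and return (plan, code)."""
    scot = llm(build_scot_prompt(requirement, examples)).strip()
    code = llm(build_code_prompt(requirement, scot, examples))
    return scot, code
```

Keeping the model call behind a plain callable means the same flow can be pointed at any provider, or at a stub when testing the prompt builders.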
Key Findings
SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.
SCoT improves Pass@1 on MBPP and MBCPP too.
Human reviewers prefer SCoT-generated programs on correctness, code smell, and maintainability.
Program-structure constraints drive most gains; removing them lowers performance noticeably.
SCoT prompting is robust to choice and writing style of example seeds.
Results
HumanEval Pass@1 (ChatGPT)
MBPP Pass@1 (ChatGPT)
MBCPP Pass@1 (ChatGPT)
Human evaluation (mean scores)
Ablation: remove basic structures (HumanEval Pass@1)
Who Should Care
What To Try In 7 Days
Add a short SCoT stage to your prompt flow: include Input/Output, then sequence/branch/loop bullets.
Run a quick A/B test on a few internal coding tasks to compare Pass@1 and developer preference.
Combine SCoT with your existing reranking or test-based selection pipeline to compound gains.
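A quick A/B comparison and a simple test-based selection step, as suggested in the list above, might look like the sketch below. The function names and the assert-based test format are assumptions for illustration, and the sketch uses `exec`, which runs arbitrary code: real model output should be executed in a sandbox with timeouts.

```python
from typing import Dict, List, Optional


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Run a candidate, then its assert-based tests, in a fresh namespace.

    WARNING: exec runs arbitrary code and this function has no timeout.
    Use an isolated sandbox for untrusted model output.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_src, namespace)
        exec(test_src, namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed asserts all count as a fail.
        return False
    return True


def pass_at_1(candidates: List[str], tests: List[str]) -> float:
    """Fraction of tasks whose single candidate passes that task's tests."""
    if len(candidates) != len(tests):
        raise ValueError("need exactly one candidate per task")
    if not candidates:
        return 0.0
    passed = sum(passes_tests(c, t) for c, t in zip(candidates, tests))
    return passed / len(candidates)


def select_first_passing(candidates: List[str], test_src: str) -> Optional[str]:
    """Test-based selection: return the first candidate that passes, else None."""
    for candidate in candidates:
        if passes_tests(candidate, test_src):
            return candidate
    return None


if __name__ == "__main__":
    tests = ["assert add(2, 3) == 5", "assert is_even(4) and not is_even(3)"]
    # Hypothetical outputs from two prompt variants on the same two tasks.
    baseline_arm = [
        "def add(a, b):\n    return a - b",
        "def is_even(n):\n    return n % 2 == 0",
    ]
    scot_arm = [
        "def add(a, b):\n    return a + b",
        "def is_even(n):\n    return n % 2 == 0",
    ]
    print("baseline:", pass_at_1(baseline_arm, tests))
    print("scot:", pass_at_1(scot_arm, tests))
```

With only a handful of tasks the difference between arms is noisy, so developer preference ratings are worth collecting alongside the pass rate.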
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires manually written SCoT examples for prompt seeds.
- Evaluations use ChatGPT and Codex; results may vary on other models.
- Does not replace reranking/test-execution techniques; complementary but not compared head-to-head in all settings.
- Leakage of benchmark problems into the LLMs' training data cannot be fully ruled out.
When Not To Use
- When you can fully execute and validate candidates and prefer heavy reranking pipelines instead.
- For non-code tasks where program structures are irrelevant.
- If you cannot afford the two-step generation latency in interactive settings.
Failure Modes
- Generated SCoT contains errors or misses steps, which can produce wrong code.
- Ambiguous SCoTs (e.g., unclear scope of loops) lead to implementation bugs.
- Overly abstract SCoTs may omit needed implementation details and reduce pass rates.
Core Entities
Models
- ChatGPT (gpt-3.5-turbo-0301)
- Codex (code-davinci-002)
Metrics
- Pass@k (unbiased)
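For reference, the sketch below assumes "unbiased" refers to the commonly used Pass@k estimator introduced alongside HumanEval: with `n` samples generated for a problem, of which `c` pass the tests, the per-problem estimate is 1 - C(n - c, k) / C(n, k), and the benchmark score is the mean over problems. The function name and argument checks are choices made for this sketch.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all tests
    k: budget of samples the metric considers
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list, k: int) -> float:
    """Average the estimate over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For `k = 1` the estimate reduces to `c / n`, the fraction of samples that pass.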
Datasets
- HumanEval
- MBPP
- MBCPP
Benchmarks
- HumanEval
- MBPP
- MBCPP
Context Entities
Models
- CodeGen
- StarCoder

