Make LLMs think in program structures to improve code generation

May 11, 2023 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

17

Authors

Jia Li, Ge Li, Yongmin Li, Zhi Jin

Why It Matters For Business

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Summary TLDR

The paper introduces Structured Chain-of-Thought (SCoT): ask an LLM to first produce a short, structured plan using program constructs (sequence, branch, loop, plus an input/output signature) and then generate code from that plan. On three benchmarks (HumanEval, MBPP, MBCPP) and two LLMs (ChatGPT, Codex), SCoT prompting raises Pass@1 (the share of problems solved by a single generated sample) by up to 13.79% relative to standard Chain-of-Thought (CoT) prompting. Human reviewers also prefer SCoT outputs for correctness, code smell, and maintainability. SCoT is simple to apply (prompt templates plus examples) and robust to different example seeds and writing styles.

Problem Statement

Chain-of-Thought (CoT) prompting asks models to write natural-language reasoning before answers, but CoT was designed for text reasoning and yields only small gains for code. Code is inherently structured; forcing LLMs to reason in program structures may produce clearer plans and better code. The paper asks: does a structure-guided intermediate step improve automated code generation accuracy and quality?

Main Contribution

Define Structured Chain-of-Thought (SCoT): intermediate plans built from program structures (sequence, branch, loop) plus an input-output signature.
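To make the definition concrete, here is a minimal sketch of how such a plan could be represented and rendered as text. The `SCoTPlan` class, its field names, and the example task are illustrative assumptions, not the paper's exact format.

```python
from dataclasses import dataclass, field


@dataclass
class SCoTPlan:
    """Hypothetical container for a structured plan (not the paper's schema)."""
    inputs: str
    outputs: str
    steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Signature first, then the steps; indentation marks loop/branch bodies.
        lines = [f"Input: {self.inputs}", f"Output: {self.outputs}"]
        lines.extend(self.steps)
        return "\n".join(lines)


# Example plan for "sum the even numbers in a list" (an assumed toy task).
plan = SCoTPlan(
    inputs="nums: list of int",
    outputs="total: int",
    steps=[
        "total = 0",                 # sequence
        "for each n in nums:",       # loop
        "    if n is even:",         # branch
        "        add n to total",
        "return total",
    ],
)
print(plan.render())
```

The rendered text is what would sit between the requirement and the generated code in a prompt.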

Design SCoT prompting: two-step prompts (generate SCoT, then generate code from SCoT) and provide prompt templates and example triples.
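The two-step flow can be sketched roughly as follows. The template wording, the `scot_generate` helper, and the `fake_llm` stand-in are assumptions made for illustration; the paper's actual templates may differ, and a real setup would replace `fake_llm` with a model call.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual wording may differ.
SCOT_TEMPLATE = (
    "{examples}\n"
    "Requirement:\n{requirement}\n"
    "Write a structured plan using Input/Output, sequence, branch, and loop steps.\n"
)
CODE_TEMPLATE = (
    "{examples}\n"
    "Requirement:\n{requirement}\n"
    "Plan:\n{scot}\n"
    "Implement the plan as code.\n"
)


def scot_generate(requirement: str,
                  llm: Callable[[str], str],
                  plan_examples: str = "",
                  code_examples: str = "") -> tuple[str, str]:
    """Two-step generation: ask for a plan, then ask for code given the plan."""
    scot = llm(SCOT_TEMPLATE.format(examples=plan_examples,
                                    requirement=requirement))
    code = llm(CODE_TEMPLATE.format(examples=code_examples,
                                    requirement=requirement, scot=scot))
    return scot, code


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "Plan:" in prompt:  # second step: the prompt already contains a plan
        return "def add(a, b):\n    return a + b"
    return "Input: a, b: int\nOutput: s: int\ns = a + b\nreturn s"


plan, code = scot_generate("Return the sum of two integers.", fake_llm)
print(plan)
print(code)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client.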

Run large-scale evaluation on HumanEval, MBPP, and MBCPP with ChatGPT and Codex, showing consistent Pass@k improvements over CoT and few-shot baselines.

Conduct ablations and human studies that attribute gains to the program-structure constraints and show robustness to different example seeds and writing styles.

Key Findings

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64)

SCoT improves Pass@1 on MBPP and MBCPP too.

Numbers: MBPP Pass@1 +12.31% relative (41.83 → 46.98); MBCPP Pass@1 +6.63% relative (53.51 → 57.06)
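The Pass@1 percentages quoted for the three benchmarks are relative gains, not absolute point differences. A short check of the arithmetic, using only the CoT and SCoT values reported in this summary (the `relative_gain` helper is an illustrative name):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100


# Pass@1 values (CoT, SCoT) as reported in this summary.
reported = {
    "HumanEval": (53.29, 60.64),
    "MBPP": (41.83, 46.98),
    "MBCPP": (53.51, 57.06),
}
for name, (cot, scot) in reported.items():
    print(f"{name}: +{scot - cot:.2f} abs, +{relative_gain(cot, scot):.2f}% rel")
```

In absolute terms the gains are 7.35, 5.15, and 3.55 points respectively.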

Human reviewers prefer SCoT-generated programs on correctness, code smell, and maintainability.

Numbers: Correctness +15.27%; Code smell +10.66%; Maintainability +15.90%

Program-structure constraints drive most gains; removing them lowers performance noticeably.

Numbers: HumanEval Pass@1 drops from 60.64 to 55.67 without basic structures (−4.97 abs)

SCoT prompting is robust to choice and writing style of example seeds.

Numbers: Pass@1 varies little across seeds/annotators; SCoT remains ~59–61 vs CoT ~51–53

Results

HumanEval Pass@1 (ChatGPT)

Value: SCoT 60.64% vs CoT 53.29%

Baseline: CoT prompting

MBPP Pass@1 (ChatGPT)

Value: SCoT 46.98% vs CoT 41.83%

Baseline: CoT prompting

MBCPP Pass@1 (ChatGPT)

Value: SCoT 57.06% vs CoT 53.51%

Baseline: CoT prompting

Human evaluation (mean scores)

Value: SCoT correctness 1.412 vs CoT 1.225

Baseline: CoT prompting

Ablation: remove basic structures (HumanEval Pass@1)

Value: without basic structures 55.67% vs SCoT 60.64%

Baseline: SCoT prompting

Who Should Care

What To Try In 7 Days

Add a short SCoT stage to your prompt flow: include Input/Output, then sequence/branch/loop bullets.

Run a quick A/B test on a few internal coding tasks to compare Pass@1 and developer preference.

Combine SCoT with your existing reranking or test-based selection pipeline to compound gains.
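One way the test-based selection step might look is sketched below: several generated candidates are executed against a few known input/output pairs and the first one that passes is kept. The `select_by_tests` helper, the candidate strings, and the test cases are all hypothetical, and `exec` on model output should only ever run inside a sandbox.

```python
from typing import Iterable, Optional


def select_by_tests(candidates: Iterable[str],
                    tests: list,
                    func_name: str) -> Optional[str]:
    """Return the first candidate whose function passes every (args, expected) test."""
    for source in candidates:
        namespace: dict = {}
        try:
            exec(source, namespace)  # WARNING: sandbox untrusted code in real use
            func = namespace[func_name]
            if all(func(*args) == expected for args, expected in tests):
                return source
        except Exception:
            continue  # a crashing or malformed candidate counts as a failure
    return None


# Hypothetical candidates, e.g. several samples generated from the same plan.
candidates = [
    "def add(a, b):\n    return a - b",   # buggy sample
    "def add(a, b):\n    return a + b",   # correct sample
]
tests = [((1, 2), 3), ((0, 0), 0)]
chosen = select_by_tests(candidates, tests, "add")
print(chosen)
```

The same loop can double as a small A/B harness: run it over CoT and SCoT candidates for each task and count how often each prompt style yields a passing sample.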

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires manually written SCoT examples for prompt seeds.
  • Evaluations use ChatGPT and Codex; results may vary on other models.
  • Does not replace reranking/test-execution techniques; complementary but not compared head-to-head in all settings.
  • Possible data leakage in LLM training data cannot be fully ruled out.

When Not To Use

  • When you can fully execute and validate candidates and prefer heavy reranking pipelines instead.
  • For non-code tasks where program structures are irrelevant.
  • If you cannot afford the two-step generation latency in interactive settings.

Failure Modes

  • Generated SCoT contains errors or misses steps, which can produce wrong code.
  • Ambiguous SCoTs (e.g., unclear scope of loops) lead to implementation bugs.
  • Overly abstract SCoTs may omit needed implementation details and reduce pass rates.

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo-0301)
  • Codex (code-davinci-002)

Metrics

  • Pass@k (unbiased)
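Assuming "unbiased" refers to the estimator commonly used with HumanEval, Pass@k for one problem is computed from n generated samples of which c pass the tests, as 1 - C(n - c, k) / C(n, k), and then averaged over problems. A minimal sketch (the function name and the sample counts in the example are illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass all tests, k: budget of tries.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example with assumed counts: 20 samples, 5 of them correct.
print(pass_at_k(20, 5, 1))
print(pass_at_k(20, 5, 10))
```

For k = 1 the estimator reduces to c / n, the plain fraction of correct samples.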

Datasets

  • HumanEval
  • MBPP
  • MBCPP

Benchmarks

  • HumanEval
  • MBPP
  • MBCPP

Context Entities

Models

  • CodeGen
  • StarCoder