QualityFlow: use an LLM 'imagined execution' checker to keep correct code and reach SOTA on code benchmarks

January 20, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

2

Authors

Yaojie Hu, Qiang Zhou, Qihong Chen, Xiaopeng Li, Linbo Liu, Dejiao Zhang, Amit Kachroo, Talha Oz, Omer Tripp

Links

Abstract / PDF

Why It Matters For Business

Adding a reliable LLM-based checker can raise correct-code yield substantially and reduce wasted debugging cycles; this reduces human review time and increases throughput for code-synthesis features.

Summary TLDR

QualityFlow is a multi-agent pipeline for code generation that adds a controller agent (Quality Checker) which uses an LLM to "imagine" program execution step by step and accepts only programs that match the visible unit tests. The workflow also generates tests, self-debugs, clarifies misunderstandings, and can revert to the original draft. On MBPP and HumanEval variants, QualityFlow reaches new or competitive state-of-the-art pass@1 results (e.g., MBPP 94.2% with Sonnet). Key strengths: very high quality-check precision and recall, filtering of bad synthesized tests, and improved final accuracy with fewer harmful self-debugging edits.

Problem Statement

Given a natural-language programming problem and a set of visible unit tests, generate a correct program that passes the tests. Current multi-agent/self-debug approaches assume visible tests are reliable and may be misled by incorrectly synthesized tests or wander into damaging edit loops. The paper proposes a controller that predicts test conformity and directs the workflow to accept, continue, clarify, or revert.
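The accept/continue/clarify/revert control flow can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: the `Stages` container, the `run_workflow` function, the stage ordering, and the `max_debug_rounds` budget are all assumptions made for the example, and each stage is a plain callable standing in for an LLM agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stages:
    """Hypothetical stand-ins for the LLM agents (names are assumptions)."""
    generate: Callable[[str], str]            # problem -> program
    check: Callable[[str, List[str]], bool]   # program, tests -> accept?
    debug: Callable[[str, List[str]], str]    # program, tests -> revised program
    clarify: Callable[[str], str]             # problem -> restated problem


def run_workflow(problem: str, tests: List[str], stages: Stages,
                 max_debug_rounds: int = 2) -> Tuple[str, str]:
    """Return (program, decision). The stage order here is assumed."""
    draft = stages.generate(problem)
    if stages.check(draft, tests):
        return draft, "accept"                # early accept skips later agents

    candidate = draft
    for _ in range(max_debug_rounds):         # continue: try self-debugging
        candidate = stages.debug(candidate, tests)
        if stages.check(candidate, tests):
            return candidate, "accept_after_debug"

    # clarify: restate the problem and regenerate once
    clarified = stages.generate(stages.clarify(problem))
    if stages.check(clarified, tests):
        return clarified, "accept_after_clarify"

    return draft, "revert"                    # nothing accepted: keep the original draft
```

Returning the decision label alongside the program makes it easy to log how often each branch fires, which is useful when judging whether the later, more expensive stages earn their cost.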

Main Contribution

QualityFlow: a dynamic agentic workflow that coordinates code generation, test design, self-debugging, clarification and a central Quality Checker.

Imagined Execution: a chain-of-thought prompting method where an LLM emulates program execution on tests to decide correctness.

Test Quality Checker: an LLM-based filter that rejects incorrect synthesized unit tests to reduce bad self-debugging feedback.

Diversified Prompting: run multiple prompt variants in parallel and rely on the Quality Checker to select a correct candidate.
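To make Imagined Execution concrete, the sketch below builds a step-by-step emulation prompt for each visible test and parses a final verdict line. The prompt wording, the `VERDICT:` convention, and the `llm` callable are assumptions for illustration; the paper's actual prompts may differ.

```python
from typing import Callable, List

# Assumed prompt template; not taken from the paper.
PROMPT = """You are emulating a Python interpreter.

Program:
{program}

Test:
{test}

Step through the program line by line on this test, tracking variable values.
Finish with exactly one line: VERDICT: PASS or VERDICT: FAIL
"""


def parse_verdict(reply: str) -> bool:
    """Read the last VERDICT line; treat a missing verdict as a rejection."""
    for line in reversed(reply.strip().splitlines()):
        line = line.strip().upper()
        if line.startswith("VERDICT:"):
            return line.split(":", 1)[1].strip() == "PASS"
    return False


def imagined_execution_accepts(program: str, tests: List[str],
                               llm: Callable[[str], str]) -> bool:
    """Accept only if the emulated run passes every visible test."""
    if not tests:
        return False  # no visible tests: nothing to check against
    return all(
        parse_verdict(llm(PROMPT.format(program=program, test=test)))
        for test in tests
    )
```

Treating an unparseable reply as a rejection is a conservative choice: it trades some recall for precision, so a malformed response cannot cause a program to be accepted.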

Key Findings

QualityFlow reaches 94.2% pass@1 on MBPP with Sonnet, 4.8 percentage points above the prior reported SOTA.

Numbers: MBPP pass@1 = 94.2% (QualityFlow, Sonnet); prior SOTA 89.4% (Table 2)

The Code Quality Checker (Imagined Execution) is highly accurate: ~98% precision and ~98% recall on MBPP for accepted programs.

Numbers: Precision ≈98%, Recall ≈98% on MBPP (Table 4)

Imagined Execution outperforms a naive yes/no critic and is critical for workflow gains: replacing it drops workflow pass@1 from 94.2% to 78.8% on MBPP.

Numbers: Workflow pass@1: 94.2% (Imagined Execution) vs 78.8% (Yes/No baseline) on MBPP (Table 5)

LLM test synthesis is noisy: ~62% of LLM-synthesized tests are incorrect on MBPP (Sonnet), but the Test Quality Checker recalls ~79% of those incorrect tests.

Numbers: 62.25% of synthesized tests incorrect; TQC recall 79.13% (MBPP, Sonnet) (Table 6)

Test Quality Checker (TQC) gives modest overall gains and can hurt with weaker LLMs: it adds 0.8% on MBPP with Sonnet but can reduce performance with Opus.

Numbers: MBPP +0.8% with Sonnet; some Opus settings drop (Table 7)
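The precision and recall figures above treat "accept" as the positive class. A minimal sketch of that computation follows, assuming each record pairs the checker's decision with ground truth from the hidden tests; the function name and the toy data are illustrative, not from the paper.

```python
from typing import Iterable, Tuple


def precision_recall(decisions: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Each item is (checker_accepted, program_actually_correct)."""
    records = list(decisions)
    tp = sum(1 for accepted, correct in records if accepted and correct)
    fp = sum(1 for accepted, correct in records if accepted and not correct)
    fn = sum(1 for accepted, correct in records if not accepted and correct)
    # Precision: of the accepted programs, how many were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the correct programs, how many were accepted.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 2 true accepts, 1 false accept, 1 missed correct program, 1 true reject.
toy = [(True, True), (True, True), (True, False), (False, True), (False, False)]
p, r = precision_recall(toy)
```

The same function applies to the Test Quality Checker if "rejected as incorrect" is treated as the positive class, which is how a recall on incorrect tests would be measured.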

Results

pass@1 (MBPP, Sonnet)

Value: 94.2%

Baseline: DeepSeek 89.4%

pass@1 (MBPP-EvalPlus, Sonnet)

Value: 79.89%

Baseline: DeepSeek 76.2%

pass@1 (HumanEval, Sonnet)

Value: 97.56% (standard), 98.78% (relaxed)

Baseline: LDB 98.2% (prior reported SOTA)

Code Quality Checker (precision/recall)

Value: Precision ≈98%, Recall ≈98% (MBPP)

Synthesized-test error rate

Value: ≈62% incorrect tests (MBPP, Sonnet)

Test Quality Checker (recall on incorrect tests)

Value: ≈79% recall (MBPP, Sonnet)

Impact of removing CQC

Value: pass@1 drops ~14% on MBPP

Baseline: QualityFlow full

Who Should Care

What To Try In 7 Days

Run a simple emulated-execution prompt (chain-of-thought) to validate a subset of generated functions against visible tests.

Add a validator that rejects generated unit tests before automated debugging and measure net pass@1 change on a held-out problem set.

Experiment with 3–6 prompt variants (diversified prompting) and pick the accepted candidate using the imagined-execution checker.
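The diversified-prompting experiment can start from something as small as the sketch below: generate one candidate per prompt variant in parallel, then return the first one the checker accepts. The `generate` and `checker` callables, the `{problem}` placeholder, and the fall-through to `None` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def select_candidate(problem: str, tests: List[str], prompt_variants: List[str],
                     generate: Callable[[str], str],
                     checker: Callable[[str, List[str]], bool]) -> Optional[str]:
    """Return the first accepted candidate, or None if the checker rejects all."""
    if not prompt_variants:
        return None
    prompts = [variant.format(problem=problem) for variant in prompt_variants]
    # LLM calls are I/O-bound, so threads are enough to run them concurrently.
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        candidates = list(pool.map(generate, prompts))  # order matches prompt_variants
    for candidate in candidates:
        if checker(candidate, tests):
            return candidate
    return None  # caller decides what to do next (e.g., self-debug or revert)


# Example variants; the wording is hypothetical.
VARIANTS = [
    "Solve: {problem}",
    "Think step by step, then solve: {problem}",
    "List edge cases first, then solve: {problem}",
]
```

Passing the imagined-execution checker as `checker` keeps the selection step free of real code execution, which matches the setting where only visible tests and an LLM are available.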

Agent Features

Memory

  • short-term context per workflow attempt (tests, debug traces)

Planning

  • dynamic control flow: accept/continue/clarify/revert
  • revert-to-original when trajectory fails

Tool Use

  • Python interpreter (relaxed setting)
  • LLM chain-of-thought for imagined execution

Frameworks

  • Diversified Prompting
  • Imagined Execution

Is Agentic

true

Architectures

  • multi-agent workflow
  • parallel diversified generators

Collaboration

  • Code Generator
  • Test Designer
  • Self-Debugger
  • Problem Clarifier
  • Quality Checker (controller)

Optimization Features

Token Efficiency

  • diversified prompts increase token use but may save later steps when accepted

Inference Optimization

  • early accept to skip later expensive agents

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations use Claude-family LLMs; results may not transfer to weaker or different backend models.
  • Requires visible unit tests; not applicable when tests are unavailable or sparse.
  • Test Quality Checker is imperfect (≈79% recall on incorrect tests) and can sometimes reduce accuracy with weaker LLMs.
  • Imagined Execution can itself make reasoning errors and misclassify some programs (examples shown).

When Not To Use

  • When no reliable unit tests exist for the target problems.
  • When only weak LLM backends are available (TQC may hurt performance).
  • When latency or token cost must be minimal and extra agent steps are unaffordable.

Failure Modes

  • Imagined Execution produces incorrect emulated outputs and accepts/rejects wrongly (paper shows examples).
  • Synthesized tests are incorrect and, if not filtered, mislead self-debugging into worse code.
  • Clarifier/Restart may not recover when initial generator and debugger jointly converge to a wrong interpretation.

Core Entities

Models

  • Claude Sonnet-3.5-v2
  • Claude Opus-3
  • DeepSeek (as prior baseline)

Metrics

  • pass@1
  • pass@5
  • Accuracy
  • precision
  • recall

Datasets

  • MBPP
  • HumanEval
  • MBPP-EvalPlus
  • HumanEval-EvalPlus

Benchmarks

  • MBPP
  • HumanEval
  • MBPP-EvalPlus
  • HumanEval-EvalPlus