QualityFlow: use an LLM 'imagined execution' checker to keep correct code and reach SOTA on code benchmarks

January 20, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

Results are measured on standard public benchmarks with ablations and confusion matrices; gains depend on using strong LLM backends and visible unit tests.

Citations: 2

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 70%

Authors

Yaojie Hu, Qiang Zhou, Qihong Chen, Xiaopeng Li, Linbo Liu, Dejiao Zhang, Amit Kachroo, Talha Oz, Omer Tripp

Why It Matters For Business

Adding a reliable LLM-based checker can raise correct-code yield substantially and reduce wasted debugging cycles; this reduces human review time and increases throughput for code-synthesis features.

Summary TLDR

QualityFlow is a multi-agent pipeline for code generation that adds a controller agent, the Quality Checker, which uses an LLM to "imagine" program execution step by step and accepts only programs it predicts will pass the visible unit tests. The workflow also generates tests, self-debugs, clarifies misunderstandings of the problem, and can revert to the original draft. On MBPP and HumanEval variants, QualityFlow reaches new or competitive state-of-the-art pass@1 results (e.g., MBPP 94.2% with Sonnet). Key strengths: high quality-check precision and recall (about 98% each on MBPP), filtering of bad synthesized tests, and improved final accuracy while limiting harmful self-debugging edits.

Problem Statement

Given a natural-language programming problem and a set of visible unit tests, generate a correct program that passes the tests. Current multi-agent/self-debug approaches assume visible tests are reliable and may be misled by incorrectly synthesized tests or wander into damaging edit loops. The paper proposes a controller that predicts test conformity and directs the workflow to accept, continue, clarify, or revert.
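The accept/continue/revert control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `run_workflow`, `stages`, and `checker` are hypothetical names, each stage is assumed to take the previous candidate as input, and the agents are stubbed with plain functions instead of LLM calls.

```python
from typing import Callable, List


def run_workflow(
    draft: str,
    stages: List[Callable[[str], str]],
    checker: Callable[[str], bool],
) -> str:
    """Return the first candidate the checker accepts, else the original draft."""
    if checker(draft):
        return draft  # accept: skip every later stage
    candidate = draft
    for stage in stages:  # continue: e.g. self-debug, then clarify
        candidate = stage(candidate)
        if checker(candidate):
            return candidate
    return draft  # revert: no candidate passed the check


# Usage with stub agents (hypothetical stand-ins for LLM calls).
fixed = run_workflow(
    draft="def add(a, b): return a - b",
    stages=[lambda program: program.replace("a - b", "a + b")],
    checker=lambda program: "a + b" in program,
)
print(fixed)
```

Returning the untouched draft when nothing is accepted is what keeps a failed debugging trajectory from making the final answer worse than the starting point.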

Main Contribution

QualityFlow: a dynamic agentic workflow that coordinates code generation, test design, self-debugging, clarification and a central Quality Checker.

Imagined Execution: a chain-of-thought prompting method where an LLM emulates program execution on tests to decide correctness.
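A rough sketch of what an imagined-execution check might look like in practice. The prompt wording, the `VERDICT:` convention, and the names `build_imagined_execution_prompt`, `parse_verdict`, `program_accepted`, and `ask_llm` are all assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
from typing import Callable, List


def build_imagined_execution_prompt(program: str, test: str) -> str:
    """Ask the model to trace the program by hand instead of running it."""
    return (
        "Emulate a Python interpreter.\n"
        "Trace the program step by step on the test, tracking variable values.\n"
        "End with a final line that is exactly 'VERDICT: PASS' or 'VERDICT: FAIL'.\n\n"
        f"Program:\n{program}\n\nTest:\n{test}\n"
    )


def parse_verdict(llm_reply: str) -> bool:
    """Conservative parse: accept only an explicit PASS on the last non-empty line."""
    lines = [ln.strip() for ln in llm_reply.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].upper() == "VERDICT: PASS"


def program_accepted(
    program: str, tests: List[str], ask_llm: Callable[[str], str]
) -> bool:
    """Accept only if there is at least one test and every test is predicted to pass."""
    if not tests:
        return False
    return all(
        parse_verdict(ask_llm(build_imagined_execution_prompt(program, t)))
        for t in tests
    )


# Usage with a stubbed model reply (hypothetical; a real call would hit an LLM API).
stub = lambda prompt: "x = 1 + 2 gives 3, the assertion holds.\nVERDICT: PASS"
print(program_accepted("def add(a, b): return a + b", ["assert add(1, 2) == 3"], stub))
```

Treating anything other than an explicit PASS as a rejection biases the checker toward precision, which matters when acceptance ends the workflow early.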

Key Findings

QualityFlow reaches 94.2% pass@1 on MBPP with the Sonnet LLM, 4.8 percentage points above the prior reported SOTA of 89.4%.

Numbers: MBPP pass@1 = 94.2% (QualityFlow Sonnet); prior SOTA 89.4% (Table 2)

Practical Use: If you run a multi-agent code pipeline, adding a high-quality LLM checker can substantially raise single-shot pass@1 on MBPP-scale tasks.

Evidence Ref: Table 2

The Code Quality Checker (Imagined Execution) is highly accurate: ~98% precision and ~98% recall on MBPP for accepted programs.

Numbers: Precision ≈98%, Recall ≈98% on MBPP (Table 4)

Practical Use: Using chain-of-thought emulated execution to check programs against the visible tests is reliable enough to let you accept correct outputs early and skip risky self-debug cycles.

Evidence Ref: Table 4
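For readers reproducing this kind of measurement on their own data, precision and recall follow directly from confusion-matrix counts, where "positive" means the checker accepted the program. The counts in the usage lines are made up for illustration and are not the paper's Table 4 values.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """tp: accepted and correct; fp: accepted but wrong; fn: rejected but correct."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only to show the arithmetic.
precision, recall = precision_recall(tp=98, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means wrong programs slip through as final answers; low recall means correct programs are sent on to later stages that may damage them.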

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
pass@1 (MBPP, Sonnet) | 94.2% | DeepSeek 89.4% | +4.8% | MBPP | QualityFlow (Sonnet) 94.2% vs prior SOTA 89.4% (Table 2) | Table 2
pass@1 (MBPP-EvalPlus, Sonnet) | 79.89% | DeepSeek 76.2% | +3.7% | MBPP-EvalPlus | QualityFlow (Sonnet) 79.89% vs prior SOTA 76.2% (Table 2) | Table 2

What To Try In 7 Days

Run a simple emulated-execution prompt (chain-of-thought) to validate a subset of generated functions against visible tests.

Add a validator that filters out incorrect generated unit tests before automated debugging, and measure the net pass@1 change on a held-out problem set.

Experiment with 3–6 prompt variants (diversified prompting) and pick the accepted candidate using the imagined-execution checker.
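To measure the net pass@1 change mentioned in these experiments, one simple approach is to record one pass/fail outcome per held-out problem for each pipeline variant and compare the fractions. The helper names and the outcome lists below are hypothetical, and pass@1 is assumed here to mean the share of problems solved by the single returned program.

```python
from typing import List


def pass_at_1(outcomes: List[bool]) -> float:
    """Fraction of problems whose single returned program passed the hidden tests."""
    if not outcomes:
        raise ValueError("need at least one problem outcome")
    return sum(outcomes) / len(outcomes)


def net_change_points(baseline: List[bool], treated: List[bool]) -> float:
    """Difference in pass@1, in percentage points (treated minus baseline)."""
    if len(baseline) != len(treated):
        raise ValueError("both runs must cover the same problems")
    return 100.0 * (pass_at_1(treated) - pass_at_1(baseline))


# Hypothetical outcomes for four held-out problems.
without_validator = [True, False, True, False]
with_validator = [True, True, True, False]
print(net_change_points(without_validator, with_validator))
```

Comparing the two runs on the same problem set keeps the difference attributable to the validator and not to a change in problem difficulty.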

Agent Features

Memory
short-term context per workflow attempt (tests, debug traces)
Planning
dynamic control flow: accept/continue/clarify/revert; revert-to-original when trajectory fails
Tool Use
Python interpreter (relaxed setting); LLM chain-of-thought for imagined execution
Frameworks
Diversified Prompting, Imagined Execution
Is Agentic

Yes

Architectures
multi-agent workflow, parallel diversified generators
Collaboration
Code Generator, Test Designer, Self-Debugger, Problem Clarifier, Quality Checker (controller)

Optimization Features

Token Efficiency
diversified prompts increase token use but may save later steps when accepted
Inference Optimization
early accept to skip later expensive agents
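The token trade-off can be made concrete with a small sketch that generates one candidate per prompt variant and stops at the first one the checker accepts. Note that this version runs the variants sequentially to expose the early-accept saving, which is a simplification chosen here; the summary describes the generators as parallel. `first_accepted`, `generate`, and `checker` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple


def first_accepted(
    prompts: List[str],
    generate: Callable[[str], str],
    checker: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Return (program, generations_used); fall back to the first candidate."""
    first: Optional[str] = None
    used = 0
    for prompt in prompts:
        candidate = generate(prompt)
        used += 1
        if first is None:
            first = candidate  # remember the original draft for the fallback
        if checker(candidate):
            return candidate, used  # early accept: remaining prompts are skipped
    return first, used


# Usage with stub generator and checker (hypothetical stand-ins for LLM calls).
program, used = first_accepted(
    prompts=["variant a", "variant b", "variant c"],
    generate=lambda prompt: prompt.upper(),
    checker=lambda candidate: candidate == "VARIANT B",
)
print(program, used)
```

The `generations_used` count gives a direct handle on how often extra prompt variants were actually paid for.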

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Evaluations use Claude-family LLMs; results may not transfer to weaker backend models.

Requires visible unit tests; not applicable when tests are unavailable or sparse.

When Not To Use

When no reliable unit tests exist for the target problems.

When only weak LLM backends are available (the Test Quality Checker, TQC, may hurt performance).

Failure Modes

Imagined Execution can produce incorrect emulated outputs, leading to wrong accept or reject decisions (the paper shows examples).

Synthesized tests are incorrect and, if not filtered, mislead self-debugging into worse code.
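One way to guard against the second failure mode is to gate self-debugging on a test filter, so that debugging only ever sees tests a judge considers trustworthy. This is a sketch under stated assumptions: `filter_tests`, `maybe_self_debug`, `judge`, and `self_debug` are hypothetical names, and how the judge decides (for example, an LLM call) is left open.

```python
from typing import Callable, List


def filter_tests(tests: List[str], judge: Callable[[str], bool]) -> List[str]:
    """Keep only the synthesized tests the judge considers correct."""
    return [t for t in tests if judge(t)]


def maybe_self_debug(
    program: str,
    tests: List[str],
    judge: Callable[[str], bool],
    self_debug: Callable[[str, List[str]], str],
) -> str:
    """Debug against trusted tests only; leave the program untouched if none remain."""
    trusted = filter_tests(tests, judge)
    if not trusted:
        return program  # no trustworthy signal: skipping is safer than guessing
    return self_debug(program, trusted)


# Usage with stubs: the judge rejects a test that asserts 1 + 2 == 4.
tests = ["assert add(1, 2) == 3", "assert add(1, 2) == 4"]
judge = lambda t: t.endswith("== 3")
seen: List[List[str]] = []
result = maybe_self_debug(
    "def add(a, b): return a + b",
    tests,
    judge,
    self_debug=lambda program, trusted: (seen.append(trusted), program)[1],
)
print(seen)
```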

Core Entities

Models

Claude Sonnet-3.5-v2, Claude Opus-3, DeepSeek (as prior baseline)

Metrics

pass@1, pass@5, accuracy, precision, recall

Datasets

MBPP, HumanEval, MBPP-EvalPlus, HumanEval-EvalPlus

Benchmarks

MBPP, HumanEval, MBPP-EvalPlus, HumanEval-EvalPlus