A closed-loop Sensing→Regulating→Correcting system that routes LLM execution by uncertainty to cut errors and API cost

February 28, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Yandong Yan, Junwei Peng, Shijie Li, Chenxi Li, Yifei Shang, Can Deng, Ruiting Dai, Yongqiang Zhao, Jiaqi Zhu, Yu Huang

Links

Abstract / PDF

Why It Matters For Business

DenoiseFlow raises end-to-end reliability of multi-step LLM agents while cutting average API and token cost by roughly 40–56%, so you get fewer failures and lower inference bills without retraining models.

Summary TLDR

DenoiseFlow treats multi-step LLM workflows as a Noisy MDP and runs a closed-loop three-stage pipeline: (1) Sensing estimates per-step semantic uncertainty via small Monte Carlo sampling and clustering; (2) Regulating routes steps to Direct, Branch, or Refine modes using calibrated risk scores; (3) Correcting traces and fixes root causes via dependency-graph influence and targeted re-generation. On six benchmarks (math, code, multi-hop QA) it yields an average score of 83.3% (+1.3% vs the best reproduced baseline) while cutting compute cost by about 40–56% through adaptive branching. The system uses online self-calibration so it adapts without labeled data.
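The Regulating stage described above can be pictured as a threshold rule over the calibrated risk score. The following is a minimal sketch, not the paper's implementation: the `Mode` enum, the `route` function, the threshold values `low` and `high`, and the assumption that low risk maps to Direct, medium risk to Branch, and high risk to Refine are all illustrative choices.

```python
from enum import Enum


class Mode(Enum):
    DIRECT = "direct"   # single-path execution
    BRANCH = "branch"   # multi-path exploration
    REFINE = "refine"   # targeted re-generation


def route(risk: float, low: float = 0.3, high: float = 0.7) -> Mode:
    """Map a calibrated risk score in [0, 1] to an execution mode.

    The thresholds are hypothetical defaults; in an online setting they
    would be adjusted from verifier feedback.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must lie in [0, 1]")
    if risk < low:
        return Mode.DIRECT
    if risk < high:
        return Mode.BRANCH
    return Mode.REFINE
```

With these assumed thresholds, a step with risk 0.1 runs once, a step with risk 0.5 is expanded into several candidates, and a step with risk 0.9 is sent for refinement.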

Problem Statement

Long multi-step LLM workflows accumulate small interpretation errors across steps, producing silent failures. Prior approaches explore with a fixed budget, restart broadly after errors, or ignore uncertainty altogether. The paper asks how to detect and act on semantic uncertainty at runtime to prevent error cascades while keeping compute costs practical.

Main Contribution

Noisy MDP formulation: recast multi-step LLM execution as stochastic transitions and accumulated semantic divergence.

DenoiseFlow system: a closed-loop Sensing→Regulating→Correcting pipeline that quantifies uncertainty, routes execution, and performs targeted recovery.

Uncertainty-aware adaptive branching: use Monte Carlo sampling + semantic clustering to decide when to run single-path vs. multi-path exploration.

Influence-based root-cause localization and asymmetric calibration: trace failures on a dependency graph and force local re-exploration.

Online self-calibration: adjust uncertainty thresholds from verifier feedback without ground-truth labels.
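Online self-calibration can be sketched as a small feedback loop that compares predicted risk with the verifier's observed failure rate. This is an illustration under stated assumptions, not the paper's update rule: the `OnlineCalibrator` class, the power-law form `raw ** temperature`, the multiplicative `step`, and the window of 20 observations are all hypothetical choices.

```python
class OnlineCalibrator:
    """Adjust a temperature-like exponent from verifier pass/fail signals."""

    def __init__(self, window: int = 20, step: float = 0.1):
        self.temperature = 1.0
        self.window = window
        self.step = step
        self._buffer = []  # pairs of (calibrated risk, verifier failed?)

    def calibrate(self, raw: float) -> float:
        # Assumed form: for raw in [0, 1], an exponent below 1 raises risk
        # and an exponent above 1 lowers it.
        return raw ** self.temperature

    def observe(self, raw: float, verifier_passed: bool) -> None:
        self._buffer.append((self.calibrate(raw), not verifier_passed))
        if len(self._buffer) >= self.window:
            self._update()

    def _update(self) -> None:
        n = len(self._buffer)
        mean_risk = sum(risk for risk, _ in self._buffer) / n
        fail_rate = sum(failed for _, failed in self._buffer) / n
        if fail_rate > mean_risk:
            self.temperature *= 1 - self.step   # risk was underestimated
        elif fail_rate < mean_risk:
            self.temperature *= 1 + self.step   # risk was overestimated
        self._buffer.clear()
```

The loop needs only verifier outcomes, not ground-truth labels, which is the property the contribution above emphasizes.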

Key Findings

DenoiseFlow improves average benchmark performance versus strong baselines.

Numbers: 83.3% average accuracy vs JudgeFlow 82.0% (+1.3%)

Adaptive branching reduces compute cost while keeping accuracy.

Numbers: 40–56% cost reduction at matched accuracy vs fixed branching

Adaptive branching is the most critical component for accuracy.

Numbers: Removing Adaptive Branching causes −3.87% average accuracy drop

Online calibration materially improves results, especially for code tasks.

Numbers: Removing online calibration: −2.20% avg; MBPP −4.65%

Estimated uncertainty ranks well against actual difficulty.

Numbers: Binned Spearman ρ = −0.782 between risk score and success rate
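A binned Spearman correlation of this kind can be computed by grouping problems into risk bins, taking the success rate per bin, and rank-correlating the two. The sketch below is a generic pure-Python version; the equal-width binning, the default of five bins, and the function names are assumptions, since the summary does not state how the paper binned its data.

```python
def ranks(values):
    """Return 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5  # undefined if either side is constant


def binned_spearman(risks, successes, n_bins=5):
    """Spearman correlation between mean risk and success rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for risk, success in zip(risks, successes):
        index = min(int(risk * n_bins), n_bins - 1)
        bins[index].append((risk, success))
    filled = [b for b in bins if b]
    if len(filled) < 2:
        raise ValueError("need at least two non-empty bins")
    mean_risk = [sum(r for r, _ in b) / len(b) for b in filled]
    success_rate = [sum(s for _, s in b) / len(b) for b in filled]
    return pearson(ranks(mean_risk), ranks(success_rate))
```

A strongly negative value, such as the −0.782 reported above, means that bins with higher estimated risk tend to have lower success rates, which is what a useful uncertainty signal should show.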

Results

| Metric | DenoiseFlow | Baseline (JudgeFlow) |
|---|---|---|
| GSM8K accuracy | 93.9% | 93.0% |
| MATH accuracy | 61.4% | 58.5% |
| MBPP pass@1 | 84.9% | 83.8% |
| HumanEval pass@1 | 93.9% | 93.4% |
| HotpotQA F1 | 77.5 | 77.4 |
| DROP F1 | 87.9 | 86.1 |
| Average (six benchmarks) | 83.3% | 82.0% |
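The six-benchmark average can be checked directly from the per-benchmark values listed above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Per-benchmark scores in the order they are listed above.
denoiseflow = [93.9, 61.4, 84.9, 93.9, 77.5, 87.9]
judgeflow = [93.0, 58.5, 83.8, 93.4, 77.4, 86.1]


def mean(values):
    return sum(values) / len(values)


denoiseflow_avg = mean(denoiseflow)   # about 83.25
judgeflow_avg = mean(judgeflow)       # about 82.03
gap = denoiseflow_avg - judgeflow_avg  # about 1.22

print(f"{denoiseflow_avg:.2f} {judgeflow_avg:.2f} {gap:.2f}")
```

The unrounded gap is about 1.22 points; the reported +1.3 corresponds to the difference between the rounded averages (83.3 − 82.0).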

Who Should Care

What To Try In 7 Days

Run N=5 Monte Carlo samples for each critical step and cluster outputs to estimate entropy.

Route steps with high calibrated risk into K parallel candidates (K up to 7) and pick by consensus plus verifier checks.

Add a simple verifier (unit tests or an answer checker) and update the calibration temperature every ~20 problems to keep risk scores calibrated.
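The first two steps above can be prototyped in a few lines. This sketch makes several simplifying assumptions: clustering is approximated by exact match after text normalization (a stand-in for real semantic clustering, for example with embeddings or an entailment model), entropy is normalized by log N, and the linear mapping from risk to branch count K is a hypothetical choice capped at 7.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Crude stand-in for semantic clustering: lowercase, collapse spaces."""
    return " ".join(answer.lower().split())


def semantic_entropy(samples) -> float:
    """Normalized entropy in [0, 1] over clusters of sampled outputs."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return 0.0
    clusters = Counter(normalize(s) for s in samples)
    entropy = sum((c / n) * math.log(n / c) for c in clusters.values())
    return entropy / math.log(n)


def branch_count(risk: float, k_max: int = 7) -> int:
    """Assumed linear mapping from risk in [0, 1] to 1..k_max candidates."""
    return max(1, min(k_max, 1 + round(risk * (k_max - 1))))


def consensus(candidates) -> str:
    """Pick the most common normalized candidate (majority vote)."""
    return Counter(normalize(c) for c in candidates).most_common(1)[0][0]
```

For example, five identical samples give entropy 0 and a single path, while five mutually different samples give entropy close to 1 and the full seven candidates. A verifier check would then be applied to the consensus answer before accepting it.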

Agent Features

Memory

  • probabilistic dependency graph (short-term dependency tracking)
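One way to picture root-cause tracing on such a dependency graph is to walk backwards from the failed step and score each ancestor by how strongly it influences the failure and how uncertain it was. The sketch below is illustrative only: the graph encoding, the edge influence weights, the product-of-weights path influence, and the `influence × uncertainty` score are assumptions rather than the paper's definitions.

```python
def localize_root_cause(parents, uncertainty, failed_step):
    """Return the step most likely responsible for a failure.

    parents: maps each step to {parent_step: influence weight in [0, 1]}.
    uncertainty: maps each step to its estimated uncertainty in [0, 1].
    """
    best_step = failed_step
    best_score = uncertainty[failed_step]
    strongest = {failed_step: 1.0}  # strongest known path influence per step
    stack = [(failed_step, 1.0)]
    while stack:
        step, influence = stack.pop()
        for parent, weight in parents.get(step, {}).items():
            path_influence = influence * weight
            if path_influence <= strongest.get(parent, 0.0):
                continue  # an equal or stronger path was already explored
            strongest[parent] = path_influence
            score = path_influence * uncertainty[parent]
            if score > best_score:
                best_step, best_score = parent, score
            stack.append((parent, path_influence))
    return best_step
```

Only the returned step and its descendants would then need re-generation, instead of restarting the whole workflow.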

Planning

  • uncertainty-aware adaptive branching
  • progressive denoising under Noisy MDP

Tool Use

  • external verifiers (unit tests, checkers)
  • Monte Carlo sampling for uncertainty estimation

Frameworks

  • DenoiseFlow

Is Agentic

true

Architectures

  • closed-loop three-stage (Sensing-Regulating-Correcting)

Collaboration

  • compatible with offline workflow-discovery systems

Optimization Features

Token Efficiency

  • Accuracy

Infra Optimization

  • batch parallelization of N Monte Carlo calls per step

System Optimization

  • online temperature calibration to avoid systematic misallocation
  • parallel Monte Carlo sampling
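Issuing the N Monte Carlo calls for a step concurrently keeps the added latency close to that of a single call. The sketch below uses `asyncio.gather` from the Python standard library; `fake_llm_call` is a hypothetical stub standing in for a real asynchronous API client, and its seeded random answers exist only to make the example self-contained.

```python
import asyncio
import random


async def fake_llm_call(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an async LLM request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return random.Random(seed).choice(["A", "A", "B"])


async def sample_step(prompt: str, n: int = 5, llm=fake_llm_call):
    """Run n sampling calls concurrently and return their outputs in order."""
    return await asyncio.gather(*(llm(prompt, seed=i) for i in range(n)))


if __name__ == "__main__":
    samples = asyncio.run(sample_step("step prompt", n=5))
    print(samples)
```

The returned samples can feed both the uncertainty estimate and the later branching stage, in line with the sample reuse listed under Inference Optimization.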

Inference Optimization

  • adaptive branching reduces average LLM calls
  • reuse Monte Carlo samples for branching to limit extra calls

Reproducibility

Data Urls

  • GSM8K
  • MATH
  • MBPP
  • HumanEval
  • HotpotQA
  • DROP

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies on verifier signals; open-ended tasks without clear verifiers get less benefit.
  • Calibration needs ~20 warm-up examples and may misestimate during cold start.
  • Experiments mainly use GPT-4o-mini; other models may need retuning of thresholds.
  • Monte Carlo sampling adds overhead on trivially easy problems.

When Not To Use

  • Open-ended creative generation where verification is subjective or unavailable.
  • Low-latency microservices where any Monte Carlo overhead is unacceptable.
  • Cold-start workflows with fewer than ~20 examples and no prior calibration data.

Failure Modes

  • Accumulated semantic ambiguity across deep dependency chains.
  • Miscalibrated uncertainty causing over-branching or missed exploration.
  • Verifier blind spots that misclassify correct branches as failures.
  • High upfront cost from Stage 1 sampling for trivial inputs.

Core Entities

Models

  • GPT-4o-mini (backbone)
  • GPT-4o
  • DeepSeek-V2.5

Metrics

  • Accuracy
  • Pass@1
  • F1
  • Average LLM calls / problem
  • Cost per problem

Datasets

  • GSM8K
  • MATH
  • MBPP
  • HumanEval
  • HotpotQA
  • DROP

Benchmarks

  • GSM8K
  • MATH
  • MBPP
  • HumanEval
  • HotpotQA
  • DROP