Prevent mistakes before they happen: add per-agent pre-action checks plus post-action learning to multi-agent tool workflows.

May 27, 2025 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao

Why It Matters For Business

MIRROR lowers multi-step tool-call failures by combining pre-action checks and post-action learning, improving automation reliability without retraining core models.

Summary TLDR

MIRROR is a three-agent framework (Planner, Tool, Answer) that pairs a pre-action self-check (intra-reflection) inside each agent with post-action learning across rounds (inter-reflection). On tool-use benchmarks, MIRROR raises pass/delivery rates substantially vs. existing reflection or single-agent baselines. Key practical takeaways: add short, score-driven self-evaluations before each tool call, keep short-term and task-specific long-term memories, and tune the number of inter-reflection rounds to balance tokens and accuracy.

Problem Statement

LLM-based agents still make preventable errors when coordinating tools across multi-step tasks. Existing reflection methods only learn after actions, which wastes tokens and lets errors propagate. The paper asks: can agents anticipate bad outcomes before acting and then combine pre-action checks with post-action learning to reduce errors?

Main Contribution

Introduce intra-reflection: a lightweight, prompt-based self-evaluation that runs inside each agent before executing or handing off an action.

Design MIRROR: a three-agent pipeline (Planner/Tool/Answer) that pairs intra-reflection with inter-reflection via short-term and task-specific long-term memory.

Evaluate on StableToolBench and TravelPlanner, showing consistent gains over ReAct, Reflexion, DFSDT, Smurfs, and SFT baselines.
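The intra-reflection idea above can be sketched as a score gate with a retry cap. This is a minimal illustration, not the paper's implementation: `intra_reflect`, `propose`, `evaluate`, the 0-10 score scale, the threshold of 8, and the cap of 3 attempts are all hypothetical. In a real agent, `propose` and `evaluate` would be LLM prompts; here they are stubs.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class GateResult:
    action: str
    score: float
    attempts: int
    accepted: bool


def intra_reflect(
    propose: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float = 8.0,   # hypothetical cut-off on a 0-10 scale
    max_attempts: int = 3,    # cap so a miscalibrated scorer cannot loop forever
) -> GateResult:
    """Score a proposed action before executing it; retry with critique if too low."""
    feedback = ""
    best_action, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        action = propose(feedback)
        score, critique = evaluate(action)
        if score > best_score:
            best_action, best_score = action, score
        if score >= threshold:
            return GateResult(action, score, attempt, True)
        feedback = critique  # feed the critique into the next proposal
    # No proposal cleared the gate: return the best one, flagged as not accepted.
    return GateResult(best_action, best_score, max_attempts, False)


# Stub agents standing in for LLM calls (illustrative only).
def propose(feedback: str) -> str:
    return "search_flights(date='2025-06-01')" if feedback else "search_flights()"


def evaluate(action: str) -> Tuple[float, str]:
    if "date=" in action:
        return 9.0, ""
    return 4.0, "missing required 'date' parameter"


result = intra_reflect(propose, evaluate)
print(result)
```

With these stubs, the first proposal is rejected, the critique is passed back, and the second proposal clears the gate. The attempt cap matters because a miscalibrated scorer could otherwise reject every action indefinitely.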

Key Findings

MIRROR achieves high pass rates on StableToolBench.

Numbers: Average Pass Rate 85.7% (MIRROR, Table 3)

Removing all intra-reflection reduces performance by 7.0 percentage points.

Numbers: 85.7% → 78.7% (−7.0 percentage points) when ablating intra-reflection

Inter-reflection rounds trade off tokens and accuracy; 5 rounds is optimal in their tests.

Numbers: 5 rounds: 85.7% @ 13.6k tokens; 3 rounds: 83.4% @ 12.8k; 7 rounds: 82.3% @ 17.2k

MIRROR outperforms supervised fine-tuned tool models on tool benchmarks.

Numbers: Pass Rate ~85.7% (MIRROR) vs. ~46% (ToolLlama-2/ToolGen)

MIRROR increases TravelPlanner delivery and constraint pass rates vs. ReAct.

Numbers: Delivery Rate uplift of 14.4–21.1 percentage points depending on LLM (e.g., GPT-4o: 85.6% → 100%)

Results

StableToolBench Average Pass Rate (MIRROR)

Value: 85.7%

Baseline: next-best method per LM

TravelPlanner Delivery Rate (MIRROR vs. ReAct)

Value: e.g., GPT-4o: 100% (MIRROR) vs. 85.6% (ReAct)

Baseline: ReAct

Ablation: remove all intra-reflection

Value: 78.7% average Pass Rate

Baseline: MIRROR 85.7%

Accuracy vs. inter-reflection rounds

Value: 5 rounds: 85.7% @ 13.6k tokens/query

Baseline: 3 rounds: 83.4% @ 12.8k; 7 rounds: 82.3% @ 17.2k
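The rounds-versus-tokens numbers above can be compared with a little arithmetic. The sketch below uses only the reported figures; "tokens per passed query" is a derived metric introduced here for illustration (it is not reported in the paper) and assumes the token counts are averages over all queries, passed or failed.

```python
# rounds -> (pass rate in %, thousands of tokens per query), as reported above.
REPORTED = {3: (83.4, 12.8), 5: (85.7, 13.6), 7: (82.3, 17.2)}


def k_tokens_per_pass(pass_rate_pct: float, k_tokens: float) -> float:
    """Average thousands of tokens spent per passed query (derived metric)."""
    return k_tokens / (pass_rate_pct / 100.0)


def marginal(from_rounds: int, to_rounds: int) -> tuple:
    """Change in (pass-rate points, k tokens) when moving between settings."""
    p0, t0 = REPORTED[from_rounds]
    p1, t1 = REPORTED[to_rounds]
    return round(p1 - p0, 1), round(t1 - t0, 1)


for rounds, (rate, toks) in sorted(REPORTED.items()):
    print(f"{rounds} rounds: {k_tokens_per_pass(rate, toks):.2f}k tokens per passed query")

print("3 -> 5 rounds:", marginal(3, 5))  # (pass-rate points gained, k tokens added)
print("5 -> 7 rounds:", marginal(5, 7))
```

Going from 3 to 5 rounds adds 2.3 points for 0.8k more tokens, while going from 5 to 7 rounds loses 3.4 points and costs 3.6k more. Under the derived metric, 3 rounds is slightly cheaper per passed query than 5, so which setting is preferable depends on whether pass rate or cost matters more for a given workload.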

Who Should Care

What To Try In 7 Days

Add a short pre-execution self-check prompt inside each agent that scores each planned action and retries when the score falls below a threshold.

Keep a short-term per-task memory of recent failures to adapt tool parameter choices quickly.

Run 3–5 rounds of lightweight post-execution reflection and measure token vs. success trade-offs.
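One way to measure the token versus success trade-off from the last step is a bounded retry loop that carries failure notes between rounds and totals the tokens spent. This is a rough sketch under stated assumptions: `run_with_inter_reflection` and the `attempt` callback are hypothetical names, the token counts are made up, and a real `attempt` would run the full agent pipeline and report actual usage.

```python
from typing import Callable, Dict, List, Tuple

# attempt(notes) -> (succeeded, tokens_used, failure_note)
Attempt = Callable[[List[str]], Tuple[bool, int, str]]


def run_with_inter_reflection(attempt: Attempt, max_rounds: int = 5) -> Dict[str, object]:
    """Retry a task for up to max_rounds, passing earlier failure notes forward."""
    notes: List[str] = []  # short-term memory of what went wrong so far
    tokens = 0
    for round_no in range(1, max_rounds + 1):
        ok, used, note = attempt(list(notes))
        tokens += used
        if ok:
            return {"success": True, "rounds": round_no, "tokens": tokens}
        notes.append(note)  # post-action learning: remember the failure
    return {"success": False, "rounds": max_rounds, "tokens": tokens}


# Stub task (illustrative): fails until it has seen one failure note.
def stub_attempt(notes: List[str]) -> Tuple[bool, int, str]:
    if notes:
        return True, 900, ""
    return False, 1200, "hotel search needs a city code, not a city name"


outcome = run_with_inter_reflection(stub_attempt, max_rounds=3)
print(outcome)
```

Running the same task set with `max_rounds` set to 3, 4, and 5 and logging `success` and `tokens` gives the data needed to pick a round budget.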

Agent Features

Memory

  • Short-Term Memory (STM) for per-subtask failures
  • Long-Term Memory (LTM) for task-specific trajectory storage
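The two memories listed above could be held in a structure like the following. This is a sketch only: the class name `DualMemory`, its methods, and the cap of three recent failures per subtask are assumptions made for illustration, not details from the paper.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class DualMemory:
    """Short-term failure notes per subtask plus long-term trajectories per task."""

    def __init__(self, stm_size: int = 3) -> None:
        # STM: only the most recent failures per subtask are kept.
        self._stm: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=stm_size))
        # LTM: full trajectories, keyed by task so they stay task-specific.
        self._ltm: Dict[str, List[List[str]]] = defaultdict(list)

    def record_failure(self, subtask_id: str, note: str) -> None:
        self._stm[subtask_id].append(note)

    def recent_failures(self, subtask_id: str) -> List[str]:
        return list(self._stm.get(subtask_id, ()))

    def store_trajectory(self, task_id: str, steps: List[str]) -> None:
        self._ltm[task_id].append(list(steps))

    def trajectories(self, task_id: str) -> List[List[str]]:
        return [list(t) for t in self._ltm.get(task_id, [])]


memory = DualMemory(stm_size=2)
for note in ["bad date format", "unknown airport code", "rate limited"]:
    memory.record_failure("search_flights", note)
memory.store_trajectory("trip-42", ["plan", "search_flights", "answer"])

print(memory.recent_failures("search_flights"))
print(memory.trajectories("trip-42"))
```

Keying the long-term store by task mirrors the "task-specific" design, which is also why it does not transfer across tasks.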

Planning

  • task decomposition
  • topological ordering of subtasks
  • Planner intra-reflection score gating
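Topological ordering of subtasks can be illustrated with Python's standard `graphlib` module (Python 3.9+). The subtask names and the helper `order_subtasks` are hypothetical, and the paper does not say how its Planner computes the order; this only shows the general technique.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def order_subtasks(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return subtasks so that each one comes after everything it depends on.

    `dependencies` maps a subtask to the set of subtasks it must wait for.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the detected cycle as a list of nodes.
        raise ValueError(f"plan has a dependency cycle: {exc.args[1]}") from exc


# Hypothetical decomposition of a travel-planning request.
plan = {
    "search_flights": set(),
    "search_hotels": {"search_flights"},
    "build_itinerary": {"search_flights", "search_hotels"},
}

order = order_subtasks(plan)
print(order)
```

Rejecting cyclic plans up front is a cheap structural check that a Planner's self-evaluation could include before handing subtasks to the Tool agent.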

Tool Use

  • tool selection
  • parameter selection
  • prompt-based Tool Agent
  • function-calling mode evaluated

Frameworks

  • MIRROR

Is Agentic

true

Architectures

  • multi-agent (Planner, Tool, Answer)
  • dual-memory (short-term and task-specific long-term)

Collaboration

  • reflection-gated handoffs between agents
  • iterative inter-reflection rounds

Optimization Features

Token Efficiency

  • Claims improved tokens/Pass trade-off vs DFSDT/Smurfs
  • Reports token counts per query for 3, 5, and 7 inter-reflection rounds (12.8k–17.2k)

Reproducibility

Data Urls

  • StableToolBench (paper)
  • TravelPlanner (paper)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Performance depends on the base LLM; gains smaller with weaker cores.
  • Higher token use from dual reflection; cost-sensitive applications may need tuning.
  • Task-specific long-term memory limits cross-task generalization.
  • Experiments used a limited set of LLM cores and validation splits.

When Not To Use

  • When token cost or latency is a strict constraint and you cannot afford multiple reflection rounds.
  • For single-step tasks with trivial tool calls where pre-check overhead outweighs benefit.
  • When the base LLM lacks basic tool-handling capability; MIRROR cannot fully compensate.

Failure Modes

  • Excessive reflection rounds can harm performance via redundancy and reasoning degradation.
  • Poor initial decomposition can cause repeated cycles unless inter-reflection corrects it.
  • If intra-reflection criteria are miscalibrated, agents may loop or reject valid actions.

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4o-mini
  • gpt-4o
  • gpt-4-turbo
  • Claude 3 Haiku
  • Qwen2.5-72B

Metrics

  • Pass Rate
  • Win Rate
  • Delivery Rate
  • Commonsense Constraint Pass Rate
  • Hard Constraint Pass Rate
  • Final Pass Rate

Datasets

  • StableToolBench
  • TravelPlanner

Benchmarks

  • StableToolBench
  • TravelPlanner
  • BFCL-v1 (external mention)