Prevent mistakes before they happen: add per-agent pre-action checks plus post-action learning to multi-agent tool workflows.

May 27, 2025 | 7 min

Overview

Decision Snapshot: Ready For Pilot

The design is practical and validated on two public benchmarks (StableToolBench and TravelPlanner). Results depend on base LLM quality and token budget; the ablation attributes 7.0 percentage points of the StableToolBench pass rate to intra-reflection.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 65%

Novelty: 60%

Authors

Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao


Why It Matters For Business

MIRROR lowers multi-step tool-call failures by combining pre-action checks and post-action learning, improving automation reliability without retraining core models.

Who Should Care

Summary TLDR

MIRROR is a three-agent framework (Planner, Tool, Answer) that pairs a pre-action self-check inside each agent (intra-reflection) with post-action learning across rounds (inter-reflection). On tool-use benchmarks it outperforms existing reflection and single-agent baselines: an 85.7% average pass rate on StableToolBench, and delivery-rate increases of 14.4% to 21.1% over ReAct on TravelPlanner. Key practical takeaways: add short, score-driven self-evaluations before each tool call, keep short-term and task-specific long-term memories, and tune the number of inter-reflection rounds to balance tokens and accuracy.

Problem Statement

LLM-based agents still make preventable errors when coordinating tools across multi-step tasks. Existing reflection methods only learn after actions, which wastes tokens and lets errors propagate. The paper asks: can agents anticipate bad outcomes before acting and then combine pre-action checks with post-action learning to reduce errors?

Main Contribution

Introduce intra-reflection: a lightweight, prompt-based self-evaluation that runs inside each agent before executing or handing off an action.

Design MIRROR: a three-agent pipeline (Planner/Tool/Answer) that pairs intra-reflection with inter-reflection via short-term and task-specific long-term memory.
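A minimal sketch of what a score-gated pre-action check could look like. This is not the authors' implementation: the names `Reflection` and `gated_action`, the 0 to 1 score scale, and the 0.7 threshold are all assumptions made for illustration, and the `propose` and `reflect` stubs stand in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Reflection:
    score: float    # self-assessed quality; the [0, 1] scale is an assumption
    feedback: str   # critique passed back into the next proposal


def gated_action(
    propose: Callable[[Optional[str]], str],
    reflect: Callable[[str], Reflection],
    threshold: float = 0.7,   # hypothetical cut-off, tune per agent
    max_retries: int = 2,
) -> Tuple[str, Reflection]:
    """Propose an action, self-evaluate it, and revise while the score is low."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(feedback)
        reflection = reflect(action)
        if reflection.score >= threshold:
            break
        feedback = reflection.feedback
    # The last proposal is returned even if it never cleared the threshold.
    return action, reflection


# Stubs standing in for LLM calls (illustrative only).
def propose(feedback: Optional[str]) -> str:
    if feedback is None:
        return "search_flights(origin='SFO')"
    return "search_flights(origin='SFO', dest='JFK')"


def reflect(action: str) -> Reflection:
    if "dest=" in action:
        return Reflection(0.9, "ok")
    return Reflection(0.3, "missing destination parameter")


action, reflection = gated_action(propose, reflect)
print(action, reflection.score)
```

The gate runs before anything is executed or handed to the next agent, so a low-scoring proposal costs one extra prompt rather than a failed tool call.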

Key Findings

MIRROR achieves high pass rates on StableToolBench.

Numbers: Average Pass Rate 85.7% (MIRROR, Table 3)

Practical Use: Expect ~85% success on evaluated tool-invocation tasks when using MIRROR with a capable LLM; add intra+inter reflection for better tool reliability.

Evidence Ref: Table 3 (StableToolBench)

Removing all intra-reflection reduces performance by 7.0 percentage points.

Numbers: 85.7% → 78.7% (−7.0 pp) when ablating intra-reflection

Practical Use: Implement per-agent pre-action checks (Planner, Tool, Answer). They materially cut errors and are worth the token cost.

Evidence Ref: Table 3 (Ablation)
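The ablation drop is an absolute difference in percentage points, which is not the same as a relative change. The short calculation below uses the two pass rates quoted above; the helper names are made up for this example.

```python
def pp_delta(new: float, old: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new - old, 1)


def relative_change(new: float, old: float) -> float:
    """Change relative to the old value, in percent."""
    return round((new - old) / old * 100, 1)


full, ablated = 85.7, 78.7   # pass rates with and without intra-reflection

print(pp_delta(ablated, full))         # -7.0 percentage points
print(relative_change(ablated, full))  # -8.2 percent relative to 85.7
```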

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| StableToolBench Average Pass Rate (MIRROR) | 85.7% | Next best method per-LM | up to +7.0 pp vs next best | StableToolBench (various splits) | Table 3 shows MIRROR average 85.7% on StableToolBench; per-LM gains described in Section 4.2 | Table 3; Section 4.2 |
| TravelPlanner Delivery Rate (MIRROR vs ReAct) | e.g., GPT-4o: 100% (MIRROR) vs 85.6% (ReAct) | ReAct | +14.4 pp (example with GPT-4o) | TravelPlanner (validation set) | Table 2 lists per-LLM delivery rates; authors report 14.4%–21.1% increases | Table 2; Section 4.2 |

What To Try In 7 Days

Add a short pre-execution self-check prompt inside each agent that scores planned actions and retries below a threshold.

Keep a short-term per-task memory of recent failures to adapt tool parameter choices quickly.

Run 3–5 rounds of lightweight post-execution reflection and measure token vs. success trade-offs.
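One way to prototype the last two steps together is a retry loop that carries a per-task failure log between rounds and records how many tokens each outcome cost. This is a sketch under stated assumptions: `run_with_inter_reflection`, the `Attempt` tuple, and the token figures in the stub are invented for illustration and do not come from the paper.

```python
from typing import Callable, List, Tuple

# Outcome of one attempt: (succeeded, tokens_used, lesson_learned)
Attempt = Tuple[bool, int, str]


def run_with_inter_reflection(
    attempt: Callable[[List[str]], Attempt],
    max_rounds: int = 3,
) -> dict:
    """Retry a task for up to max_rounds, feeding earlier failures back in."""
    short_term_memory: List[str] = []   # failure notes for this task only
    total_tokens = 0
    for round_no in range(1, max_rounds + 1):
        ok, tokens, lesson = attempt(short_term_memory)
        total_tokens += tokens
        if ok:
            return {"success": True, "rounds": round_no, "tokens": total_tokens}
        short_term_memory.append(lesson)
    return {"success": False, "rounds": max_rounds, "tokens": total_tokens}


# Stub task: succeeds once it has seen the note about the date format.
# Token counts are placeholders, not measurements.
def fake_attempt(memory: List[str]) -> Attempt:
    if "use ISO dates" in memory:
        return True, 900, ""
    return False, 1000, "use ISO dates"


result = run_with_inter_reflection(fake_attempt, max_rounds=3)
print(result)  # {'success': True, 'rounds': 2, 'tokens': 1900}
```

Logging `rounds` and `tokens` per task makes it straightforward to compare success rates at different `max_rounds` settings during the trial week.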

Agent Features

Memory
Short-Term Memory (STM) for per-subtask failures; Long-Term Memory (LTM) for task-specific trajectory storage
Planning
task decomposition; topological ordering of subtasks; Planner intra-reflection score gating
Tool Use
tool selection; parameter selection; prompt-based Tool Agent; function-calling mode evaluated
Frameworks
MIRROR
Is Agentic

Yes

Architectures
multi-agent (Planner, Tool, Answer); dual-memory (short-term and task-specific long-term)
Collaboration
reflection-gated handoffs between agents; iterative inter-reflection rounds
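The Planning entry mentions topological ordering of subtasks. As an illustration of the idea only (how MIRROR's Planner does this internally is not described here), Python's standard `graphlib` module can order a dependency graph so that every subtask runs after the subtasks it depends on. The subtask names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks, each mapped to the subtasks it depends on.
dependencies = {
    "choose_city": set(),
    "book_flight": {"choose_city"},
    "book_hotel": {"choose_city"},
    "build_itinerary": {"book_flight", "book_hotel"},
}

# static_order() yields each node only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The relative order of `book_flight` and `book_hotel` is not fixed by the graph, so either may come first; `graphlib` raises `CycleError` if the decomposition contains a circular dependency.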

Optimization Features

Token Efficiency
Claims improved tokens/Pass trade-off vs DFSDT/Smurfs; reports token counts per inter-reflection round (12.8k–17.2k)
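To track a tokens/Pass figure in your own pilot, one plausible reading of the metric is total tokens spent divided by the number of tasks that passed. That definition is an assumption, and the numbers below are placeholders rather than results from the paper.

```python
def tokens_per_pass(total_tokens: int, passed_tasks: int) -> float:
    """Average tokens spent per successfully completed task."""
    if passed_tasks == 0:
        return float("inf")   # nothing passed, so the cost per pass is unbounded
    return total_tokens / passed_tasks


# Placeholder configurations (not measurements).
runs = {
    "config_a": {"tokens": 1_000_000, "passed": 50},
    "config_b": {"tokens": 1_500_000, "passed": 60},
}

costs = {name: tokens_per_pass(r["tokens"], r["passed"]) for name, r in runs.items()}
print(costs)  # {'config_a': 20000.0, 'config_b': 25000.0}
```

In this made-up example `config_b` passes more tasks but pays more per pass, which is the kind of trade-off worth checking when adding reflection rounds.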

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

StableToolBench (paper); TravelPlanner (paper)

Risks & Boundaries

Limitations

Performance depends on the base LLM; gains are smaller with weaker base models.

Higher token use from dual reflection; cost-sensitive applications may need tuning.

When Not To Use

When token cost or latency is a strict constraint and you cannot afford multiple reflection rounds.

For single-step tasks with trivial tool calls where pre-check overhead outweighs benefit.

Failure Modes

Excessive reflection rounds can harm performance via redundancy and reasoning degradation.

Poor initial decomposition can cause repeated cycles unless inter-reflection corrects it.

Core Entities

Models

gpt-3.5-turbo; gpt-4o-mini; gpt-4o; gpt-4-turbo; Claude 3 Haiku; Qwen2.5-72B

Metrics

Pass Rate; Win Rate; Delivery Rate; Commonsense Constraint Pass Rate; Hard Constraint Pass Rate; Final Pass Rate

Datasets

StableToolBench; TravelPlanner

Benchmarks

StableToolBench; TravelPlanner; BFCL-v1 (external mention)