Prevent mistakes before they happen: add per-agent pre-action checks plus post-action learning to multi-agent tool workflows.

Overview

Decision Snapshot: Ready For Pilot

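The design is practical and validated on two public benchmarks (StableToolBench and TravelPlanner). Results depend on base LLM quality and token budget, and the ablation shows that removing intra-reflection lowers the StableToolBench average pass rate by 7.0 percentage points.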

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 40%

Production readiness: 65%

Novelty: 60%

Authors

Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao

Links

Abstract / PDF / Data

Why It Matters For Business

MIRROR lowers multi-step tool-call failures by combining pre-action checks and post-action learning, improving automation reliability without retraining core models.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

MIRROR is a three-agent framework (Planner, Tool, Answer) that pairs a pre-action self-check (intra-reflection) inside each agent with post-action learning across rounds (inter-reflection). On tool-use benchmarks, MIRROR raises pass/delivery rates substantially vs. existing reflection or single-agent baselines. Key practical takeaways: add short, score-driven self-evaluations before each tool call, keep short-term and task-specific long-term memories, and tune the number of inter-reflection rounds to balance tokens and accuracy.

Problem Statement

LLM-based agents still make preventable errors when coordinating tools across multi-step tasks. Existing reflection methods only learn after actions, which wastes tokens and lets errors propagate. The paper asks: can agents anticipate bad outcomes before acting and then combine pre-action checks with post-action learning to reduce errors?

Main Contribution

Introduce intra-reflection: a lightweight, prompt-based self-evaluation that runs inside each agent before executing or handing off an action.

Design MIRROR: a three-agent pipeline (Planner/Tool/Answer) that pairs intra-reflection with inter-reflection via short-term and task-specific long-term memory.
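
The intra-reflection idea above can be sketched as a small score gate. This is a minimal illustration, not the authors' implementation: `score_fn` and `revise_fn` are hypothetical stand-ins for LLM prompts, and the threshold of 0.7 and the retry cap of 2 are illustrative values that do not come from the paper.

```python
from typing import Callable, Tuple


def intra_reflect(
    draft: str,
    score_fn: Callable[[str], float],        # hypothetical: self-evaluation returning 0..1
    revise_fn: Callable[[str, float], str],  # hypothetical: rewrites a low-scoring draft
    threshold: float = 0.7,                  # illustrative value, not from the paper
    max_retries: int = 2,                    # illustrative value, not from the paper
) -> Tuple[str, float, bool]:
    """Score a draft action before it is executed or handed off.

    Returns (action, score, accepted). If the score never reaches the
    threshold, the last draft comes back with accepted=False so the caller
    can fall back to post-action (inter-reflection) handling.
    """
    score = score_fn(draft)
    for _ in range(max_retries):
        if score >= threshold:
            break
        draft = revise_fn(draft, score)
        score = score_fn(draft)
    return draft, score, score >= threshold


# Toy stand-ins for the LLM calls so the sketch runs offline.
def toy_score(action: str) -> float:
    return 0.9 if "city=" in action else 0.3


def toy_revise(action: str, score: float) -> str:
    return action + " city=Paris"


result = intra_reflect("search_hotels", toy_score, toy_revise)
print(result)
```

In this toy run the first draft scores below the threshold, is revised once, and is then accepted. The same gate could sit inside the Planner, Tool, and Answer agents, each with its own scoring prompt.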

Key Findings

MIRROR achieves high pass rates on StableToolBench.

Numbers: Average Pass Rate 85.7% (MIRROR, Table 3)

Practical Use: Expect ~85% success on evaluated tool-invocation tasks when using MIRROR with a capable LLM; add intra+inter reflection for better tool reliability.

Evidence Ref: Table 3 (StableToolBench)

Removing all intra-reflection reduces performance by 7.0 percentage points.

Numbers: 85.7% → 78.7% (−7.0 pp) when ablating intra-reflection

Practical Use: Implement per-agent pre-action checks (Planner, Tool, Answer). They materially cut errors and are worth the token cost.

Evidence Ref: Table 3 (Ablation)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
StableToolBench Average Pass Rate (MIRROR)	85.7%	Next best method per-LM	up to +7.0 pp vs next best	StableToolBench (various splits)	Table 3 shows MIRROR average 85.7% on StableToolBench; per-LM gains described in Section 4.2	Table 3; Section 4.2
TravelPlanner Delivery Rate (MIRROR vs ReAct)	e.g., GPT-4o: 100% (MIRROR) vs 85.6% (ReAct)	ReAct	+14.4 pp (example with GPT-4o)	TravelPlanner (validation set)	Table 2 lists per-LLM delivery rates; authors report 14.4%–21.1% increases	Table 2; Section 4.2
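
The deltas reported here are differences in percentage points (pp), which are easy to confuse with relative change. The short check below recomputes both from the figures quoted in this summary; the helper names are illustrative only.

```python
def pp_delta(new_pct: float, base_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - base_pct, 1)


def relative_change(new_pct: float, base_pct: float) -> float:
    """Change relative to the baseline, expressed in percent."""
    return round(100 * (new_pct - base_pct) / base_pct, 1)


# Ablation of intra-reflection on StableToolBench: 85.7% -> 78.7%
ablation_pp = pp_delta(78.7, 85.7)
ablation_rel = relative_change(78.7, 85.7)

# TravelPlanner delivery rate with GPT-4o: ReAct 85.6% -> MIRROR 100%
travel_pp = pp_delta(100.0, 85.6)
travel_rel = relative_change(100.0, 85.6)

print(ablation_pp, ablation_rel, travel_pp, travel_rel)
```

The ablation costs 7.0 pp, which is roughly an 8.2% relative drop; the GPT-4o TravelPlanner gain of 14.4 pp is roughly a 16.8% relative improvement over ReAct.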

What To Try In 7 Days

Add a short pre-execution self-check prompt inside each agent that scores planned actions and retries below a threshold.

Keep a short-term per-task memory of recent failures to adapt tool parameter choices quickly.

Run 3–5 rounds of lightweight post-execution reflection and measure token vs. success trade-offs.
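
One way to run the round-count experiment suggested above is a small sweep that records success rate and mean token use per round budget. This is a sketch under stated assumptions: the `attempt` callable and its signature are hypothetical, the toy task names are invented for the demo, and the flat 1,000 tokens per round is a placeholder rather than a measured figure.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (task, failure_notes) -> (success, tokens_used, failure_note)
Attempt = Callable[[str, List[str]], Tuple[bool, int, str]]


def run_with_rounds(task: str, attempt: Attempt, max_rounds: int) -> Tuple[bool, int]:
    """Retry a task for up to max_rounds, feeding earlier failures back in."""
    notes: List[str] = []  # short-term memory of failures for this task
    total_tokens = 0
    for _ in range(max_rounds):
        ok, tokens, note = attempt(task, notes)
        total_tokens += tokens
        if ok:
            return True, total_tokens
        notes.append(note)
    return False, total_tokens


def sweep(tasks: Sequence[str], attempt: Attempt, budgets: Sequence[int]):
    """Return (budget, success_rate, mean_tokens) for each round budget."""
    rows = []
    for budget in budgets:
        results = [run_with_rounds(t, attempt, budget) for t in tasks]
        success_rate = sum(ok for ok, _ in results) / len(results)
        mean_tokens = sum(tok for _, tok in results) / len(results)
        rows.append((budget, success_rate, mean_tokens))
    return rows


# Toy attempt: a task named "hard-k" succeeds once k failure notes exist.
def toy_attempt(task: str, notes: List[str]) -> Tuple[bool, int, str]:
    needed = int(task.split("-")[1])
    ok = len(notes) >= needed
    return ok, 1000, "" if ok else f"round {len(notes) + 1} failed"


table = sweep(["hard-0", "hard-2", "hard-4"], toy_attempt, budgets=[1, 3, 5])
for budget, rate, tokens in table:
    print(budget, round(rate, 2), round(tokens))
```

With real agents, plotting success rate against mean tokens for each budget shows where extra rounds stop paying for themselves.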

Agent Features

Memory

Short-Term Memory (STM) for per-subtask failures

Long-Term Memory (LTM) task-specific trajectory storage
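
A minimal sketch of what such a dual memory might look like as plain data structures. The class names, method names, and the cap of recent failures are assumptions made for illustration; the paper's actual storage format is not described in this summary.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Sequence


@dataclass
class ShortTermMemory:
    """Recent failure notes per subtask, bounded so prompts stay short."""
    max_items: int = 3  # illustrative cap, not from the paper
    _failures: Dict[str, Deque[str]] = field(default_factory=dict)

    def record_failure(self, subtask_id: str, note: str) -> None:
        bucket = self._failures.setdefault(subtask_id, deque(maxlen=self.max_items))
        bucket.append(note)

    def recall(self, subtask_id: str) -> List[str]:
        return list(self._failures.get(subtask_id, ()))


@dataclass
class LongTermMemory:
    """Whole trajectories kept per task across inter-reflection rounds."""
    _trajectories: Dict[str, List[List[str]]] = field(default_factory=dict)

    def store(self, task_id: str, trajectory: Sequence[str]) -> None:
        self._trajectories.setdefault(task_id, []).append(list(trajectory))

    def recall(self, task_id: str) -> List[List[str]]:
        return self._trajectories.get(task_id, [])


stm = ShortTermMemory(max_items=2)
for note in ["bad date format", "missing city", "rate limited"]:
    stm.record_failure("book_hotel", note)
recent = stm.recall("book_hotel")  # only the two most recent notes survive

ltm = LongTermMemory()
ltm.store("trip-1", ["plan", "search_hotels", "answer"])
print(recent, ltm.recall("trip-1"))
```

Bounding the short-term store keeps the failure context short enough to paste into the next tool-call prompt, while the long-term store keeps full trajectories for later rounds on the same task.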

Planning

task decomposition

topological ordering of subtasks

Planner intra-reflection score gating
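
Topological ordering of subtasks can be illustrated with Python's standard `graphlib` module (Python 3.9+). The subtask names and dependency map below are invented for the example and are not taken from the paper.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def order_subtasks(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return subtasks so that each one appears after everything it depends on.

    `dependencies` maps a subtask to the set of subtasks it needs first.
    Raises graphlib.CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical travel plan: bookings need a city, the itinerary needs both bookings.
plan = {
    "pick_city": set(),
    "book_hotel": {"pick_city"},
    "book_flight": {"pick_city"},
    "build_itinerary": {"book_hotel", "book_flight"},
}
order = order_subtasks(plan)
print(order)

# A circular plan is rejected rather than silently looping.
try:
    order_subtasks({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The relative order of independent subtasks (here the two bookings) is not fixed, so downstream code should rely only on the dependency guarantee.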

Tool Use

tool selection

parameter selection

prompt-based Tool Agent

function-calling mode evaluated

Frameworks

MIRROR

Is Agentic

Yes

Architectures

multi-agent (Planner, Tool, Answer)

dual-memory (short-term and task-specific long-term)

Collaboration

reflection-gated handoffs between agents

iterative inter-reflection rounds

Optimization Features

Token Efficiency

Claims improved tokens/Pass trade-off vs DFSDT/Smurfs

Reports token counts per inter-reflection round (12.8k–17.2k)

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

StableToolBench (paper), TravelPlanner (paper)

Risks & Boundaries

Limitations

Performance depends on the base LLM; gains are smaller with weaker base models.

Higher token use from dual reflection; cost-sensitive applications may need tuning.

When Not To Use

When token cost or latency is a strict constraint and you cannot afford multiple reflection rounds.

For single-step tasks with trivial tool calls where pre-check overhead outweighs benefit.

Failure Modes

Excessive reflection rounds can harm performance via redundancy and reasoning degradation.

Poor initial decomposition can cause repeated cycles unless inter-reflection corrects it.

Core Entities

Models

gpt-3.5-turbo, gpt-4o-mini, gpt-4o, gpt-4-turbo, Claude 3 Haiku, Qwen2.5-72B

Metrics

Pass Rate, Win Rate, Delivery Rate, Commonsense Constraint Pass Rate, Hard Constraint Pass Rate, Final Pass Rate

Datasets

StableToolBench, TravelPlanner

Benchmarks

StableToolBench, TravelPlanner, BFCL-v1 (external mention)
