Overview
The design is practical and validated on two public benchmarks (StableToolBench and TravelPlanner). Results depend on base LLM quality and token budget, and ablations show that removing intra-reflection lowers performance by 7.0 percentage points.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Unknown
At A Glance
Cost impact: 40%
Production readiness: 65%
Novelty: 60%
Why It Matters For Business
MIRROR lowers multi-step tool-call failures by combining pre-action checks and post-action learning, improving automation reliability without retraining core models.
Who Should Care
Summary TLDR
MIRROR is a three-agent framework (Planner, Tool, Answer) that pairs a pre-action self-check (intra-reflection) inside each agent with post-action learning across rounds (inter-reflection). On tool-use benchmarks, MIRROR raises pass/delivery rates substantially vs. existing reflection or single-agent baselines. Key practical takeaways: add short, score-driven self-evaluations before each tool call, keep short-term and task-specific long-term memories, and tune the number of inter-reflection rounds to balance tokens and accuracy.
Problem Statement
LLM-based agents still make preventable errors when coordinating tools across multi-step tasks. Existing reflection methods learn only after actions have been taken, which wastes tokens and lets errors propagate. The paper asks: can agents anticipate bad outcomes before acting, and then combine pre-action checks with post-action learning to reduce errors?
Main Contribution
Introduce intra-reflection: a lightweight, prompt-based self-evaluation that runs inside each agent before executing or handing off an action.
Design MIRROR: a three-agent pipeline (Planner/Tool/Answer) that pairs intra-reflection with inter-reflection via short-term and task-specific long-term memory.
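
As a rough illustration of the intra-reflection idea, the sketch below wires three stub agents into a pipeline in which each agent scores its own draft before handing it off. Everything in it is hypothetical: the `Agent` class, the `gated_step` helper, the 0-to-1 score scale, and the 0.7 threshold are assumptions made for illustration, and the lambdas stand in for LLM calls. It does not reproduce the paper's implementation and it leaves out inter-reflection and memory entirely.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    name: str
    act: Callable[[str, int], str]       # (input, attempt index) -> draft output
    self_score: Callable[[str], float]   # draft -> self-assessed score in [0, 1]


def gated_step(agent: Agent, payload: str,
               threshold: float = 0.7, max_attempts: int = 3) -> str:
    """Pre-action self-check: redraft until the agent's own score clears the threshold."""
    draft = ""
    for attempt in range(max_attempts):
        draft = agent.act(payload, attempt)
        if agent.self_score(draft) >= threshold:
            break
    return draft  # best effort: the last draft if no attempt passed


def run_pipeline(agents: List[Agent], task: str) -> str:
    """Pass the task through each agent in order, gating every handoff."""
    payload = task
    for agent in agents:
        payload = gated_step(agent, payload)
    return payload


# Stub agents (assumed behaviour): the Planner's first draft scores low,
# so the gate forces one redraft before the plan reaches the Tool agent.
planner = Agent("Planner",
                act=lambda task, attempt: f"plan(v{attempt}) for {task}",
                self_score=lambda draft: 0.9 if "v1" in draft else 0.4)
tool = Agent("Tool",
             act=lambda plan, attempt: f"tool_result[{plan}]",
             self_score=lambda draft: 1.0)
answer = Agent("Answer",
               act=lambda res, attempt: f"answer<{res}>",
               self_score=lambda draft: 1.0)

result = run_pipeline([planner, tool, answer], "book a flight")
print(result)
```

The point of the gate is that a low-scoring plan never reaches the Tool agent, so no tool call is spent on it.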
Key Findings
MIRROR achieves an average pass rate of 85.7% on StableToolBench (Table 3).
Removing all intra-reflection reduces performance by 7.0 percentage points.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| StableToolBench Average Pass Rate (MIRROR) | 85.7% | Next best method per-LM | up to +7.0 pp vs next best | StableToolBench (various splits) | Table 3 shows MIRROR average 85.7% on StableToolBench; per-LM gains described in Section 4.2 | Table 3; Section 4.2 |
| TravelPlanner Delivery Rate (MIRROR vs ReAct) | e.g., GPT-4o: 100% (MIRROR) vs 85.6% (ReAct) | ReAct | +14.4 pp (example with GPT-4o) | TravelPlanner (validation set) | Table 2 lists per-LLM delivery rates; authors report increases of 14.4–21.1 pp | Table 2; Section 4.2 |
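
The GPT-4o row can be checked by hand. The short calculation below (an illustrative sketch, with variable names chosen here) uses only the two rates quoted in the table and shows why the gain is better read in percentage points than as a relative percentage.

```python
mirror_rate = 100.0  # TravelPlanner delivery rate with MIRROR (GPT-4o), in %
react_rate = 85.6    # same metric with ReAct (GPT-4o), in %

# Absolute gain in percentage points (pp).
delta_pp = round(mirror_rate - react_rate, 1)

# Relative gain over the ReAct baseline, in %.
relative_gain = round(100 * (mirror_rate - react_rate) / react_rate, 1)

print(delta_pp, relative_gain)  # 14.4 16.8
```

The two figures differ because the relative gain is measured against the 85.6% baseline rather than against the full 100-point scale.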
What To Try In 7 Days
Add a short pre-execution self-check prompt inside each agent that scores planned actions and retries below a threshold.
Keep a short-term per-task memory of recent failures to adapt tool parameter choices quickly.
Run 3–5 rounds of lightweight post-execution reflection and measure token vs. success trade-offs.
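
One way to prototype the steps above is a small harness that retries a task for a fixed number of rounds, carries a short-term memory of failures between rounds, and logs tokens spent against success. The sketch is hypothetical throughout: `attempt_task`, the word-count token proxy, and the toy rule that the task succeeds once two failures are in memory are stand-ins for a real agent pass, not anything taken from the paper.

```python
from typing import List, Tuple


def attempt_task(task: str, failures: List[str]) -> Tuple[bool, str, int]:
    """Stand-in for one full agent pass. Returns (success, feedback, tokens_used).

    Toy rule (assumption): the pass succeeds once two earlier failures
    are available in short-term memory.
    """
    prompt = task + " | avoid: " + "; ".join(failures)
    tokens = len(prompt.split())  # crude proxy for token usage
    if len(failures) >= 2:
        return True, "ok", tokens
    return False, f"bad parameter choice #{len(failures) + 1}", tokens


def run_with_rounds(task: str, max_rounds: int) -> Tuple[bool, int]:
    """Retry up to max_rounds, feeding recent failures back into each attempt."""
    failures: List[str] = []  # short-term, per-task memory
    total_tokens = 0
    for _ in range(max_rounds):
        ok, feedback, tokens = attempt_task(task, failures)
        total_tokens += tokens
        if ok:
            return True, total_tokens
        failures.append(feedback)
    return False, total_tokens


# Sweep the round budget and compare success against tokens spent.
for rounds in (1, 3, 5):
    print(rounds, run_with_rounds("find hotels in Paris", rounds))
```

Swapping the stub for a real agent call and running the sweep over a held-out task set would give the token-versus-success curve the last step asks for.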
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Performance depends on the base LLM; gains are smaller with weaker base models.
Higher token use from dual reflection; cost-sensitive applications may need tuning.
When Not To Use
When token cost or latency is a strict constraint and you cannot afford multiple reflection rounds.
For single-step tasks with trivial tool calls where pre-check overhead outweighs benefit.
Failure Modes
Excessive reflection rounds can harm performance via redundancy and reasoning degradation.
Poor initial decomposition can cause repeated cycles unless inter-reflection corrects it.
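
Because extra rounds can hurt as well as help, one simple guard is to pick the round budget from a validation curve and stop once additional rounds no longer improve it. The helper below is a minimal sketch under that assumption; `pick_round_budget` and its `patience` parameter are hypothetical names, and the scores in the example are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def pick_round_budget(scores_by_round: Sequence[float], patience: int = 1) -> int:
    """Return the number of rounds with the best score seen before stopping.

    scores_by_round[i] is the validation score obtained with i + 1 rounds.
    The scan stops after more than `patience` consecutive non-improving rounds.
    """
    best_rounds, best_score = 1, scores_by_round[0]
    stale = 0
    for rounds, score in enumerate(scores_by_round[1:], start=2):
        if score > best_score:
            best_rounds, best_score, stale = rounds, score, 0
        else:
            stale += 1
            if stale > patience:
                break
    return best_rounds


# Placeholder curve (invented for illustration): improves, then degrades.
example_scores = [0.70, 0.78, 0.82, 0.81, 0.79, 0.80]
budget = pick_round_budget(example_scores)
print(budget)  # 3
```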

