Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
MIRROR lowers multi-step tool-call failures by combining pre-action checks and post-action learning, improving automation reliability without retraining core models.
Summary TLDR
MIRROR is a three-agent framework (Planner, Tool, Answer) that pairs a pre-action self-check (intra-reflection) inside each agent with post-action learning across rounds (inter-reflection). On tool-use benchmarks, MIRROR raises pass/delivery rates substantially vs. existing reflection or single-agent baselines. Key practical takeaways: add short, score-driven self-evaluations before each tool call, keep short-term and task-specific long-term memories, and tune the number of inter-reflection rounds to balance tokens and accuracy.
Problem Statement
LLM-based agents still make preventable errors when coordinating tools across multi-step tasks. Existing reflection methods learn only after an action has been executed, which wastes tokens and lets errors propagate. The paper asks: can agents anticipate bad outcomes before acting, and then combine pre-action checks with post-action learning to reduce errors?
Main Contribution
Introduce intra-reflection: a lightweight, prompt-based self-evaluation that runs inside each agent before executing or handing off an action.
Design MIRROR: a three-agent pipeline (Planner/Tool/Answer) that pairs intra-reflection with inter-reflection via short-term and task-specific long-term memory.
Evaluate extensively on StableToolBench and TravelPlanner, showing consistent gains over ReAct, Reflexion, DFSDT, Smurfs, and SFT baselines.
Key Findings
MIRROR achieves high pass rates on StableToolBench.
Removing all intra-reflection reduces performance by 7.0 percentage points.
Inter-reflection rounds trade off tokens against accuracy; five rounds performed best in the paper's tests.
MIRROR outperforms supervised fine-tuned tool models on tool benchmarks.
MIRROR increases TravelPlanner delivery and constraint pass rates vs. ReAct.
Results
StableToolBench Average Pass Rate (MIRROR)
TravelPlanner Delivery Rate (MIRROR vs ReAct)
Ablation: remove all intra-reflection
Accuracy
Who Should Care
What To Try In 7 Days
Add a short pre-execution self-check prompt inside each agent that scores the planned action and retries when the score falls below a threshold.
Keep a short-term per-task memory of recent failures to adapt tool parameter choices quickly.
Run 3–5 rounds of lightweight post-execution reflection and measure token vs. success trade-offs.
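The first item above, a score-gated self-check with retries, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name gated_action, the 0.7 threshold, the retry cap, and the two demo callables are all assumptions standing in for real LLM prompts.

```python
from typing import Callable, Optional, Tuple

def gated_action(
    propose: Callable[[Optional[str]], str],
    score: Callable[[str], Tuple[float, str]],
    threshold: float = 0.7,   # assumed value; tune per agent
    max_retries: int = 3,     # cap so a miscalibrated scorer cannot loop forever
) -> Tuple[str, float, int]:
    """Propose an action, self-score it, and retry with feedback while below threshold.

    Returns (action, score, attempts). If no candidate clears the threshold,
    the best-scoring candidate seen so far is returned.
    """
    feedback: Optional[str] = None
    best_action, best_score = "", float("-inf")
    for attempt in range(1, max_retries + 1):
        action = propose(feedback)
        value, feedback = score(action)
        if value > best_score:
            best_action, best_score = action, value
        if value >= threshold:
            return action, value, attempt
    return best_action, best_score, max_retries

# Hypothetical stand-ins for the "propose" and "self-evaluate" prompts.
def demo_propose(feedback: Optional[str]) -> str:
    return "search_flights()" if feedback is None else "search_flights(city='Paris')"

def demo_score(action: str) -> Tuple[float, str]:
    return (0.9, "") if "city=" in action else (0.2, "missing city parameter")

result = gated_action(demo_propose, demo_score)
```

Capping retries and falling back to the best candidate is one way to guard against the looping failure mode that a miscalibrated threshold could otherwise cause.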
Agent Features
Memory
- Short-Term Memory (STM) for per-subtask failures
- Long-Term Memory (LTM) for task-specific trajectory storage
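One possible shape for the two memories listed above is sketched here. The class name DualMemory, its method names, and the bounded short-term window are assumptions made for illustration; the summary does not specify how the paper stores either memory.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List

@dataclass
class DualMemory:
    """Hypothetical container: per-subtask failure notes plus per-task trajectories."""
    stm_size: int = 5  # assumed window; only the most recent failures are kept
    short_term: Dict[str, Deque[str]] = field(default_factory=dict)
    long_term: Dict[str, List[List[str]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record_failure(self, subtask: str, note: str) -> None:
        # Oldest notes fall off automatically once the deque is full.
        window = self.short_term.setdefault(subtask, deque(maxlen=self.stm_size))
        window.append(note)

    def recent_failures(self, subtask: str) -> List[str]:
        return list(self.short_term.get(subtask, ()))

    def store_trajectory(self, task_id: str, trajectory: List[str]) -> None:
        # Keyed by task, so lessons are not shared across different tasks.
        self.long_term[task_id].append(list(trajectory))

    def trajectories(self, task_id: str) -> List[List[str]]:
        return self.long_term.get(task_id, [])

mem = DualMemory(stm_size=2)
for note in ["bad date format", "missing city", "wrong currency"]:
    mem.record_failure("book_flight", note)
mem.store_trajectory("task-1", ["plan", "call_tool", "answer"])
```

Keying the long-term store by task mirrors the "task-specific" design and also makes its cross-task limitation visible: a lookup for a different task returns nothing.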
Planning
- task decomposition
- topological ordering of subtasks
- Planner intra-reflection score gating
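Topological ordering of subtasks can be illustrated with Python's standard-library graphlib module (Python 3.9+). The subtask names and dependencies below are invented for the example and do not come from the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

def order_subtasks(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return subtasks so that every subtask appears after its prerequisites.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    graphlib raises CycleError if the decomposition contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical travel-planning decomposition.
subtasks = {
    "find_flights": set(),
    "find_hotels": set(),
    "build_itinerary": {"find_flights", "find_hotels"},
    "estimate_budget": {"build_itinerary"},
}
order = order_subtasks(subtasks)
```

A cyclic decomposition surfaces immediately as an exception, which gives the Planner a concrete signal to re-plan before any tool is called.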
Tool Use
- tool selection
- parameter selection
- prompt-based Tool Agent
- function-calling mode evaluated
Frameworks
- MIRROR
Is Agentic
true
Architectures
- multi-agent (Planner, Tool, Answer)
- dual-memory (short-term and task-specific long-term)
Collaboration
- reflection-gated handoffs between agents
- iterative inter-reflection rounds
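The iterative inter-reflection rounds above amount to an outer retry loop that carries lessons from one full attempt into the next. The sketch below assumes two hypothetical callables, attempt (one complete Planner/Tool/Answer pass) and reflect (a post-execution reflection prompt); the default of five rounds follows the best-performing setting reported in the findings.

```python
from typing import Callable, List, Tuple

def run_with_inter_reflection(
    attempt: Callable[[List[str]], Tuple[bool, str]],
    reflect: Callable[[str], str],
    max_rounds: int = 5,
) -> Tuple[bool, int, List[str]]:
    """Retry a whole task, feeding accumulated lessons into each new round.

    Returns (succeeded, rounds_used, lessons).
    """
    lessons: List[str] = []
    for round_no in range(1, max_rounds + 1):
        success, trajectory = attempt(lessons)
        if success:
            return True, round_no, lessons
        # Post-action learning: turn the failed trajectory into a reusable lesson.
        lessons.append(reflect(trajectory))
    return False, max_rounds, lessons

# Hypothetical stand-ins: the task succeeds once two lessons have been learned.
def demo_attempt(lessons: List[str]) -> Tuple[bool, str]:
    return len(lessons) >= 2, f"trajectory-{len(lessons) + 1}"

def demo_reflect(trajectory: str) -> str:
    return f"lesson from {trajectory}"

outcome = run_with_inter_reflection(demo_attempt, demo_reflect)
```

Because every extra round costs tokens, max_rounds is the natural knob to tune when measuring the accuracy versus cost trade-off.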
Optimization Features
Token Efficiency
- Claims a better tokens-per-pass trade-off than DFSDT and Smurfs
- Reports token counts per inter-reflection round (12.8k–17.2k)
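A tokens-per-pass figure can be computed from logged runs as total tokens spent divided by the number of passing tasks. The helper below is an assumed definition for illustration only, and the sample numbers are made up; neither is taken from the paper.

```python
from typing import List, Tuple

def tokens_per_pass(runs: List[Tuple[int, bool]]) -> float:
    """Total tokens spent across all runs divided by the number of passing runs.

    Each run is (tokens_used, passed). Failed runs still count toward cost.
    Returns infinity when nothing passed.
    """
    total_tokens = sum(tokens for tokens, _ in runs)
    passes = sum(1 for _, passed in runs if passed)
    if passes == 0:
        return float("inf")
    return total_tokens / passes

# Made-up log entries, purely illustrative.
sample_runs = [(1000, True), (2000, False), (3000, True)]
cost = tokens_per_pass(sample_runs)
```

Charging failed runs to the numerator keeps the metric honest: a configuration that burns tokens on tasks it never solves looks as expensive as it really is.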
Reproducibility
Data Urls
- StableToolBench (paper)
- TravelPlanner (paper)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Performance depends on the base LLM; gains are smaller with weaker base models.
- Higher token use from dual reflection; cost-sensitive applications may need tuning.
- Task-specific long-term memory limits cross-task generalization.
- Experiments used a limited set of LLM cores and validation splits.
When Not To Use
- When token cost or latency is a strict constraint and you cannot afford multiple reflection rounds.
- For single-step tasks with trivial tool calls where pre-check overhead outweighs benefit.
- When the base LLM lacks basic tool-handling capability; MIRROR cannot fully compensate.
Failure Modes
- Excessive reflection rounds can harm performance via redundancy and reasoning degradation.
- Poor initial decomposition can cause repeated cycles unless inter-reflection corrects it.
- If intra-reflection criteria are miscalibrated, agents may loop or reject valid actions.
Core Entities
Models
- gpt-3.5-turbo
- gpt-4o-mini
- gpt-4o
- gpt-4-turbo
- Claude 3 Haiku
- Qwen2.5-72B
Metrics
- Pass Rate
- Win Rate
- Delivery Rate
- Commonsense Constraint Pass Rate
- Hard Constraint Pass Rate
- Final Pass Rate
Datasets
- StableToolBench
- TravelPlanner
Benchmarks
- StableToolBench
- TravelPlanner
- BFCL-v1 (external mention)

