Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
15
Why It Matters For Business
Interactive tool use and short user feedback materially change model success; measuring multi-turn behavior prevents wrong model choices and mispriced evaluation costs.
Summary TLDR
MINT is a reproducible benchmark that measures how well LLMs solve hard tasks when they can (1) call external tools by running Python code and (2) receive natural-language feedback simulated by GPT-4. The authors curate 586 challenging instances from 8 existing datasets (reasoning, code, decision-making) and evaluate 20 models (4 closed, 16 open). Key findings: each extra tool turn raises success rate by ~1–8% (absolute); GPT-4-simulated feedback yields ~2–17% (absolute) gains; better single-turn results do not guarantee better multi-turn improvement; supervised instruction-finetuning (SIFT) and RLHF often hurt multi-turn ability. Code and evaluation scripts are released.
Problem Statement
Most benchmarks test single-turn input→output. Real users interact with LLMs across multiple turns and can use tools or give language feedback. We lack a compact, reproducible benchmark that measures how models benefit from iterative tool use and from natural-language feedback.
Main Contribution
Introduce MINT, a reproducible multi-turn benchmark for tool-augmented task solving and language feedback.
Provide a Python-execution interface for tools and use GPT-4 to simulate human feedback, enabling scalable evaluation.
Curate a compact set of 586 challenging instances from 8 public datasets covering reasoning, coding, and decision-making.
Evaluate 20 LLMs (4 closed, 16 open) and quantify per-turn tool gains and gains from natural-language feedback.
Identify practical failure modes and data artifacts that harm multi-turn interaction (formatting, training-data artifacts).
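The interaction protocol described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `<execute>` and `<solution>` tag names, the `interact` and `run_python` helpers, and the `model` and `feedback_fn` callables are all hypothetical, and the real evaluation may limit solution attempts or sandbox code differently.

```python
import contextlib
import io
import re

# Tag names are assumptions for this sketch, not confirmed by the summary above.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return captured stdout, or the error text (no sandboxing)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # surface the error to the model as an observation
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def interact(model, task, answer, k=5, feedback_fn=None):
    """Return True if the model proposes the right solution within k turns."""
    history = [task]
    for _ in range(k):
        reply = model(history)  # hypothetical callable: history -> reply text
        history.append(reply)
        solution = SOLUTION_RE.search(reply)
        if solution:
            if solution.group(1).strip() == answer:
                return True
            observation = "Your answer is wrong."
        else:
            code = EXECUTE_RE.search(reply)
            if code:
                observation = run_python(code.group(1).strip())
            else:
                observation = "Could not parse reply."
        if feedback_fn is not None:
            # Optional natural-language feedback, e.g. from a stronger model.
            observation += "\n" + feedback_fn(history, observation)
        history.append(observation)
    return False
```

Passing `feedback_fn=None` gives the tool-only setting; supplying a callable that wraps a feedback model gives the feedback setting, so the same loop can be reused to compare the two.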
Key Findings
Tool interaction gives consistent, per-turn success gains.
Natural-language feedback (simulated by GPT-4) boosts success.
Single-turn accuracy does not predict multi-turn gains.
Instruction finetuning and RLHF sometimes harm multi-turn performance.
Human evaluators judge GPT-4's simulated feedback to be close to human-written feedback.
Results
Tool-augmented gain per extra turn (∆tools)
Natural-language feedback gain (∆feedback)
Open vs closed-source absolute SR with feedback
Human evaluation of GPT-4 feedback
Who Should Care
What To Try In 7 Days
Run MINT (k=5) on your candidate models to compare multi-turn SR, not just single-shot accuracy.
Simulate short natural-language feedback with a high-quality model (e.g., GPT-4) to estimate gains before paying human annotators.
Test base vs. SIFT vs. RLHF variants for your use case; alignment steps can degrade multi-turn tool use.
Agent Features
Memory
- short-term interaction history across turns (k ≤ 5)
Planning
- multi-turn tool-driven planning
Tool Use
- Python code execution as unified tool interface
- wiki search (for reasoning tasks)
- ALFWorld action API (decision-making)
Frameworks
- GPT-4 used to simulate natural-language feedback
Is Agentic
true
Architectures
- chat-style LLMs
- base LLMs
Collaboration
- user-LLM-tool loop with simulated human feedback
Optimization Features
Training Optimization
- analysis of SIFT and RLHF effects on multi-turn performance
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Feedback is simulated by GPT-4; it is close to humans in tests but may not cover all real-user behaviors or value judgments.
- MINT uses a compact subset (586 examples) for tractability; a broader set of instances may change model rankings.
- Metrics focus on success rate; they do not directly measure interaction quality, cost, or user satisfaction.
- Some evaluated models fail to follow output-format instructions, reducing measured performance in ways unrelated to core capability.
When Not To Use
- When you need full-scale human feedback across varied user populations for normative judgments.
- When your tooling set differs substantially from Python-executable tools or the ALFWorld APIs used here.
- If you need fine-grained assessment of conversational quality beyond task success.
Failure Modes
- Formatting failures: some models ignore the requested tags, producing unparsable outputs and lowering SR (Table A.7).
- Training-data artifacts: Vicuna models inject backslash-escaped underscores, causing syntax errors (Table A.9).
- SIFT / RLHF regression: instruction finetuning or RLHF sometimes reduces multi-turn gains.
- Overfitting to training tags: CodeLLaMA-Instruct outputs [PYTHON] tags regardless of the prompt, breaking the parser (Table A.10).
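The formatting and artifact failures listed above can be screened for before running a full evaluation. The sketch below is a rough pre-check, not the benchmark's parser: the `<execute>` tag name and the `diagnose` helper are assumptions made for illustration.

```python
import re

# Tag name is an assumption for this sketch; the summary does not spell it out.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def diagnose(reply: str) -> str:
    """Classify a model reply as 'ok', 'missing_tags', or 'syntax_error'."""
    match = EXECUTE_RE.search(reply)
    if match is None:
        # Covers replies that use other wrappers such as [PYTHON] ... [/PYTHON].
        return "missing_tags"
    try:
        # Compile only; the code is never executed here.
        compile(match.group(1).strip(), "<reply>", "exec")
    except SyntaxError:
        # A backslash-escaped underscore (my\_var) fails to compile.
        return "syntax_error"
    return "ok"
```

Counting these labels over a sample of replies helps separate format-following problems from genuine reasoning failures when a model's SR looks unexpectedly low.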
Core Entities
Models
- gpt-3.5-turbo-0613
- gpt-4-0613
- claude-2
- claude-instant-1
- chat-bison-001 (Bard)
- LLaMA-2 (7B,13B,70B)
- Vicuna-v1.5 (7B,13B)
- CodeLLaMA (7B,13B,34B)
- Lemur-v1-70B
Metrics
- Success Rate (SR)
- ∆tools (improvement per interaction turn)
- ∆feedback (improvement from natural-language feedback)
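These three metrics can be computed from per-instance outcomes. The sketch below is one plausible reading of the definitions above, not the authors' scripts: the function names are hypothetical, and `delta_tools` averages the turn-to-turn SR differences, whereas a least-squares slope over k would be an equally reasonable choice.

```python
def success_rate(outcomes):
    """Fraction of solved instances; outcomes is a non-empty list of booleans."""
    return sum(outcomes) / len(outcomes)


def delta_tools(sr_by_k):
    """Average absolute SR gain per extra turn, given SR at k = 1, 2, ..., K."""
    if len(sr_by_k) < 2:
        raise ValueError("need SR for at least two turn budgets")
    gains = [later - earlier for earlier, later in zip(sr_by_k, sr_by_k[1:])]
    return sum(gains) / len(gains)


def delta_feedback(sr_with_feedback, sr_without_feedback):
    """Absolute SR gain from adding natural-language feedback at a fixed k."""
    return sr_with_feedback - sr_without_feedback
```

All three values are absolute differences in SR, so a result of 0.05 corresponds to a gain of five percentage points.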
Datasets
- HumanEval
- MBPP
- ALFWorld
- GSM8K
- HotpotQA
- MATH
- MMLU
- TheoremQA
Benchmarks
- MINT
Context Entities
Models
- Vicuna (SIFT on ShareGPT)
- CodeLLaMA-Instruct (SIFT)

