MINT: a compact benchmark that tests LLMs on multi-turn tool use and natural-language feedback

September 19, 2023 · 9 min

Overview

Decision Snapshot: Needs Validation

The benchmark gives reproducible, measurable signals about multi-turn gains but uses a curated subset and simulated feedback, so results are strong for the evaluated settings but may not fully generalize to all tools and real users.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Interactive tool use and short user feedback materially change model success; measuring multi-turn behavior prevents wrong model choices and mispriced evaluation costs.

Who Should Care

Summary TLDR

MINT is a reproducible benchmark that measures how well LLMs solve hard tasks when they can (1) call external tools by running Python code and (2) receive natural-language feedback simulated by GPT-4. The authors curate 586 challenging instances from 8 existing datasets (reasoning, code, decision-making) and evaluate 20 models (4 closed, 16 open). Key findings: each extra tool turn raises success rate by ~1–8% (absolute); GPT-4-simulated feedback yields ~2–17% gains; better single-turn results do not guarantee better multi-turn improvement; supervised instruction-finetuning (SIFT) and RLHF often hurt multi-turn ability. Code and evaluation scripts are released.

Problem Statement

Most benchmarks test single-turn input→output. Real users interact with LLMs across multiple turns and can use tools or give language feedback. We lack a compact, reproducible benchmark that measures how models benefit from iterative tool use and from natural-language feedback.

Main Contribution

Introduce MINT, a reproducible multi-turn benchmark for tool-augmented task solving and language feedback.

Provide a Python-execution interface for tools and use GPT-4 to simulate human feedback, enabling scalable evaluation.
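The interaction loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `SOLUTION:` prefix, the function names `run_python` and `interact`, and the scripted stand-in model are all assumptions made for the example, and the `exec` call is not sandboxed.

```python
import contextlib
import io
from typing import Callable, List, Optional

Model = Callable[[List[str]], str]


def run_python(code: str) -> str:
    """Run code and return captured stdout, or the error text, as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # NOT sandboxed; a real harness must isolate execution
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def interact(model: Model, task: str, answer: str, k: int = 5,
             feedback: Optional[Model] = None) -> bool:
    """Give the model at most k turns; each turn is a tool call or a proposed solution."""
    history = [f"Task: {task}"]
    for _ in range(k):
        action = model(history)
        history.append(action)
        if action.startswith("SOLUTION:"):  # hypothetical answer convention
            if action[len("SOLUTION:"):].strip() == answer:
                return True
            observation = "Observation: the proposed answer is wrong."
        else:
            observation = "Observation: " + run_python(action)
        if feedback is not None:
            # Optional natural-language feedback (simulated by a second model).
            observation += "\nFeedback: " + feedback(history)
        history.append(observation)
    return False


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM: first call the tool, then report what it printed."""
    if len(history) == 1:
        return "print(17 * 23)"
    last_observation = history[-1].splitlines()[0]
    return "SOLUTION: " + last_observation.split("Observation: ")[-1]


solved_with_budget = interact(scripted_model, "What is 17 * 23?", "391", k=5)
solved_single_turn = interact(scripted_model, "What is 17 * 23?", "391", k=1)
```

With a budget of one turn the scripted model only has time to call the tool, so the episode counts as a failure; with more turns it can read the observation and answer. That difference is the kind of effect a multi-turn benchmark is meant to expose.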

Key Findings

Tool interaction gives consistent, per-turn success gains.

Numbers: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)

Practical Use: Allow models to call tools for multiple rounds; expect modest but steady success improvements per turn when designing agent loops.

Evidence Ref: Abstract; §3.2; Table 2

Natural-language feedback (simulated by GPT-4) boosts success.

Numbers: +2–17% absolute SR gain with GPT-4 feedback (k=5)

Practical Use: Collect or simulate short textual feedback to improve outcomes; even one helpful feedback turn can materially raise success rates.

Evidence Ref: Abstract; §3.3; Table 3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Tool-augmented gain per extra turn (∆tools) | 1–8% absolute SR gain per extra tool turn (micro-avg across tasks) | SR at k=1 (no interaction) | | micro-average across MINT (586 instances) | Abstract; §3.2; Table 2 | Table 2
Natural-language feedback gain (∆feedback) | 2–17% absolute SR gain with GPT-4 feedback (k=5) | SR with k=5 without feedback | | micro-average across MINT | Abstract; §3.3; Table 3 | Table 3
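Since the reported values are micro-averages, it may help to see how that differs from a per-task (macro) average. The sketch below assumes per-instance pass/fail outcomes grouped by task type; the task names and outcomes are made-up placeholders, not MINT results.

```python
from typing import Dict, List


def micro_average(outcomes_by_task: Dict[str, List[bool]]) -> float:
    """Pool every instance across tasks, then take the fraction solved."""
    total = sum(len(outcomes) for outcomes in outcomes_by_task.values())
    solved = sum(sum(outcomes) for outcomes in outcomes_by_task.values())
    return solved / total


def macro_average(outcomes_by_task: Dict[str, List[bool]]) -> float:
    """Average the per-task success rates, weighting each task equally."""
    rates = [sum(outcomes) / len(outcomes) for outcomes in outcomes_by_task.values()]
    return sum(rates) / len(rates)


# Placeholder outcomes for illustration only.
toy_outcomes = {
    "reasoning": [True, False, False, False],
    "code": [True, True],
}
micro_sr = micro_average(toy_outcomes)  # 3 solved out of 6 instances
macro_sr = macro_average(toy_outcomes)  # mean of 0.25 and 1.0
```

Micro-averaging lets task types with more instances carry more weight, so a model that is weak on a large task type is penalized more than under macro-averaging.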

What To Try In 7 Days

Run MINT (k=5) on your candidate models to compare multi-turn SR, not just single-shot accuracy.

Simulate short natural-language feedback with a high-quality model (e.g., GPT-4) to estimate gains before paying human annotators.

Test base vs. SIFT vs. RLHF variants for your use case; alignment steps can degrade multi-turn tool use.
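One way to act on these steps is to check whether your model ranking changes between single-turn and multi-turn success rates. The sketch below is only an illustration: the model names and SR values are invented placeholders, and `rank_changes` is a hypothetical helper, not part of the released scripts.

```python
from typing import Dict, List, Tuple


def rank(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def rank_changes(sr_k1: Dict[str, float],
                 sr_k5: Dict[str, float]) -> Dict[str, Tuple[int, int]]:
    """Map each model whose rank differs to (rank at k=1, rank at k=5), 1-based."""
    order_k1, order_k5 = rank(sr_k1), rank(sr_k5)
    changes = {}
    for name in order_k1:
        before, after = order_k1.index(name) + 1, order_k5.index(name) + 1
        if before != after:
            changes[name] = (before, after)
    return changes


# Invented placeholder numbers, NOT results from the paper.
sr_at_k1 = {"model-base": 20.0, "model-sift": 25.0, "model-rlhf": 22.0}
sr_at_k5 = {"model-base": 40.0, "model-sift": 33.0, "model-rlhf": 30.0}
changed = rank_changes(sr_at_k1, sr_at_k5)
```

If `changed` is non-empty for your candidates, a single-shot leaderboard would have led to a different model choice than the multi-turn one.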

Agent Features

Memory
short-term interaction history across turns (k ≤ 5)
Planning
multi-turn tool-driven planning
Tool Use
Python code execution as unified tool interface; wiki search (for reasoning tasks); ALFWorld action API (decision-making)
Frameworks
GPT-4 used to simulate natural-language feedback
Is Agentic

Yes

Architectures
chat-style LLMs; base LLMs
Collaboration
user-LLM-tool loop with simulated human feedback

Optimization Features

Training Optimization
analysis of SIFT and RLHF effects on multi-turn performance

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Feedback is simulated by GPT-4; it is close to humans in tests but may not cover all real-user behaviors or value judgments.

MINT uses a compact subset (586 examples) for tractability; broader populations may change model rankings.

When Not To Use

When you need full-scale human feedback across varied user populations for normative judgments.

When your tool set differs substantially from the Python-executable tools or the ALFWorld action API used here.

Failure Modes

Formatting failures: some models ignore requested tags causing unparsable outputs and lowered SR (Table A.7).

Training-data artifacts: Vicuna models inject backslash-escaped underscores causing syntax errors (Table A.9).
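Both failure modes can be reproduced with a small check. The sketch below assumes tool calls are wrapped in an `<execute>` tag (the tag name is an assumption for illustration) and uses Python's built-in `compile` to show why a backslash-escaped underscore breaks otherwise valid code.

```python
import re
from typing import Optional

# Assumed tag convention; adjust to whatever format your prompt requests.
EXECUTE_TAG = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the code inside the first execute tag, or None if the tag is missing."""
    match = EXECUTE_TAG.search(response)
    return match.group(1).strip() if match else None


def is_valid_python(code: str) -> bool:
    """Check whether the code parses, without running it."""
    try:
        compile(code, "<model-output>", "exec")
    except SyntaxError:
        return False
    return True


tagged = extract_code("I will compute it.\n<execute>print(1 + 1)</execute>")
untagged = extract_code("print(1 + 1)")            # model ignored the tag format
clean_ok = is_valid_python("total_sum = 1")
escaped_ok = is_valid_python("total\\_sum = 1")    # the text `total\_sum = 1`
```

An untagged response yields nothing to execute, and the escaped variant fails to parse, so either problem turns a potentially correct attempt into a wasted turn.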

Core Entities

Models

gpt-3.5-turbo-0613; gpt-4-0613; claude-2; claude-instant-1; chat-bison-001 (Bard); LLaMA-2 (7B, 13B, 70B); Vicuna-v1.5 (7B, 13B); CodeLLaMA (7B, 13B, 34B); Lemur-v1-70B

Metrics

Success Rate (SR); ∆tools (improvement per interaction turn); ∆feedback (improvement from natural-language feedback)
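These metrics could be computed along the following lines. The sketch assumes ∆tools is estimated as the least-squares slope of SR against the turn budget k, and ∆feedback as the difference in SR at the same k with and without feedback; both estimators are assumptions for illustration, and the SR values are placeholders rather than reported results.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Percentage of instances solved."""
    return 100.0 * sum(outcomes) / len(outcomes)


def least_squares_slope(xs: List[float], ys: List[float]) -> float:
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def delta_tools(sr_by_k: Dict[int, float]) -> float:
    """Estimated absolute SR gain per extra interaction turn."""
    ks = sorted(sr_by_k)
    return least_squares_slope([float(k) for k in ks], [sr_by_k[k] for k in ks])


def delta_feedback(sr_with_feedback: float, sr_without_feedback: float) -> float:
    """Absolute SR gain from adding natural-language feedback at the same k."""
    return sr_with_feedback - sr_without_feedback


# Placeholder SR values (in %) for k = 1..5, for illustration only.
toy_sr_by_k = {1: 10.0, 2: 14.0, 3: 18.0, 4: 22.0, 5: 26.0}
per_turn_gain = delta_tools(toy_sr_by_k)
feedback_gain = delta_feedback(30.0, toy_sr_by_k[5])
```

Using a slope over all k, rather than the difference between the first and last budget, makes the per-turn estimate less sensitive to noise at any single k.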

Datasets

HumanEval, MBPP, ALFWorld, GSM8K, HotpotQA, MATH, MMLU, TheoremQA

Benchmarks

MINT

Context Entities

Models

Vicuna (SIFT on ShareGPT); CodeLLaMA-Instruct (SIFT)