MINT: a compact benchmark that tests LLMs on multi-turn tool use and natural-language feedback

Overview

Decision Snapshot: Needs Validation

The benchmark gives reproducible, measurable signals about multi-turn gains but uses a curated subset and simulated feedback, so results are strong for the evaluated settings but may not fully generalize to all tools and real users.

Citations: 15

Evidence Strength: 0.80

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 1/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Interactive tool use and short user feedback materially change model success; measuring multi-turn behavior prevents wrong model choices and mispriced evaluation costs.

Who Should Care

Product Manager, ML Engineer, CTO, Engineering Lead, Data Scientist

Summary TLDR

MINT is a reproducible benchmark that measures how well LLMs solve hard tasks when they can (1) call external tools by running Python code and (2) receive natural-language feedback simulated by GPT-4. The authors curate 586 challenging instances from 8 existing datasets (reasoning, code, decision-making) and evaluate 20 models (4 closed, 16 open). Key findings: each extra tool turn raises success rate (SR) by about 1–8% absolute; GPT-4-simulated feedback adds about 2–17% absolute; better single-turn results do not guarantee better multi-turn performance; and supervised instruction-finetuning (SIFT) and RLHF often hurt multi-turn ability. Code and evaluation scripts are released.

Problem Statement

Most benchmarks test single-turn input→output. Real users interact with LLMs across multiple turns and can use tools or give language feedback. We lack a compact, reproducible benchmark that measures how models benefit from iterative tool use and from natural-language feedback.

Main Contribution

Introduce MINT, a reproducible multi-turn benchmark for tool-augmented task solving and language feedback.

Provide a Python-execution interface for tools and use GPT-4 to simulate human feedback, enabling scalable evaluation.
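The interaction described above can be pictured as a small loop. The following is a minimal sketch, not the MINT implementation: the `<execute>` and `<solution>` tag names, the function names `run_python` and `solve`, and the scripted stand-in model are all assumptions made for illustration, and each model reply is counted as one turn of the budget k.

```python
import contextlib
import io
import re
from typing import Callable, List, Optional

# Assumed tag names; the summary above does not specify the exact markup.
EXECUTE_TAG = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_TAG = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return its stdout, or the error message if it raises.

    No sandboxing is done here; a real harness should isolate execution.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return buffer.getvalue() + f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


def solve(
    task: str,
    model: Callable[[List[str]], str],
    check: Callable[[str], bool],
    k: int = 5,
    feedback: Optional[Callable[[List[str], str], str]] = None,
) -> bool:
    """Give the model up to k turns; return True once a proposed solution passes."""
    history = [task]
    for _ in range(k):
        reply = model(history)
        history.append(reply)
        solution = SOLUTION_TAG.search(reply)
        action = EXECUTE_TAG.search(reply)
        if solution:
            if check(solution.group(1).strip()):
                return True
            observation = "Your answer is wrong."
        elif action:
            observation = run_python(action.group(1))
        else:
            observation = "No <execute> or <solution> tag found."
        if feedback is not None:
            # Hypothetical hook for a feedback provider (GPT-4 in the paper).
            observation += "\n" + feedback(history, observation)
        history.append(observation)
    return False


def scripted_model(history: List[str]) -> str:
    """Hypothetical stand-in for an LLM: compute first, then answer."""
    if len(history) == 1:
        return "<execute>print(6 * 7)</execute>"
    return f"<solution>{history[-1].strip()}</solution>"


solved_with_two_turns = solve("What is 6 * 7?", scripted_model, lambda a: a == "42", k=2)
solved_with_one_turn = solve("What is 6 * 7?", scripted_model, lambda a: a == "42", k=1)
```

With this stand-in model the task is solved at k=2 but not at k=1, which is the kind of per-turn difference the benchmark is designed to expose.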

Key Findings

Tool interaction gives consistent, per-turn success gains.

Numbers: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)

Practical Use: Allow models to call tools for multiple rounds; expect modest but steady success improvements per turn when designing agent loops.

Evidence Ref: Abstract; §3.2; Table 2
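To make the per-turn figure concrete, the sketch below computes SR at each turn budget and an average gain per extra turn. It assumes each instance is logged with the turn at which it was first solved (`None` if never solved); that log format, the function names, and the toy data are illustrative assumptions. The paper may estimate the per-turn gain differently (for example with a fitted slope), so treat this as one simple reading of "gain per extra turn".

```python
from typing import List, Optional, Sequence


def success_rate_at(solved_at: Sequence[Optional[int]], k: int) -> float:
    """Fraction of instances solved within a budget of k turns."""
    solved = sum(1 for turn in solved_at if turn is not None and turn <= k)
    return solved / len(solved_at)


def success_curve(solved_at: Sequence[Optional[int]], max_k: int = 5) -> List[float]:
    """SR for every budget k = 1..max_k."""
    return [success_rate_at(solved_at, k) for k in range(1, max_k + 1)]


def mean_gain_per_turn(solved_at: Sequence[Optional[int]], max_k: int = 5) -> float:
    """Average absolute SR gain per extra turn between k=1 and k=max_k."""
    curve = success_curve(solved_at, max_k)
    return (curve[-1] - curve[0]) / (max_k - 1)


# Toy log, invented for illustration: turn of first success per instance.
toy_log = [1, 2, None, 3, None, 1, 5, None, 2, None]
curve = success_curve(toy_log)
gain = mean_gain_per_turn(toy_log)
print(curve)
print(round(gain, 3))
```

On the toy log, SR rises from 0.2 at k=1 to 0.6 at k=5, an average of about 0.10 per extra turn.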

Natural-language feedback (simulated by GPT-4) boosts success.

Numbers: +2–17% absolute SR gain with GPT-4 feedback (k=5)

Practical Use: Collect or simulate short textual feedback to improve outcomes; even one helpful feedback turn can materially raise success rates.

Evidence Ref: Abstract; §3.3; Table 3
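The feedback gain is a difference between two runs on the same instances at the same turn budget. A minimal sketch of that arithmetic follows; the paired-outcome format, the function name `feedback_gain`, and the toy outcomes are assumptions for illustration, not MINT data.

```python
from typing import Dict, Sequence


def feedback_gain(without_fb: Sequence[bool], with_fb: Sequence[bool]) -> Dict[str, float]:
    """Compare per-instance outcomes of two runs at the same k.

    Returns both success rates, their absolute difference, and how many
    instances feedback fixed or broke.
    """
    if len(without_fb) != len(with_fb):
        raise ValueError("both runs must cover the same instances")
    n = len(without_fb)
    sr_without = sum(without_fb) / n
    sr_with = sum(with_fb) / n
    fixed = sum(1 for a, b in zip(without_fb, with_fb) if not a and b)
    broken = sum(1 for a, b in zip(without_fb, with_fb) if a and not b)
    return {
        "sr_without": sr_without,
        "sr_with": sr_with,
        "delta_feedback": sr_with - sr_without,
        "fixed": fixed,
        "broken": broken,
    }


# Toy outcomes, invented for illustration (True = solved within k=5).
no_feedback = [True, False, False, True, False, False, True, False]
gpt4_feedback = [True, True, False, True, True, False, False, False]
report = feedback_gain(no_feedback, gpt4_feedback)
print(report)
```

Tracking "fixed" and "broken" separately is a design choice: a small net gain can hide feedback that helps some instances while derailing others.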

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Tool-augmented gain per extra turn (∆tools)	1–8% absolute SR gain per extra tool turn (micro-avg across tasks)	SR at k=1 (no interaction)	—	micro-average across MINT (586 instances)	Abstract; §3.2; Table 2	Table 2
Natural-language feedback gain (∆feedback)	2–17% absolute SR gain with GPT-4 feedback (k=5)	SR with k=5 without feedback	—	micro-average across MINT	Abstract; §3.3; Table 3	Table 3
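Both rows report a micro-average, which pools all instances before dividing rather than averaging per-task rates. The sketch below contrasts the two; the per-task counts are invented placeholders and do not reflect MINT's actual task split or results.

```python
from typing import Dict, Tuple


def micro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Pool all instances: total solved divided by total attempted."""
    solved = sum(s for s, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return solved / total


def macro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Average the per-task success rates, weighting every task equally."""
    rates = [s / t for s, t in counts.values()]
    return sum(rates) / len(rates)


# Placeholder (solved, total) counts per task type, invented for illustration.
toy_counts = {
    "reasoning": (30, 100),
    "code": (20, 40),
    "decision_making": (45, 60),
}
micro = micro_average(toy_counts)
macro = macro_average(toy_counts)
print(round(micro, 3), round(macro, 3))
```

With uneven task sizes the two averages diverge, so comparisons against the reported figures should use the same pooling.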

What To Try In 7 Days

Run MINT (k=5) on your candidate models to compare multi-turn SR, not just single-shot accuracy.

Simulate short natural-language feedback with a high-quality model (e.g., GPT-4) to estimate gains before paying human annotators.

Test base vs. SIFT vs. RLHF variants for your use case; alignment steps can degrade multi-turn tool use.

Agent Features

Memory

short-term interaction history across turns (k ≤ 5)

Planning

multi-turn tool-driven planning

Tool Use

Python code execution as unified tool interface

wiki search (for reasoning tasks)

ALFWorld action API (decision-making)

Frameworks

GPT-4 used to simulate natural-language feedback

Is Agentic

Yes

Architectures

chat-style LLMs

base LLMs

Collaboration

user-LLM-tool loop with simulated human feedback

Optimization Features

Training Optimization

analysis of SIFT and RLHF effects on multi-turn performance

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://xingyaoww.github.io/mint-bench

Data URLs

https://xingyaoww.github.io/mint-bench (curation and scripts)

Risks & Boundaries

Limitations

Feedback is simulated by GPT-4; the authors' tests found it close to human feedback, but it may not cover all real-user behaviors or value judgments.

MINT uses a compact subset (586 examples) for tractability; broader populations may change model rankings.

When Not To Use

When you need full-scale human feedback across varied user populations for normative judgments.

When your tool set differs substantially from the Python-executable tools or the ALFWorld action API used here.

Failure Modes

Formatting failures: some models ignore the requested tags, producing unparsable outputs and lowering SR (Table A.7).

Training-data artifacts: Vicuna models inject backslash-escaped underscores, causing syntax errors (Table A.9).
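Both failure modes can be flagged before a reply is executed. The sketch below is a hypothetical triage helper, not part of MINT: the `<execute>` tag name, the `diagnose` function, and its labels are assumptions. It relies on Python's own parser, which rejects a backslash-escaped underscore in an identifier.

```python
import ast
import re

# Assumed tag name; the summary above does not specify the exact markup.
EXECUTE_TAG = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def diagnose(reply: str) -> str:
    """Classify a model reply before running its code.

    Returns one of: "format_error" (no tagged code found),
    "escaped_underscore" (unparsable code containing a backslash-escaped
    underscore), "syntax_error" (any other parse failure), or "ok".
    """
    match = EXECUTE_TAG.search(reply)
    if match is None:
        return "format_error"
    code = match.group(1)
    try:
        ast.parse(code)
    except SyntaxError:
        if r"\_" in code:
            return "escaped_underscore"
        return "syntax_error"
    return "ok"


examples = [
    "I think the answer is 4.",
    "<execute>total\\_sum = 1 + 2</execute>",
    "<execute>print(1 +</execute>",
    "<execute>print(1 + 2)</execute>",
]
labels = [diagnose(reply) for reply in examples]
print(labels)
```

Counting these labels per model separates harness-level failures from genuine reasoning errors when interpreting a low SR.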

Core Entities

Models

gpt-3.5-turbo-0613

gpt-4-0613

claude-2

claude-instant-1

chat-bison-001 (Bard)

LLaMA-2 (7B, 13B, 70B)

Vicuna-v1.5 (7B, 13B)

CodeLLaMA (7B, 13B, 34B)

Lemur-v1-70B

Metrics

Success Rate (SR)

∆tools (improvement per interaction turn)

∆feedback (improvement from natural-language feedback)

Datasets

HumanEval

MBPP

ALFWorld

GSM8K

HotpotQA

MATH

MMLU

TheoremQA

Benchmarks

MINT

Context Entities

Models

Vicuna (SIFT on ShareGPT)

CodeLLaMA-Instruct (SIFT)

