MINT: a compact benchmark that tests LLMs on multi-turn tool use and natural-language feedback

September 19, 2023 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

15

Authors

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji

Why It Matters For Business

Interactive tool use and short user feedback materially change model success rates. Measuring multi-turn behavior helps avoid choosing the wrong model and misjudging evaluation costs.

Summary TLDR

MINT is a reproducible benchmark that measures how well LLMs solve hard tasks when they can (1) call external tools by running Python code and (2) receive natural-language feedback simulated by GPT-4. The authors curate 586 challenging instances from 8 existing datasets (reasoning, code, decision-making) and evaluate 20 models (4 closed, 16 open). Key findings: each extra tool turn raises success rate (SR) by ~1–8% absolute; GPT-4-simulated feedback adds ~2–17% absolute; better single-turn results do not guarantee better multi-turn improvement; and supervised instruction-finetuning (SIFT) and RLHF often hurt multi-turn ability. Code and evaluation scripts are released.

Problem Statement

Most benchmarks test single-turn input→output. Real users interact with LLMs across multiple turns and can use tools or give language feedback. We lack a compact, reproducible benchmark that measures how models benefit from iterative tool use and from natural-language feedback.

Main Contribution

Introduce MINT, a reproducible multi-turn benchmark for tool-augmented task solving and language feedback.

Provide a Python-execution interface for tools and use GPT-4 to simulate human feedback, enabling scalable evaluation.

Curate a compact set of 586 challenging instances from 8 public datasets covering reasoning, coding, and decision-making.

Evaluate 20 LLMs (4 closed, 16 open) and quantify per-turn tool gains and gains from natural-language feedback.

Identify practical failure modes and data artifacts that harm multi-turn interaction (formatting, training-data artifacts).
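The user-LLM-tool loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `<execute>` and `<solution>` tag names, the function names, and the scripted stand-in model are all assumptions made for this example, and the `feedback` callable merely stands in for the GPT-4 feedback simulator. A real evaluation would also sandbox code execution.

```python
import contextlib
import io
import re
from typing import Callable, List, Optional

# Hypothetical tag names, chosen only for this sketch.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return captured stdout, or the error as text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # no sandboxing here: illustration only
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def interact(
    model: Callable[[List[str]], str],
    check: Callable[[str], bool],
    feedback: Optional[Callable[[List[str], str], str]] = None,
    k: int = 5,
) -> bool:
    """Give the model up to k turns; return True if it proposes a correct solution."""
    history: List[str] = []
    for _ in range(k):
        reply = model(history)
        history.append(reply)
        solution = SOLUTION_RE.search(reply)
        if solution:
            if check(solution.group(1).strip()):
                return True
            observation = "Your answer is wrong."
        else:
            action = EXECUTE_RE.search(reply)
            observation = (
                run_python(action.group(1).strip())
                if action
                else "No valid action found."
            )
        if feedback is not None:
            # Optional natural-language feedback appended to the observation.
            observation += "\n" + feedback(history, observation)
        history.append(observation)
    return False


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM: runs code first, then submits the tool output."""
    if not history:
        return "<execute>print(6 * 7)</execute>"
    return f"<solution>{history[-1]}</solution>"


solved_with_tools = interact(scripted_model, lambda ans: ans == "42", k=5)
solved_single_turn = interact(scripted_model, lambda ans: ans == "42", k=1)
```

With this scripted model the task is solved only when a second turn is available, which is the kind of difference between k=1 and k=5 that the benchmark is designed to expose.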

Key Findings

Tool interaction gives consistent, per-turn success gains.

Numbers: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)

Natural-language feedback (simulated by GPT-4) boosts success.

Numbers: +2–17% absolute SR gain with GPT-4 feedback (k=5)

Single-turn accuracy does not predict multi-turn gains.

Numbers: claude-instant-1 reaches SR 45.9% at k=5 vs. 39.9% for claude-2; claude-instant-1 overtakes claude-2 as k increases

Instruction finetuning and RLHF sometimes harm multi-turn performance.

Numbers: SIFT hurt CodeLlama-34B by 11.1% (no feedback) and 15.4% (with feedback); RLHF hurt LLaMA-2-70B by ~8.5%–8.7%

GPT-4 simulated feedback is close to human feedback by human judgment.

Numbers: 91.2% of GPT-4 feedback judged at least as helpful as human-written feedback; humans found GPT-4 feedback human-like in ~92% of cases

Results

Tool-augmented gain per extra turn (∆tools)

Value: 1–8% absolute SR gain per extra tool turn (micro-avg across tasks)

Baseline: SR at k=1 (no interaction)

Natural-language feedback gain (∆feedback)

Value: 2–17% absolute SR gain with GPT-4 feedback (k=5)

Baseline: SR at k=5 without feedback

Open vs closed-source absolute SR with feedback

Value: best open-source SR with feedback at k=5 is 37.0% (Lemur-70B-SIFT) vs best closed-source 45.9% (claude-instant-1)

Baseline: closed-source top model

Human evaluation of GPT-4 feedback

Value: 91.2% judged GPT-4 feedback at least as helpful as human-written feedback; 92% judged GPT-4 feedback human-like or indistinguishable

Baseline: human-written feedback

Who Should Care

What To Try In 7 Days

Run MINT (k=5) on your candidate models to compare multi-turn SR, not just single-shot accuracy.

Simulate short natural-language feedback with a high-quality model (e.g., GPT-4) to estimate gains before paying human annotators.

Test base vs. SIFT vs. RLHF variants for your use case; alignment steps can degrade multi-turn tool use.
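To make the first step concrete, the sketch below ranks candidate models by SR at k=1 and again at k=5 and shows how the order can differ. The model names and SR values are made up for illustration and are not results from the benchmark; `rank` is a hypothetical helper.

```python
from typing import Dict, List

# Made-up SR values in percent, keyed by number of turns k (illustration only).
sr_by_model: Dict[str, Dict[int, float]] = {
    "model-a": {1: 30.0, 5: 38.0},
    "model-b": {1: 26.0, 5: 44.0},
    "model-c": {1: 20.0, 5: 29.0},
}


def rank(sr: Dict[str, Dict[int, float]], k: int) -> List[str]:
    """Return model names sorted by SR at k turns, best first."""
    return sorted(sr, key=lambda name: sr[name][k], reverse=True)


single_turn_ranking = rank(sr_by_model, 1)
multi_turn_ranking = rank(sr_by_model, 5)
ranking_changed = single_turn_ranking != multi_turn_ranking
```

If `ranking_changed` is true for your own candidates, choosing a model on single-shot accuracy alone would pick a different winner than a multi-turn evaluation.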

Agent Features

Memory

  • short-term interaction history across turns (k ≤ 5)

Planning

  • multi-turn tool-driven planning

Tool Use

  • Python code execution as unified tool interface
  • wiki search (for reasoning tasks)
  • ALFWorld action API (decision-making)

Frameworks

  • GPT-4 used to simulate natural-language feedback

Is Agentic

true

Architectures

  • chat-style LLMs
  • base LLMs

Collaboration

  • user-LLM-tool loop with simulated human feedback

Optimization Features

Training Optimization

  • analysis of SIFT and RLHF effects on multi-turn performance

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Feedback is simulated by GPT-4; it is close to humans in tests but may not cover all real-user behaviors or value judgments.
  • MINT uses a compact subset (586 examples) for tractability; broader populations may change model rankings.
  • Metrics focus on success rate; they do not directly measure interaction quality, cost, or user satisfaction.
  • Some evaluated models fail to follow output-format instructions, reducing measured performance in ways unrelated to core capability.

When Not To Use

  • When you need full-scale human feedback across varied user populations for normative judgments.
  • When your tool set differs substantially from the Python-executable tools or ALFWorld APIs used here.
  • If you need fine-grained assessment of conversational quality beyond task success.

Failure Modes

  • Formatting failures: some models ignore the requested tags, producing unparsable outputs and lowering SR (Table A.7).
  • Training-data artifacts: Vicuna models inject backslash-escaped underscores, which cause syntax errors (Table A.9).
  • SIFT / RLHF regression: instruction finetuning or RLHF sometimes reduces multi-turn gains.
  • Overfitting to training tags: CodeLLaMA-Instruct outputs [PYTHON] tags regardless of the prompt, which breaks the parser (Table A.10).
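The formatting-related failure modes above are easy to reproduce with a toy parser. The sketch below is an assumption-laden illustration rather than the benchmark's parser: the `<execute>` tag name, the `[/PYTHON]` closing tag, and the helper names are hypothetical. It shows how a reply wrapped in the wrong tags yields no action at all, and how a backslash-escaped underscore turns otherwise correct code into a syntax error.

```python
import re
from typing import Optional

# Hypothetical action tag, chosen only for this sketch.
EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def extract_code(reply: str) -> Optional[str]:
    """Return the code inside the expected tags, or None if the tags are missing."""
    match = EXECUTE_RE.search(reply)
    return match.group(1).strip() if match else None


def is_valid_python(code: str) -> bool:
    """Check whether the code compiles, without running it."""
    try:
        compile(code, "<model-output>", "exec")
    except SyntaxError:
        return False
    return True


well_formed = "<execute>total_sum = 1 + 2</execute>"
wrong_tags = "[PYTHON]total_sum = 1 + 2[/PYTHON]"
escaped_underscore = "<execute>total\\_sum = 1 + 2</execute>"

parsed_ok = extract_code(well_formed)             # code is found
parsed_wrong_tags = extract_code(wrong_tags)      # None: parser sees no action
parsed_escaped = extract_code(escaped_underscore) # found, but contains "\_"
```

In both failing cases the model's underlying reasoning may be fine; the turn is lost to formatting, which is why these failures lower SR without reflecting core capability.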

Core Entities

Models

  • gpt-3.5-turbo-0613
  • gpt-4-0613
  • claude-2
  • claude-instant-1
  • chat-bison-001 (Bard)
  • LLaMA-2 (7B,13B,70B)
  • Vicuna-v1.5 (7B,13B)
  • CodeLLaMA (7B,13B,34B)
  • Lemur-v1-70B

Metrics

  • Success Rate (SR)
  • ∆tools (improvement per interaction turn)
  • ∆feedback (improvement from natural-language feedback)
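A rough sketch of how these three metrics could be computed from per-instance outcomes follows. The summary only describes ∆tools as the improvement per interaction turn, so fitting a least-squares slope of SR against k is an assumption here, as are the function names and the toy numbers.

```python
from typing import Dict, Sequence


def success_rate(outcomes: Dict[str, Sequence[bool]]) -> float:
    """Micro-averaged SR in percent: pool all instances across tasks."""
    pooled = [ok for task in outcomes.values() for ok in task]
    return 100.0 * sum(pooled) / len(pooled)


def per_turn_gain(sr_by_k: Sequence[float]) -> float:
    """Least-squares slope of SR against k = 1..n (assumed reading of ∆tools)."""
    n = len(sr_by_k)
    ks = range(1, n + 1)
    mean_k = sum(ks) / n
    mean_sr = sum(sr_by_k) / n
    numerator = sum((k - mean_k) * (sr - mean_sr) for k, sr in zip(ks, sr_by_k))
    denominator = sum((k - mean_k) ** 2 for k in ks)
    return numerator / denominator


def feedback_gain(sr_with_feedback: float, sr_without_feedback: float) -> float:
    """∆feedback: absolute SR difference at the same k."""
    return sr_with_feedback - sr_without_feedback


# Toy data, made up for illustration only.
toy_outcomes = {
    "reasoning": [True, False, False, False],
    "code": [True, True, False, False],
}
toy_sr = success_rate(toy_outcomes)                        # 3 of 8 solved
toy_slope = per_turn_gain([10.0, 14.0, 18.0, 22.0, 26.0])  # SR at k = 1..5
toy_feedback = feedback_gain(30.0, 22.5)
```

Micro-averaging pools instances before dividing, so tasks with more instances weigh more than they would under a per-task (macro) average.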

Datasets

  • HumanEval
  • MBPP
  • ALFWorld
  • GSM8K
  • HotpotQA
  • MATH
  • MMLU
  • TheoremQA

Benchmarks

  • MINT

Context Entities

Models

  • Vicuna (SIFT on ShareGPT)
  • CodeLLaMA-Instruct (SIFT)