Multilingual user queries break tool calls when models put non‑English text into parameters.

Overview

Decision Snapshot: Ready For Pilot

The paper provides a targeted diagnostic benchmark and a multi-model evaluation, so its conclusions about parameter-language mismatch are well supported for single-turn settings but are not exhaustive across all languages or multi-turn agents.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 0/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 50%

Novelty: 65%

Authors

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product invokes external APIs from an LLM, non-English user inputs can produce non-executable calls even when intent is correct, risking silent failures and poor global UX.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces MLCL, a diagnostic benchmark that tests how LLMs perform tool (function) calls when user queries are in Chinese, Hindi, or Igbo. Main finding: models often pick the right tool and intent but place parameter values in the user's language (e.g., Chinese), which makes the calls non-executable under English-only tool interfaces. The authors call this failure "parameter value language mismatch." Simple inference-time fixes (explicit prompts, pre-translation of the query, or post-translation of parameters) reduce the gap but do not fully restore English-level reliability, and behavior differs between high- and low-resource languages.

Problem Statement

LLM-based agents must generate structured tool calls (function name + parameter keys + parameter values) that satisfy rigid, typically English-only execution interfaces. Existing benchmarks mostly use English queries, so we lack a clear picture of how multilingual user queries affect tool-calling reliability and what kinds of failures occur at the language–execution boundary.
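To make the language-execution boundary concrete, here is a minimal sketch of an English-only tool rejecting a call whose intent is correct but whose values are in the user's language. The `get_weather` tool, its `unit` enum, and both example calls are hypothetical illustrations, not taken from the paper or from BFCL.

```python
# Hypothetical English-only tool: `unit` must be one of these exact strings.
ALLOWED_UNITS = {"celsius", "fahrenheit"}


def get_weather(city: str, unit: str) -> str:
    """Toy tool that validates its parameters the way a rigid API would."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return f"weather for {city} in {unit}"


# Same tool, same intent; only the language of the parameter values differs.
english_call = {
    "name": "get_weather",
    "arguments": {"city": "Beijing", "unit": "celsius"},
}
mismatched_call = {
    "name": "get_weather",
    "arguments": {"city": "北京", "unit": "摄氏度"},
}


def try_execute(call: dict) -> str:
    """Run the call and report whether the tool accepted it."""
    try:
        return get_weather(**call["arguments"])
    except ValueError as exc:
        return f"rejected: {exc}"


if __name__ == "__main__":
    print(try_execute(english_call))
    print(try_execute(mismatched_call))
```

The second call selects the right function and the right parameter keys, yet it cannot execute because the enum value is not an English string.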

Main Contribution

MLCL benchmark: a diagnostic multilingual extension of BFCL covering Chinese, Hindi, and Igbo with controlled query translation and semantic perturbations.

Fine-grained error taxonomy separating execution-level violations (e.g., language mismatch) from semantic errors.

Key Findings

Parameter value language mismatch is the main cause of execution failures when queries are fully translated to non-English.

Practical Use: If your tool interface expects English strings, models may produce non-executable calls for non-English queries even when intent is correct; add checks or normalization before execution.

Evidence Ref: Figure 4; Section 3.4

Partially translating queries (preserving English parameter tokens) greatly reduces language-mismatch failures and can, for some models, match or exceed English reference performance.

Practical Use: For deployed systems, consider keeping parameter-like keywords in English (or normalizing them) to avoid many execution errors without retraining models.

Evidence Ref: Section 3.4; Figure 4

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Execution error composition	FT (fully translated queries) increases execution-level errors, dominated by parameter-language mismatch	NT (English queries)	Substantial increase in FT vs. NT (no numeric value reported)	MLCL (Chinese, Hindi, Igbo)	Figure 4 and Section 3.4 describe more execution errors under FT, dominated by language mismatch	Figure 4; Section 3.4
Effect of partial translation (PAR)	PAR substantially reduces parameter-language mismatch and can match NT for some models	FT (fully translated queries)	Reduction in execution-level errors (no numeric value reported)	MLCL (all languages tested)	Section 3.4 and Figure 4 report that PAR reduces mismatch errors, sometimes matching the English reference	Figure 4; Section 3.4

What To Try In 7 Days

Run the MLCL suite on your tool-calling pipeline to measure multilingual gaps.

Add an input-normalization step that preserves or maps parameter tokens to canonical English identifiers (partial translation).

Add a validation layer that blocks or canonicalizes non-English parameter values before execution and logs mismatches.
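The validation step above could be sketched as follows. This is a heuristic under stated assumptions, not the paper's method: it flags string parameter values containing letters outside the Latin script, using Unicode character names. The function names are hypothetical. One limitation: Igbo is written in the Latin script, so a script-based check like this catches Chinese and Hindi values but not Igbo ones, which would need a language-identification step instead.

```python
import unicodedata
from typing import Any


def is_latin_only(value: str) -> bool:
    """Heuristic: True if every alphabetic character is a Latin-script letter."""
    for ch in value:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def find_language_mismatches(arguments: Any) -> list[str]:
    """Return the paths of string values that contain non-Latin letters."""
    mismatches: list[str] = []

    def walk(node: Any, path: str) -> None:
        if isinstance(node, str):
            if not is_latin_only(node):
                mismatches.append(path)
        elif isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else str(key))
        elif isinstance(node, list):
            for index, child in enumerate(node):
                walk(child, f"{path}[{index}]")

    walk(arguments, "")
    return mismatches


if __name__ == "__main__":
    call_arguments = {"city": "北京", "unit": "celsius"}
    flagged = find_language_mismatches(call_arguments)
    if flagged:
        # In a real pipeline this is where you would block, canonicalize, or log.
        print("blocked; non-English values at:", flagged)
```

Returning paths rather than a boolean lets the caller log exactly which parameter was affected, which helps when deciding between blocking the call and canonicalizing the value.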

Agent Features

Planning

function calling (structured API invocation)

Tool Use

structured function calls

API parameter generation

tool selection

Frameworks

BFCL protocol

MLCL diagnostic protocol

Is Agentic

Yes

Architectures

decoder-only LLMs (GPT-5 family)

encoder-decoder variants not emphasized

MoE

Llama family

Granite family

Optimization Features

System Optimization

input normalization and parameter canonicalization

Inference Optimization

prompt-level instruction (PT)

pre-translation input normalization (PRE)

post-translation of parameters (POST)
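The three inference-time strategies act at different points in the pipeline: PT changes the prompt, PRE changes the user query before the model sees it, and POST changes the generated parameter values before execution. A minimal sketch, assuming a translator is available as a plain `str -> str` callable. The instruction wording, the function names, and the toy dictionary translator are all hypothetical stand-ins, not the paper's implementation.

```python
from typing import Any, Callable

Translator = Callable[[str], str]

# Assumed wording; the paper's actual PT prompt is not reproduced here.
PT_INSTRUCTION = (
    "Always write tool parameter values in English, "
    "whatever language the user writes in."
)


def apply_pt(system_prompt: str) -> str:
    """PT: append an explicit language instruction to the system prompt."""
    return f"{system_prompt}\n{PT_INSTRUCTION}"


def apply_pre(user_query: str, translate: Translator) -> str:
    """PRE: translate the user query to English before calling the model."""
    return translate(user_query)


def apply_post(arguments: Any, translate: Translator) -> Any:
    """POST: translate every string value in the generated arguments."""
    if isinstance(arguments, str):
        return translate(arguments)
    if isinstance(arguments, dict):
        return {key: apply_post(value, translate) for key, value in arguments.items()}
    if isinstance(arguments, list):
        return [apply_post(value, translate) for value in arguments]
    return arguments  # numbers, booleans, and None pass through unchanged


# Toy translator for illustration only; a real system would call an MT service.
_TOY_DICTIONARY = {
    "北京": "Beijing",
    "北京的天气怎么样": "What is the weather in Beijing",
}


def toy_translate(text: str) -> str:
    return _TOY_DICTIONARY.get(text, text)


if __name__ == "__main__":
    print(apply_pre("北京的天气怎么样", toy_translate))
    print(apply_post({"city": "北京", "days": 3}, toy_translate))
```

Translating values blindly can itself change a surface form the tool expects, which is the semantic-drift failure mode listed under Risks & Boundaries, so a POST step is usually paired with validation against the tool schema.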

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://anonymous.4open.science/r/multilingual_robustness_tool_calling-CA44

Data URLs

https://anonymous.4open.science/r/multilingual_robustness_tool_calling-CA44

Risks & Boundaries

Limitations

Single-turn tool-calling only; results may not generalize to multi-turn agents.

Only three languages studied (Chinese, Hindi, Igbo); not globally exhaustive.

When Not To Use

If your tool interface accepts multilingual parameter values, MLCL's errors about English-only strings may not apply.

If your system is multi-turn or uses dialog context heavily, single-turn findings might miss interaction effects.

Failure Modes

Parameter value language mismatch: model copies non-English tokens into parameter values.

Semantic drift from translation normalization: translated parameters no longer match expected surface forms.

Core Entities

Models

GPT-5 (family: GPT-5, GPT-5 mini, GPT-5 nano)

DeepSeek V3.2

Llama 3.1 (8B, 70B Instruct)

Qwen 3 (8B, 14B, 30B-A3B, 32B, Next-80B-A3B)

Granite 4 (micro/small variants)

Metrics

overall error rate (exact-match tool-call correctness)

execution-level error breakdown (syntax, function-level, parameter language mismatch)

parameter value language mismatch rate
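As a rough illustration of how such metrics could be computed on your own logs, the sketch below compares predicted calls against reference calls and reports the share of each outcome. The category labels and the ASCII-based mismatch test are simplifying assumptions, not the paper's taxonomy or scoring code; in particular, `str.isascii()` also flags accented Latin text.

```python
from collections import Counter


def classify(pred: dict, ref: dict) -> str:
    """Assign one coarse outcome label to a predicted tool call."""
    if pred == ref:
        return "correct"
    if pred.get("name") != ref.get("name"):
        return "function_error"
    pred_args = pred.get("arguments", {})
    ref_args = ref.get("arguments", {})
    if set(pred_args) != set(ref_args):
        return "parameter_key_error"
    for key, ref_value in ref_args.items():
        pred_value = pred_args[key]
        # Assumption: a wrong, non-ASCII string value counts as language mismatch.
        if (
            pred_value != ref_value
            and isinstance(pred_value, str)
            and not pred_value.isascii()
        ):
            return "language_mismatch"
    return "value_error"


def outcome_rates(preds: list[dict], refs: list[dict]) -> dict[str, float]:
    """Fraction of calls per outcome; the overall error rate is 1 - correct."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return {}
    counts = Counter(classify(p, r) for p, r in zip(preds, refs))
    return {label: count / len(preds) for label, count in counts.items()}


if __name__ == "__main__":
    reference = {"name": "get_weather", "arguments": {"city": "Beijing"}}
    predictions = [
        {"name": "get_weather", "arguments": {"city": "Beijing"}},
        {"name": "get_weather", "arguments": {"city": "北京"}},
    ]
    print(outcome_rates(predictions, [reference] * len(predictions)))
```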

Datasets

Berkeley Function Calling Leaderboard (BFCL v4)

MLCL (this paper's multilingual diagnostic extension)

Benchmarks

BFCL

MLCL
