Multilingual user queries break tool calls when models put non‑English text into parameters.

January 8, 2026 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides a targeted diagnostic benchmark and a multi-model evaluation, so its conclusions about parameter-language mismatch are well supported for single-turn settings. They are not exhaustive across languages and do not cover multi-turn agents.

Citations: 0

Evidence Strength: 0.78

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 0/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 35%

Production readiness: 50%

Novelty: 65%

Authors

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If your product invokes external APIs from an LLM, non-English user inputs can produce non-executable calls even when intent is correct, risking silent failures and poor global UX.

Summary TLDR

This paper introduces MLCL, a diagnostic benchmark that tests how LLMs perform tool (function) calls when user queries are in Chinese, Hindi, or Igbo. Main finding: models often pick the right tool and intent but place parameter values in the user's language (e.g., Chinese), making calls non-executable under English-only tool interfaces — a failure the authors call "parameter value language mismatch." Simple inference-time fixes (explicit prompts, pre- or post-translation) reduce but do not fully restore English-level reliability, and behavior differs across high- and low-resource languages.

Problem Statement

LLM-based agents must generate structured tool calls (function name + parameter keys + parameter values) that satisfy rigid, typically English-only execution interfaces. Existing benchmarks mostly use English queries, so we lack a clear picture of how multilingual user queries affect tool-calling reliability and what kinds of failures occur at the language–execution boundary.

Main Contribution

MLCL benchmark: a diagnostic multilingual extension of BFCL covering Chinese, Hindi, and Igbo with controlled query translation and semantic perturbations.

Fine-grained error taxonomy separating execution-level violations (e.g., language mismatch) from semantic errors.

Key Findings

Parameter value language mismatch is the main cause of execution failures when queries are fully translated to non-English.

Practical Use: If your tool interface expects English strings, models may produce non-executable calls for non-English queries even when intent is correct; add checks or normalization before execution.

Evidence Ref: Figure 4; Section 3.4

Partially translating queries (preserving English parameter tokens) greatly reduces language-mismatch failures and can, for some models, match or exceed English reference performance.

Practical Use: For deployed systems, consider keeping parameter-like keywords in English (or normalizing them) to avoid many execution errors without retraining models.

Evidence Ref: Section 3.4; Figure 4
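A minimal sketch of the "keep parameter-like keywords in English" idea, assuming you already know which tokens must survive translation. The function name `partially_translate`, the `[[n]]` placeholder scheme, and the `translate` callable are all hypothetical and not taken from the paper; a real translation service may alter placeholders, so treat this as an illustration only.

```python
from typing import Callable, Iterable


def partially_translate(
    query: str,
    protected: Iterable[str],
    translate: Callable[[str], str],
) -> str:
    """Translate `query` while keeping every string in `protected` unchanged.

    `translate` is a stand-in for whatever translation step you use
    (hypothetical; plug in your own).
    """
    placeholders = {}
    masked = query
    # Mask longer tokens first so "New York City" wins over "New York".
    ordered = sorted(set(protected), key=len, reverse=True)
    for i, token in enumerate(ordered):
        if token in masked:
            key = f"[[{i}]]"
            placeholders[key] = token
            masked = masked.replace(token, key)

    translated = translate(masked)

    # Put the original English tokens back in place of the placeholders.
    for key, token in placeholders.items():
        translated = translated.replace(key, token)
    return translated
```

With a stand-in translator such as `str.upper`, the call `partially_translate("weather in Boston tomorrow", ["Boston"], str.upper)` rewrites the surrounding text but leaves `Boston` untouched.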

Results

Metric: Execution error composition
Value: FT increases execution-level errors; dominated by parameter-language mismatch
Baseline: NT (English queries)
Delta: substantial increase in FT vs NT (no numeric reported)
Split / Dataset: MLCL (Chinese, Hindi, Igbo)
Evidence: Figure 4 and Section 3.4 describe larger execution errors in FT dominated by language mismatch
Evidence Ref: Figure 4; Sec. 3.4

Metric: Effect of partial translation (PAR)
Value: PAR substantially reduces parameter-language mismatch and can match NT for some models
Baseline: FT (Fully translated)
Delta: reduction in execution-level errors (no numeric reported)
Split / Dataset: MLCL (all languages tested)
Evidence: Section 3.4 and Figure 4 report PAR reduces mismatch errors, sometimes matching English reference
Evidence Ref: Figure 4; Sec. 3.4

What To Try In 7 Days

Run the MLCL suite on your tool-calling pipeline to measure multilingual gaps.

Add an input-normalization step that preserves or maps parameter tokens to canonical English identifiers (partial translation).

Add a validation layer that blocks or canonicalizes non-English parameter values before execution and logs mismatches.
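The validation layer in the last step could look roughly like the sketch below. It assumes a tool call is a plain dict of arguments and uses a script check (non-Latin letters) as the mismatch signal; the function names are hypothetical. One caveat worth stating: a script check catches Chinese and Hindi values, but Igbo is written in Latin script, so Igbo values would pass and need a language-identification step instead.

```python
import logging
import unicodedata

logger = logging.getLogger("tool_call_guard")


def is_latin_text(value: str) -> bool:
    """Return True if every alphabetic character belongs to the Latin script."""
    for ch in value:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def find_mismatches(arguments, path=""):
    """Return the paths of all string values containing non-Latin letters."""
    mismatches = []
    if isinstance(arguments, str):
        if not is_latin_text(arguments):
            mismatches.append(path)
    elif isinstance(arguments, dict):
        for key, value in arguments.items():
            child = f"{path}.{key}" if path else str(key)
            mismatches.extend(find_mismatches(value, child))
    elif isinstance(arguments, (list, tuple)):
        for index, value in enumerate(arguments):
            mismatches.extend(find_mismatches(value, f"{path}[{index}]"))
    return mismatches


def guard_call(name: str, arguments: dict) -> bool:
    """Log and block a call whose parameter values are not in Latin script."""
    bad = find_mismatches(arguments)
    if bad:
        logger.warning("Blocked %s: non-Latin values at %s", name, bad)
        return False
    return True
```

Calling `guard_call` before dispatching to the real API gives you both the block and the mismatch log; swapping the `return False` for a canonicalization step is a design choice that depends on how much you trust your translation path.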

Agent Features

Planning
function calling (structured API invocation)
Tool Use
structured function calls
API parameter generation
tool selection
Frameworks
BFCL protocol
MLCL diagnostic protocol
Is Agentic

Yes

Architectures
decoder-only LLMs (GPT-5 family)
encoder-decoder variants not emphasized
MoE
Llama family
Granite family

Optimization Features

System Optimization
input normalization and parameter canonicalization
Inference Optimization
prompt-level instruction (PT)
pre-translation input normalization (PRE)
post-translation of parameters (POST)
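The three inference-time strategies (PT, PRE, POST) differ mainly in where the English normalization happens. The sketch below shows one way to wire them up; `call_model` and `to_english` are hypothetical callables you would supply, the instruction wording is an assumption rather than the paper's prompt, and POST here only handles top-level string arguments.

```python
from typing import Any, Callable, Dict

ToolCall = Dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}

# Assumed wording; the paper's actual prompt is not reproduced here.
PT_INSTRUCTION = "Always write tool-call parameter values in English."


def with_prompt_instruction(
    query: str, call_model: Callable[[str], ToolCall]
) -> ToolCall:
    """PT: prepend an explicit instruction and leave the query as-is."""
    return call_model(f"{PT_INSTRUCTION}\n\n{query}")


def with_pre_translation(
    query: str,
    call_model: Callable[[str], ToolCall],
    to_english: Callable[[str], str],
) -> ToolCall:
    """PRE: translate the user query to English before the model sees it."""
    return call_model(to_english(query))


def with_post_translation(
    query: str,
    call_model: Callable[[str], ToolCall],
    to_english: Callable[[str], str],
) -> ToolCall:
    """POST: let the model answer, then translate string parameter values."""
    call = call_model(query)
    fixed = {
        key: to_english(value) if isinstance(value, str) else value
        for key, value in call["arguments"].items()
    }
    return {"name": call["name"], "arguments": fixed}
```

Keeping the three strategies as separate functions with the same return shape makes it easy to run them side by side on the same queries and compare error rates.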

Risks & Boundaries

Limitations

Single-turn tool-calling only; results may not generalize to multi-turn agents.

Only three languages studied (Chinese, Hindi, Igbo); not globally exhaustive.

When Not To Use

If your tool interface accepts multilingual parameter values, MLCL's findings about English-only string parameters may not apply.

If your system is multi-turn or uses dialog context heavily, single-turn findings might miss interaction effects.

Failure Modes

Parameter value language mismatch: model copies non-English tokens into parameter values.

Semantic drift from translation normalization: translated parameters no longer match expected surface forms.

Core Entities

Models

GPT-5 (family: GPT-5, GPT-5 mini, GPT-5 nano)
DeepSeek V3.2
Llama 3.1 (8B, 70B Instruct)
Qwen 3 (8B, 14B, 30B-A3B, 32B, Next-80B-A3B)
Granite 4 (micro/small variants)

Metrics

overall error rate (exact-match tool-call correctness)
execution-level error breakdown (syntax, function-level, parameter language mismatch)
parameter value language mismatch rate
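As a rough illustration of how two of these metrics could be computed on your own logs, the sketch below treats a tool call as a dict with `name` and `arguments`. The exact-match comparison and the non-ASCII test for mismatch are simplifying assumptions, not the paper's scoring code; in particular, the ASCII test is a crude proxy that misses Latin-script languages and flags accented English values.

```python
from typing import Any, Dict, List

ToolCall = Dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}


def overall_error_rate(
    predictions: List[ToolCall], references: List[ToolCall]
) -> float:
    """Fraction of predicted calls that do not exactly match the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    errors = sum(pred != ref for pred, ref in zip(predictions, references))
    return errors / len(references)


def has_language_mismatch(call: ToolCall) -> bool:
    """True if any top-level string argument contains non-ASCII characters."""
    return any(
        isinstance(value, str) and not value.isascii()
        for value in call["arguments"].values()
    )


def mismatch_rate(predictions: List[ToolCall]) -> float:
    """Fraction of predicted calls with at least one mismatched value."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(has_language_mismatch(p) for p in predictions) / len(predictions)
```

Reporting the mismatch rate next to the overall error rate shows how much of the gap comes from parameter language alone rather than from wrong tools or wrong intent.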

Datasets

Berkeley Function Calling Leaderboard (BFCL v4)
MLCL (this paper's multilingual diagnostic extension)

Benchmarks

BFCL
MLCL