Multilingual user queries break tool calls when models put non‑English text into parameters.

January 8, 2026 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.65

Cost Impact Score

0.35

Citation Count

0

Authors

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

Why It Matters For Business

If your product invokes external APIs from an LLM, non-English user inputs can produce non-executable calls even when intent is correct, risking silent failures and poor global UX.

Summary TLDR

This paper introduces MLCL, a diagnostic benchmark that tests how LLMs make tool (function) calls when user queries are in Chinese, Hindi, or Igbo. Main finding: models often pick the right tool and intent but write parameter values in the user's language (e.g., Chinese), which makes the calls non-executable under English-only tool interfaces. The authors call this failure "parameter value language mismatch." Simple inference-time fixes (explicit prompting, pre-translation, or post-translation) reduce these errors but do not fully restore English-level reliability, and behavior differs between high- and low-resource languages.

Problem Statement

LLM-based agents must generate structured tool calls (function name + parameter keys + parameter values) that satisfy rigid, typically English-only execution interfaces. Existing benchmarks mostly use English queries, so we lack a clear picture of how multilingual user queries affect tool-calling reliability and what kinds of failures occur at the language–execution boundary.
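
To make the three parts of a tool call concrete, the minimal sketch below represents a call as a small Python dataclass and compares an expected English call with one whose value stayed in Chinese. The `ToolCall` class, the `get_weather` tool, and its `city` parameter are hypothetical illustrations, not taken from BFCL or MLCL.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A structured tool call: function name plus parameter keys and values."""
    name: str
    params: dict = field(default_factory=dict)


# Hypothetical tool and parameter, used only for illustration.
expected = ToolCall("get_weather", {"city": "Beijing"})
produced = ToolCall("get_weather", {"city": "北京"})  # value left in the query language

same_function = expected.name == produced.name
same_keys = expected.params.keys() == produced.params.keys()
same_values = expected.params == produced.params

print(same_function, same_keys, same_values)  # True True False
```

The function name and parameter keys agree, so the intent was understood, yet an exact-match check against an English-only interface would still reject the call because of the value.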

Main Contribution

MLCL benchmark: a diagnostic multilingual extension of BFCL covering Chinese, Hindi, and Igbo with controlled query translation and semantic perturbations.

Fine-grained error taxonomy separating execution-level violations (e.g., language mismatch) from semantic errors.

Empirical analysis showing parameter value language mismatch as a dominant failure mode and evaluation of three lightweight mitigation strategies (PT, PRE, POST).

Key Findings

Parameter value language mismatch is the main cause of execution failures when queries are fully translated to non-English.

Partially translating queries (preserving English parameter tokens) greatly reduces language-mismatch failures and can, for some models, match or exceed English reference performance.

Lightweight inference strategies (explicit prompting/PT, pre-translation/PRE, post-translation/POST) reduce language-induced errors but none fully recover English-level performance.

Error composition differs by language: high-resource languages (Chinese, Hindi) show more parameter-language copying; low-resource Igbo shows more semantic-understanding errors.

Translation fixes can introduce new failures via semantic drift and surface-form normalization, so normalization is not a perfect cure.

Authors release code and dataset for the benchmark.

Results

Execution error composition

  • Value: FT increases execution-level errors, dominated by parameter-language mismatch
  • Baseline: NT (English queries)

Effect of partial translation (PAR)

  • Value: PAR substantially reduces parameter-language mismatch and can match NT for some models
  • Baseline: FT (fully translated)

Effectiveness of inference-time mitigations

  • Value: PT, PRE, and POST reduce mismatch, but none restores NT performance; PRE is often best
  • Baseline: FT

What To Try In 7 Days

Run the MLCL suite on your tool-calling pipeline to measure multilingual gaps.

Add an input-normalization step that preserves or maps parameter tokens to canonical English identifiers (partial translation).

Add a validation layer that blocks or canonicalizes non-English parameter values before execution and logs mismatches.
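
A validation layer of this kind could look roughly like the sketch below. It assumes a tool call arrives as a function name plus a dict of parameters, and it uses "contains a non-ASCII letter" as a crude stand-in for "not English". That heuristic is an assumption of this sketch, not the paper's detector: it catches Chinese and Devanagari script, but it will miss Igbo values written in plain Latin letters and will flag legitimate names such as "São Paulo". The names `validate_call` and `execute_if_valid` are hypothetical.

```python
import logging

logger = logging.getLogger("tool_call_validation")


def has_non_ascii_letters(text: str) -> bool:
    """Crude proxy for 'not English': any alphabetic character outside ASCII."""
    return any(ch.isalpha() and not ch.isascii() for ch in text)


def validate_call(name: str, params: dict) -> list[str]:
    """Return the parameter keys whose string values look non-English, logging each."""
    flagged = [
        key for key, value in params.items()
        if isinstance(value, str) and has_non_ascii_letters(value)
    ]
    for key in flagged:
        logger.warning("language mismatch in %s(%s=%r)", name, key, params[key])
    return flagged


def execute_if_valid(name: str, params: dict, execute):
    """Call execute(name, params) only when no parameter value was flagged."""
    flagged = validate_call(name, params)
    if flagged:
        raise ValueError(f"blocked {name}: non-English values for {flagged}")
    return execute(name, params)


# Usage with a stand-in executor (hypothetical tool name and parameters).
print(validate_call("get_weather", {"city": "北京", "days": 3}))  # ['city']
print(execute_if_valid("get_weather", {"city": "Beijing"}, lambda n, p: "ok"))  # ok
```

Blocking is the conservative choice. A pipeline could instead route flagged values to a canonicalization step, at the cost of the translation-induced drift noted in the findings.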

Agent Features

Planning

  • function calling (structured API invocation)

Tool Use

  • structured function calls
  • API parameter generation
  • tool selection

Frameworks

  • BFCL protocol
  • MLCL diagnostic protocol

Is Agentic

true

Architectures

  • decoder-only LLMs (GPT-5 family)
  • encoder-decoder variants not emphasized
  • MoE
  • Llama family
  • Granite family

Optimization Features

System Optimization

  • input normalization and parameter canonicalization

Inference Optimization

  • prompt-level instruction (PT)
  • pre-translation input normalization (PRE)
  • post-translation of parameters (POST)
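
The three strategies differ only in where the language fix is applied, which the sketch below tries to show. It assumes a model is a callable taking a system prompt and a query and returning a call dict, and a translator is a callable returning English text. Both signatures, the prompt wording, and the toy stand-ins are assumptions of this sketch and do not come from the paper's released code.

```python
from typing import Callable

# Hypothetical interfaces, chosen only for this illustration.
Model = Callable[[str, str], dict]   # (system_prompt, query) -> {"name": ..., "params": {...}}
Translator = Callable[[str], str]    # text in any language -> English text

BASE_PROMPT = "Call the appropriate tool for the user's request."
PT_SUFFIX = " Write every parameter value in English."


def run_pt(model: Model, query: str) -> dict:
    """PT: leave the query untouched and add an explicit instruction to the prompt."""
    return model(BASE_PROMPT + PT_SUFFIX, query)


def run_pre(model: Model, translate: Translator, query: str) -> dict:
    """PRE: translate the query to English before the model sees it."""
    return model(BASE_PROMPT, translate(query))


def run_post(model: Model, translate: Translator, query: str) -> dict:
    """POST: let the model answer first, then translate string parameter values."""
    call = model(BASE_PROMPT, query)
    params = {
        key: translate(value) if isinstance(value, str) else value
        for key, value in call["params"].items()
    }
    return {"name": call["name"], "params": params}


# Toy stand-ins so the sketch runs without a real model or translation service.
LEXICON = {"北京": "Beijing", "北京的天气": "weather in Beijing"}


def toy_translate(text: str) -> str:
    return LEXICON.get(text, text)


def toy_model(system_prompt: str, query: str) -> dict:
    # Copies the city in the query's language unless the prompt asks for English.
    wants_english = "in English" in system_prompt or "Beijing" in query
    return {"name": "get_weather", "params": {"city": "Beijing" if wants_english else "北京"}}


query = "北京的天气"
print(toy_model(BASE_PROMPT, query))             # unmitigated call keeps the Chinese value
print(run_pt(toy_model, query))
print(run_pre(toy_model, toy_translate, query))
print(run_post(toy_model, toy_translate, query))
```

The toy model obeys the PT instruction perfectly and the toy lexicon never drifts, so all three strategies succeed here. Real models and translators do neither, which is consistent with the finding that none of the strategies fully recovers English-level performance.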

Reproducibility

Code Available

  • yes (the authors release the benchmark code)

Data Available

  • yes (the authors release the dataset)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single-turn tool-calling only; results may not generalize to multi-turn agents.
  • Only three languages studied (Chinese, Hindi, Igbo); not globally exhaustive.
  • Assumes English-only execution interfaces; some real systems may accept multilingual parameter values.
  • Mitigations are simple inference-time probes, not optimized retraining solutions.

When Not To Use

  • If your tool interface accepts multilingual parameter values, the English-only parameter-value errors that MLCL measures may not apply.
  • If your system is multi-turn or uses dialog context heavily, single-turn findings might miss interaction effects.

Failure Modes

  • Parameter value language mismatch: model copies non-English tokens into parameter values.
  • Semantic drift from translation normalization: translated parameters no longer match expected surface forms.
  • Query understanding failures in low-resource languages (semantic misinterpretation).
  • Syntax and function-schema errors that prevent parsing.

Core Entities

Models

  • GPT-5 (family: GPT-5, GPT-5 mini, GPT-5 nano)
  • DeepSeek V3.2
  • Llama 3.1 (8B, 70B Instruct)
  • Qwen 3 (8B, 14B, 30B-A3B, 32B, Next-80B-A3B)
  • Granite 4 (micro/small variants)

Metrics

  • overall error rate (exact-match tool-call correctness)
  • execution-level error breakdown (syntax, function-level, parameter language mismatch)
  • parameter value language mismatch rate
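
As a rough illustration of how these metrics relate, the sketch below labels each raw model output against a gold call and then aggregates the labels. The labeling rules are a simplification assumed for this example (JSON outputs, a non-ASCII-letter check for language mismatch, and wrong keys counted as function-level errors) and are not the paper's exact taxonomy or scoring code. The `classify` and `summarize` names and the `get_weather` tool are hypothetical.

```python
import json
from collections import Counter


def classify(raw_output: str, gold: dict) -> str:
    """Label one output against a gold call {"name": ..., "params": {...}}."""
    try:
        call = json.loads(raw_output)
        name, params = call["name"], call["params"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "syntax"
    if not isinstance(params, dict):
        return "syntax"
    if name != gold["name"] or params.keys() != gold["params"].keys():
        return "function"
    if params == gold["params"]:
        return "correct"
    for value in params.values():
        # Assumed heuristic: a non-ASCII letter means the value is not English.
        if isinstance(value, str) and any(c.isalpha() and not c.isascii() for c in value):
            return "language_mismatch"
    return "semantic"


def summarize(outputs: list[str], golds: list[dict]) -> dict:
    """Aggregate per-example labels into the rates listed above."""
    labels = Counter(classify(o, g) for o, g in zip(outputs, golds))
    total = sum(labels.values())
    return {
        "overall_error_rate": 1 - labels["correct"] / total,
        "language_mismatch_rate": labels["language_mismatch"] / total,
        "breakdown": dict(labels),
    }


gold = {"name": "get_weather", "params": {"city": "Beijing"}}
outputs = [
    '{"name": "get_weather", "params": {"city": "Beijing"}}',  # correct
    '{"name": "get_weather", "params": {"city": "北京"}}',      # language mismatch
    '{"name": "get_weather", "params": ',                       # truncated JSON
    '{"name": "get_time", "params": {"city": "Beijing"}}',      # wrong function
]
report = summarize(outputs, [gold] * len(outputs))
print(report["overall_error_rate"], report["language_mismatch_rate"])  # 0.75 0.25
```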

Datasets

  • Berkeley Function Calling Leaderboard (BFCL v4)
  • MLCL (this paper's multilingual diagnostic extension)

Benchmarks

  • BFCL
  • MLCL