Overview
The paper provides a targeted diagnostic benchmark and a multi-model evaluation. Its conclusions about parameter-language mismatch are well supported for single-turn settings, but they do not extend to all languages or to multi-turn agents.
Citations: 0
Evidence Strength: 0.78
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 0/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 35%
Production readiness: 50%
Novelty: 65%
Why It Matters For Business
If your product invokes external APIs from an LLM, non-English user inputs can produce non-executable calls even when intent is correct, risking silent failures and poor global UX.
Who Should Care
Summary TLDR
This paper introduces MLCL, a diagnostic benchmark that tests how LLMs perform tool (function) calls when user queries are in Chinese, Hindi, or Igbo. Main finding: models often pick the right tool and intent but place parameter values in the user's language (e.g., Chinese), making calls non-executable under English-only tool interfaces — a failure the authors call "parameter value language mismatch." Simple inference-time fixes (explicit prompts, pre- or post-translation) reduce but do not fully restore English-level reliability, and behavior differs across high- and low-resource languages.
Problem Statement
LLM-based agents must generate structured tool calls (function name + parameter keys + parameter values) that satisfy rigid, typically English-only execution interfaces. Existing benchmarks mostly use English queries, so we lack a clear picture of how multilingual user queries affect tool-calling reliability and what kinds of failures occur at the language–execution boundary.
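To make the structure concrete, here is a minimal sketch of a tool call and a strict comparison of the kind an English-only interface might apply. The `ToolCall` class, the `get_weather` tool, and its parameters are illustrative assumptions, not part of MLCL or BFCL.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical container: function name plus parameter key/value pairs."""
    name: str
    arguments: dict = field(default_factory=dict)


def matches(call: ToolCall, expected: ToolCall) -> bool:
    """Strict equality on name, keys, and values (an assumed checking rule)."""
    return call.name == expected.name and call.arguments == expected.arguments


expected = ToolCall("get_weather", {"city": "Beijing", "unit": "celsius"})

# Same tool and same intent, but the value stays in the user's language.
mismatched = ToolCall("get_weather", {"city": "北京", "unit": "celsius"})

print(matches(expected, expected))    # True
print(matches(mismatched, expected))  # False
```

The second call picks the right function and the right keys, yet it fails the check only because of the language of one value.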
Main Contribution
MLCL benchmark: a diagnostic multilingual extension of BFCL covering Chinese, Hindi, and Igbo with controlled query translation and semantic perturbations.
Fine-grained error taxonomy separating execution-level violations (e.g., language mismatch) from semantic errors.
Key Findings
Parameter value language mismatch is the main cause of execution failures when queries are fully translated into a non-English language.
Partially translating queries (preserving English parameter tokens) greatly reduces language-mismatch failures and can, for some models, match or exceed English reference performance.
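One way a partially translated query might be built is placeholder masking: protect the English parameter tokens, translate the rest, then restore the tokens. The sketch below assumes this approach; the paper summary does not say how MLCL constructs its partial translations. `partial_translate` and `toy_translate` are hypothetical names, and the toy translator is a stand-in for a real translation system.

```python
from typing import Callable


def partial_translate(
    query: str,
    protected: list[str],
    translate: Callable[[str], str],
) -> str:
    """Translate `query` while keeping every string in `protected` verbatim."""
    masked = query
    slots: dict[str, str] = {}
    # Mask longer tokens first so a short token cannot split a longer one.
    for i, token in enumerate(sorted(protected, key=len, reverse=True)):
        slot = f"__SLOT{i}__"
        if token in masked:
            masked = masked.replace(token, slot)
            slots[slot] = token
    translated = translate(masked)
    # Put the original English tokens back in place of the placeholders.
    for slot, token in slots.items():
        translated = translated.replace(slot, token)
    return translated


def toy_translate(text: str) -> str:
    """Toy phrase-table translator, for demonstration only."""
    table = {"What is the weather in": "查询天气：", "Beijing": "北京"}
    for english, chinese in table.items():
        text = text.replace(english, chinese)
    return text


query = "What is the weather in Beijing"
full = toy_translate(query)
partial = partial_translate(query, ["Beijing"], toy_translate)
print(full)
print(partial)
```

In the fully translated output the city name appears in Chinese, while the partially translated output keeps `Beijing` as an English token that a model can copy directly into a parameter value.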
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Execution error composition | FT increases execution-level errors, which are dominated by parameter-language mismatch | NT (English queries) | Substantial increase in FT vs. NT (no numeric value reported) | MLCL (Chinese, Hindi, Igbo) | Figure 4 and Section 3.4 describe more execution errors under FT, dominated by language mismatch | Figure 4; Sec. 3.4 |
| Effect of partial translation (PAR) | PAR substantially reduces parameter-language mismatch and can match NT for some models | FT (fully translated queries) | Reduction in execution-level errors (no numeric value reported) | MLCL (all languages tested) | Section 3.4 and Figure 4 report that PAR reduces mismatch errors, sometimes matching the English reference | Figure 4; Sec. 3.4 |
What To Try In 7 Days
Run the MLCL suite on your tool-calling pipeline to measure multilingual gaps.
Add an input-normalization step that preserves or maps parameter tokens to canonical English identifiers (partial translation).
Add a validation layer that blocks or canonicalizes non-English parameter values before execution and logs mismatches.
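The validation layer in the last item could look roughly like the sketch below. It assumes that a non-ASCII string value is a reasonable proxy for "not English", which is crude: it misses romanized non-English text and flags legitimate accented names. The `guard` function, the logger name, and the `CANONICAL` mapping are hypothetical.

```python
import logging

logger = logging.getLogger("tool_call_guard")

# Hypothetical mapping from known non-English values to canonical English ones.
CANONICAL = {"北京": "Beijing"}


def non_ascii_params(arguments: dict) -> list[str]:
    """Return the keys whose string values contain non-ASCII characters."""
    return [
        key for key, value in arguments.items()
        if isinstance(value, str) and not value.isascii()
    ]


def guard(arguments: dict, canonical: dict = CANONICAL) -> dict:
    """Canonicalize known values, block unknown ones, and log both cases."""
    cleaned = dict(arguments)
    for key in non_ascii_params(arguments):
        value = arguments[key]
        if value in canonical:
            logger.warning("canonicalized %s=%r -> %r", key, value, canonical[value])
            cleaned[key] = canonical[value]
        else:
            logger.error("blocked call: %s=%r is not an English value", key, value)
            raise ValueError(f"non-English value for parameter {key!r}")
    return cleaned


print(guard({"city": "北京", "unit": "celsius"}))
```

Raising on unknown values turns a silent downstream failure into an explicit, logged one; whether to block or to fall back to a translation step is a product decision that this sketch leaves open.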
Agent Features
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Single-turn tool-calling only; results may not generalize to multi-turn agents.
Only three languages studied (Chinese, Hindi, Igbo); not globally exhaustive.
When Not To Use
If your tool interface accepts multilingual parameter values, MLCL's findings on English-only parameter values may not apply.
If your system is multi-turn or uses dialog context heavily, single-turn findings might miss interaction effects.
Failure Modes
Parameter value language mismatch: model copies non-English tokens into parameter values.
Semantic drift from translation normalization: translated parameters no longer match expected surface forms.

