Overview
Production Readiness
0.5
Novelty Score
0.65
Cost Impact Score
0.35
Citation Count
0
Why It Matters For Business
If your product lets an LLM invoke external APIs, non-English user inputs can produce non-executable calls even when the model infers the intent correctly, risking silent failures and a poor experience for users worldwide.
Summary TLDR
This paper introduces MLCL, a diagnostic benchmark that tests how LLMs perform tool (function) calls when user queries are in Chinese, Hindi, or Igbo. Main finding: models often pick the right tool and intent but write parameter values in the user's language (e.g., Chinese), which makes the calls non-executable under English-only tool interfaces. The authors call this failure "parameter value language mismatch." Simple inference-time fixes (explicit prompts, pre- or post-translation) reduce the problem but do not fully restore English-level reliability, and behavior differs between high- and low-resource languages.
Problem Statement
LLM-based agents must generate structured tool calls (function name + parameter keys + parameter values) that satisfy rigid, typically English-only execution interfaces. Existing benchmarks mostly use English queries, so we lack a clear picture of how multilingual user queries affect tool-calling reliability and what kinds of failures occur at the language–execution boundary.
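To make the language–execution boundary concrete, here is a minimal sketch of a structured tool call checked against an English-only schema. The `get_weather` tool, its parameters, and the `matches_schema` helper are hypothetical and are not taken from the paper or from BFCL.

```python
import json

# Hypothetical tool schema; the name and parameters are illustrative only.
WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {"city": str, "unit": str},
}


def matches_schema(call: dict, tool: dict) -> bool:
    """Return True if the call names the tool and supplies exactly its
    parameter keys, each with a value of the expected Python type."""
    if call.get("name") != tool["name"]:
        return False
    args = call.get("arguments", {})
    if set(args) != set(tool["parameters"]):
        return False
    return all(isinstance(args[k], t) for k, t in tool["parameters"].items())


# Same intent, same function name, same keys; only the value language differs.
english_call = json.loads(
    '{"name": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}'
)
chinese_call = json.loads(
    '{"name": "get_weather", "arguments": {"city": "北京", "unit": "摄氏度"}}'
)
```

Both calls pass this structural check, which is why a schema check alone does not reveal the problem: a backend that only recognizes English strings would still reject the second call at execution time.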
Main Contribution
MLCL benchmark: a diagnostic multilingual extension of BFCL covering Chinese, Hindi, and Igbo with controlled query translation and semantic perturbations.
Fine-grained error taxonomy separating execution-level violations (e.g., language mismatch) from semantic errors.
Empirical analysis showing parameter value language mismatch as a dominant failure mode, plus an evaluation of three lightweight mitigation strategies: explicit prompting (PT), pre-translation (PRE), and post-translation (POST).
Key Findings
Parameter value language mismatch is the main cause of execution failures when queries are fully translated into a non-English language.
Partially translating queries (preserving English parameter tokens) greatly reduces language-mismatch failures and can, for some models, match or exceed English reference performance.
Lightweight inference strategies (explicit prompting/PT, pre-translation/PRE, post-translation/POST) reduce language-induced errors but none fully recover English-level performance.
Error composition differs by language: high-resource languages (Chinese, Hindi) show more parameter-language copying; low-resource Igbo shows more semantic-understanding errors.
Translation fixes can introduce new failures via semantic drift and surface-form normalization, so normalization is not a perfect cure.
Authors release code and dataset for the benchmark.
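The partial-translation finding can be illustrated with a placeholder-masking sketch: English parameter tokens are masked, the rest of the query is translated, and the tokens are restored. This summary does not describe how the authors build their partially translated queries, so the approach below is an assumption. `translate` is a hypothetical callable standing in for any translation backend, and the sketch assumes that backend leaves the `__P0__`-style placeholders untouched.

```python
from typing import Callable


def partially_translate(
    query: str,
    protected: list[str],
    translate: Callable[[str], str],
) -> str:
    """Translate a query while keeping the protected tokens in English."""
    masked = query
    # Swap each protected token for a placeholder the translator should ignore.
    for i, token in enumerate(protected):
        masked = masked.replace(token, f"__P{i}__")
    translated = translate(masked)
    # Put the original English tokens back in place of the placeholders.
    for i, token in enumerate(protected):
        translated = translated.replace(f"__P{i}__", token)
    return translated
```

In practice the list of protected tokens would have to come from somewhere, such as the reference parameter values of a benchmark example or an entity extractor, and a real translator may reorder or alter placeholders, so the output should be checked before use.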
Results
Execution error composition
Effect of partial translation (PAR)
Effectiveness of inference-time mitigations
Who Should Care
What To Try In 7 Days
Run the MLCL suite on your tool-calling pipeline to measure multilingual gaps.
Add an input-normalization step that preserves or maps parameter tokens to canonical English identifiers (partial translation).
Add a validation layer that blocks or canonicalizes non-English parameter values before execution and logs mismatches.
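A validation layer of the kind suggested in the last step might look like the sketch below. It uses a deliberately crude heuristic: any string parameter value containing non-ASCII characters is treated as non-English. That heuristic is an assumption of this sketch, not the paper's detection method, and it will miss non-English text written in plain ASCII (which can happen with Latin-script languages such as Igbo). The logger name and function names are illustrative.

```python
import logging

logger = logging.getLogger("tool_call_validator")


def find_non_ascii_params(arguments: dict) -> list[str]:
    """Return the keys whose string values contain non-ASCII characters."""
    return [
        key
        for key, value in arguments.items()
        if isinstance(value, str) and not value.isascii()
    ]


def validate_call(call: dict) -> bool:
    """Log every suspicious parameter and return False if the call should be blocked."""
    offending = find_non_ascii_params(call.get("arguments", {}))
    for key in offending:
        logger.warning(
            "non-ASCII value for parameter %r in call to %r", key, call.get("name")
        )
    return not offending
```

A caller would execute the tool only when `validate_call` returns `True`, and otherwise route the call to a canonicalization step or return an error to the agent. The logged mismatches give a running measure of how often the problem occurs in production traffic.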
Agent Features
Planning
- function calling (structured API invocation)
Tool Use
- structured function calls
- API parameter generation
- tool selection
Frameworks
- BFCL protocol
- MLCL diagnostic protocol
Is Agentic
true
Architectures
- decoder-only LLMs (GPT-5 family)
- encoder-decoder variants not emphasized
- MoE
- Llama family
- Granite family
Optimization Features
System Optimization
- input normalization and parameter canonicalization
Inference Optimization
- prompt-level instruction (PT)
- pre-translation input normalization (PRE)
- post-translation of parameters (POST)
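The three inference-time strategies differ only in where the language fix is applied relative to the model call. The sketch below shows that difference under stated assumptions: `model` and `translate` are hypothetical callables supplied by the caller, the instruction wording in `PT_INSTRUCTION` is invented for illustration, and the paper's actual prompts and translation setup may differ.

```python
from typing import Callable

Model = Callable[[str], dict]      # prompt -> {"name": ..., "arguments": {...}}
Translate = Callable[[str], str]   # text in any language -> English text

# Illustrative wording only; not the prompt used in the paper.
PT_INSTRUCTION = "Write all parameter values in English."


def run_pt(query: str, model: Model) -> dict:
    """PT: keep the query as-is and add an explicit instruction to the prompt."""
    return model(f"{PT_INSTRUCTION}\n{query}")


def run_pre(query: str, model: Model, translate: Translate) -> dict:
    """PRE: translate the query to English before the model sees it."""
    return model(translate(query))


def run_post(query: str, model: Model, translate: Translate) -> dict:
    """POST: call the model on the original query, then translate string parameter values."""
    call = model(query)
    arguments = {
        key: translate(value) if isinstance(value, str) else value
        for key, value in call["arguments"].items()
    }
    return {**call, "arguments": arguments}
```

PRE and POST both depend on translation quality, which is consistent with the reported finding that translation can introduce semantic drift or surface forms that no longer match what the tool expects.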
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single-turn tool-calling only; results may not generalize to multi-turn agents.
- Only three languages studied (Chinese, Hindi, Igbo); not globally exhaustive.
- Assumes English-only execution interfaces; some real systems may accept multilingual parameter values.
- Mitigations are simple inference-time probes, not optimized retraining solutions.
When Not To Use
- If your tool interface accepts multilingual parameter values, the language-mismatch errors MLCL reports under its English-only assumption may not apply.
- If your system is multi-turn or uses dialog context heavily, single-turn findings might miss interaction effects.
Failure Modes
- Parameter value language mismatch: model copies non-English tokens into parameter values.
- Semantic drift from translation normalization: translated parameters no longer match expected surface forms.
- Query understanding failures in low-resource languages (semantic misinterpretation).
- Syntax and function-schema errors that prevent parsing.
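One way to turn these failure modes into an automatic triage step is sketched below. The category names follow the list above, but the decision order, the label strings, and the non-ASCII heuristic for language mismatch are assumptions of this sketch rather than the paper's taxonomy implementation. Note that exact comparison against a reference cannot distinguish translation drift from query misunderstanding, so both fall under `semantic_error` here.

```python
import json


def classify(raw_output: str, reference: dict) -> str:
    """Assign a model's raw tool-call output to one coarse failure category."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return "syntax_error"
    # Wrong function name or wrong parameter keys count as function-level errors.
    if not isinstance(call, dict) or call.get("name") != reference["name"]:
        return "function_error"
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(reference["arguments"]):
        return "function_error"
    # Heuristic: non-ASCII string values are treated as non-English.
    if any(isinstance(v, str) and not v.isascii() for v in args.values()):
        return "parameter_language_mismatch"
    if args != reference["arguments"]:
        return "semantic_error"
    return "correct"
```

Checking language mismatch before value equality matters: a Chinese city name would otherwise be counted as a generic wrong value, hiding the execution-level cause.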
Core Entities
Models
- GPT-5 (family: GPT-5, GPT-5 mini, GPT-5 nano)
- DeepSeek V3.2
- Llama 3.1 (8B, 70B Instruct)
- Qwen 3 (8B, 14B, 30B-A3B, 32B, Next-80B-A3B)
- Granite 4 (micro/small variants)
Metrics
- overall error rate (exact-match tool-call correctness)
- execution-level error breakdown (syntax, function-level, parameter language mismatch)
- parameter value language mismatch rate
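As a rough illustration of how the first and last of these metrics could be computed over a batch of parsed calls, consider the sketch below. It assumes each call is a dict with `name` and `arguments` keys, uses plain dict equality for exact match, and again treats non-ASCII string values as a proxy for non-English values. The official BFCL and MLCL scorers may apply different matching rules.

```python
def error_rate(predictions: list[dict], references: list[dict]) -> float:
    """Fraction of predicted calls that do not exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    wrong = sum(pred != ref for pred, ref in zip(predictions, references))
    return wrong / len(references)


def language_mismatch_rate(predictions: list[dict]) -> float:
    """Fraction of predicted calls with at least one non-ASCII string value."""
    if not predictions:
        return 0.0
    mismatched = sum(
        any(
            isinstance(value, str) and not value.isascii()
            for value in pred.get("arguments", {}).values()
        )
        for pred in predictions
    )
    return mismatched / len(predictions)
```

Comparing the two numbers per language shows how much of the overall error is attributable to value language alone rather than to wrong tools or wrong values.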
Datasets
- Berkeley Function Calling Leaderboard (BFCL v4)
- MLCL (this paper's multilingual diagnostic extension)
Benchmarks
- BFCL
- MLCL

