Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks
Zero-shot ChatGPT is adequate for quick grammar-level tasks (such as POS tagging) but is not reliable for production semantic tasks across many languages; for higher accuracy and lower operational risk, invest in task- and language-specific models.
Key finding
ChatGPT generally underperforms supervised, task-specific models on multilingual semantic tasks.
Numbers: XNLI average accuracy: ChatGPT (en) 57.0% vs. mT5-XXL 87.1%
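To put the XNLI figures in perspective, the gap can be worked out directly. The short Python sketch below uses only the two accuracies quoted above; the function name `accuracy_gap` and the variable names are illustrative and do not come from the evaluation itself.

```python
# XNLI average accuracy in percent, as quoted in the text above.
chatgpt_en_acc = 57.0
mt5_xxl_acc = 87.1


def accuracy_gap(baseline: float, candidate: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, candidate/baseline ratio).

    Hypothetical helper for illustration only.
    """
    return baseline - candidate, candidate / baseline


gap_points, ratio = accuracy_gap(mt5_xxl_acc, chatgpt_en_acc)
print(f"Gap: {gap_points:.1f} percentage points")
print(f"ChatGPT reaches {ratio:.0%} of mT5-XXL accuracy")
```

On these two numbers, the difference is about 30 percentage points, meaning ChatGPT reaches roughly two thirds of the supervised model's accuracy on this benchmark.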

