Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Switching to a single soft-prompted model cuts model count and maintenance while keeping similar accuracy; using LLM-assisted label transfer lowers non-English annotation costs and speeds roll-out across languages.
Summary TLDR
This paper shows two practical ways to scale real-time game chat moderation. First, add a short GAME_TYPE_TOKEN (a soft prompt) before chat context so one small model (ToxBuster) handles multiple games with near-equal accuracy to more costly schemes (macro F1 ≈43%). Second, use GPT-4o-mini to transfer labels from 15 existing datasets into a new MLSNT multilingual dataset, keeping only items where human annotators and the LLM agree to raise label quality (binary weighted F1 up to 86.66%, and agreed toxic precision ~79.12%). Human checks on live chat show macro F1 varies by language (19–59%), highlighting where more data or human review is still needed. In production at Ubisoft the unified set
Problem Statement
Game studios must detect toxic chat in real time across many games and languages while keeping compute, latency, and annotation costs low. Maintaining separate per-game and per-language models is costly and hard to scale. The paper tackles: (1) unifying game-specific models into one deployable model, and (2) extending coverage to multiple languages without prohibitive human annotation costs.
Main Contribution
A soft-prompting method (GAME_TYPE_TOKEN) that unifies multiple game models while preserving performance and improving scalability
An LLM-assisted label transfer pipeline that converts 15 open datasets into a new MLSNT multilingual toxicity dataset
A production-ready unified ToxBuster model supporting multiple games and seven languages with concrete production metrics
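As a rough illustration of the first contribution, the sketch below prepends a game-type token to the chat context before the text is tokenized. It covers input formatting only. The token strings, the game keys, the separator, and the `build_model_input` helper are all hypothetical; this summary does not give the paper's actual token names or context layout.

```python
from typing import Sequence

# Hypothetical token strings, one per game. The real GAME_TYPE_TOKEN values
# are not given in this summary.
GAME_TYPE_TOKENS = {
    "game_a": "[GAME_A]",
    "game_b": "[GAME_B]",
}


def build_model_input(
    game: str,
    context: Sequence[str],
    message: str,
    sep: str = " [SEP] ",  # assumed separator between chat lines
) -> str:
    """Return the game-type token, then prior chat lines, then the new message."""
    if game not in GAME_TYPE_TOKENS:
        # Fail loudly rather than silently dropping the token at inference.
        raise KeyError(f"no game-type token registered for {game!r}")
    parts = [GAME_TYPE_TOKENS[game], *context, message]
    return sep.join(parts)


example = build_model_input("game_a", ["gg"], "nice shot")
```

In a typical Hugging Face setup, the token strings would also be registered as additional special tokens and the embedding matrix resized so each token gets its own learned embedding. That step is an assumption about the training stack and is not shown here.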
Key Findings
A single soft-prompted model matches multi-step curriculum learning on game-level detection.
LLM label transfer can produce high-quality binary toxic/non-toxic labels.
Filtering to cases where humans and the LLM agree boosts toxic-class reliability.
Multilingual detection quality varies widely by language in live game chat.
The unified system has measurable production impact at Ubisoft.
Results
Macro F1-score (cross-game, soft-prompting)
GPT-4o-mini label transfer (binary weighted F1)
Agreed toxic label performance (human+LLM)
Human-eval macro F1 per language (sampled game chat)
Production signal
Who Should Care
What To Try In 7 Days
Add a GAME_TYPE_TOKEN before chat context and train a single classifier on mixed game data to test unified deployment.
Run GPT-4o-mini label transfer on one non-English dataset and keep only human+LLM-agreed labels to seed multilingual training.
Compare xlm-roberta-base and xlm-roberta-base-adapted on a week of in-game chat to decide domain adaptation needs.
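The agreement filter in the second step can be sketched as follows. This is a minimal sketch that assumes both the original human label and the LLM-proposed label have already been mapped to the same target schema. The `LabeledLine` record, its field names, and the two helper functions are hypothetical, and the call to GPT-4o-mini itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LabeledLine:
    text: str
    human_label: str  # original dataset annotation, mapped to the target schema
    llm_label: str    # label proposed by the LLM for the same line


def keep_agreed(lines: Sequence[LabeledLine]) -> List[LabeledLine]:
    """Keep only lines where the human and LLM labels match."""
    return [line for line in lines if line.human_label == line.llm_label]


def discard_rate(lines: Sequence[LabeledLine], kept: Sequence[LabeledLine]) -> float:
    """Fraction of lines dropped by the filter (0.0 for an empty input)."""
    if not lines:
        return 0.0
    return 1 - len(kept) / len(lines)


sample = [
    LabeledLine("line 1", "toxic", "toxic"),
    LabeledLine("line 2", "non-toxic", "non-toxic"),
    LabeledLine("line 3", "toxic", "non-toxic"),  # disagreement, dropped
    LabeledLine("line 4", "non-toxic", "non-toxic"),
]
agreed = keep_agreed(sample)
dropped = discard_rate(sample, agreed)
```

Tracking the discard rate per dataset is worthwhile because the summary reports that the filter removes 10–70% of lines, so coverage can differ sharply between source datasets.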
Optimization Features
Infra Optimization
- multilingual variants kept under 300M params to limit latency
Model Optimization
- soft-prompting (prefix tokens)
System Optimization
- single-model deployment reduces maintenance
Training Optimization
- mixed-dataset training
- domain adaptation via continued MLM pretraining
Inference Optimization
- use of small language models (SLMs) for real-time inference
Reproducibility
Data Urls
- ComplexDataLab/MLSNT (paper claims release)
- https://arxiv.org/abs/2506.06347
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation is limited to two Ubisoft games; other genres may differ.
- LLM-assisted transfer discards 10–70% of lines, so coverage varies by dataset.
- Language ID (Lingua-py) is weaker on short game chat, affecting language selection.
- Performance varies strongly by language; Japanese was especially poor due to adversarial/translated chat.
When Not To Use
- When a language has very little clean training data, or when its chat contains adversarial translations.
- When strict fine-grained toxicity labels are required without human review.
- When per-game top accuracy must exceed that of curriculum learning and you can afford per-game training costs.
Failure Modes
- LLM label transfer may inherit dataset annotation bias and miss subtle cultural context.
- Filtering approach excludes ambiguous cases, pushing difficult decisions to costly human review.
- Soft-prompting may underperform if GAME_TYPE_TOKEN is misplaced or omitted at inference.
- Domain shift between dataset text and live chat (slang, translations) reduces accuracy.
Core Entities
Models
- ToxBuster
- bert-base-uncased
- xlm-roberta-base
- xlm-roberta-base-adapted
- GPT-4o-mini
Metrics
- macro F1-score
- weighted F1-score
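The two metrics differ only in how per-class F1 scores are averaged: macro F1 takes the unweighted mean over classes, while weighted F1 weights each class by its support in the true labels. The sketch below is a plain-Python illustration of those standard definitions. The function names are hypothetical and this is not the paper's evaluation code.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 per label, over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# A classifier that never predicts the rare "toxic" class.
truth = ["toxic", "ok", "ok", "ok"]
preds = ["ok", "ok", "ok", "ok"]
macro = macro_f1(truth, preds)
weighted = weighted_f1(truth, preds)
```

On imbalanced chat data, a model that misses the rare toxic class is penalized much more under macro F1 than under weighted F1, which is one reason the macro and weighted figures in this summary are not directly comparable.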
Datasets
- MLSNT
- COLD
- SWSR
- TOXICN
- TOCAB
- MLMA
- GAHD
- GERM_EVAL
- HASOC
- Inspection AI
- LLM_JP
- OffCom
- OLID
- ToLD
- Abusive
- South_Park
Context Entities
Models
- distilbert
- deberta
- roberta-base
Metrics
- precision
- recall

