Use soft prompts and LLM label transfer to scale real-time in-game toxicity detection across games and languages

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

Authors

Zachary Yang, Domenico Tullo, Reihaneh Rabbany

Why It Matters For Business

Switching to a single soft-prompted model cuts model count and maintenance while keeping similar accuracy; using LLM-assisted label transfer lowers non-English annotation costs and speeds roll-out across languages.

Summary TLDR

This paper shows two practical ways to scale real-time game chat moderation. First, add a short GAME_TYPE_TOKEN (a soft prompt) before chat context so one small model (ToxBuster) handles multiple games with near-equal accuracy to more costly schemes (macro F1 ≈43%). Second, use GPT-4o-mini to transfer labels from 15 existing datasets into a new MLSNT multilingual dataset, keeping only items where human annotators and the LLM agree to raise label quality (binary weighted F1 up to 86.66%, and agreed toxic precision ~79.12%). Human checks on live chat show macro F1 varies by language (19–59%), highlighting where more data or human review is still needed. In production at Ubisoft the unified set

Problem Statement

Game studios must detect toxic chat in real time across many games and languages while keeping compute, latency, and annotation costs low. Maintaining separate per-game and per-language models is costly and hard to scale. The paper tackles: (1) unifying game-specific models into one deployable model, and (2) extending coverage to multiple languages without prohibitive human annotation costs.

Main Contribution

A soft-prompting method (GAME_TYPE_TOKEN) that unifies multiple game models while preserving performance and improving scalability

An LLM-assisted label transfer pipeline that converts 15 open datasets into a new MLSNT multilingual toxicity dataset

A production-ready unified ToxBuster model supporting multiple games and seven languages with concrete production metrics

Key Findings

A single soft-prompted model matches multi-step curriculum learning on game-level detection.

Numbers: Soft prompting overall macro F1 = 43.16% vs curriculum best 43.35% (Table 1)

LLM label transfer can produce high-quality binary toxic/non-toxic labels.

Numbers: GPT-4o-mini binary weighted F1 = 86.66% on full agreed dataset (Table 3)

Filtering to cases where humans and the LLM agree boosts toxic-class reliability.

Numbers: Agreed toxic performance = 79.12% vs unfiltered class-wise 84.48%; filtering yields ~40% relative gain on toxic category

Multilingual detection quality varies widely by language in live game chat.

Numbers: Human-eval macro F1 ranges from 19.07% (Japanese) to 58.88% (German) (Table 8)

Unified system has measurable production impact at Ubisoft.

Numbers: System flags on average 50 players per game per day for sanctionable behavior (Abstract)

Results

Macro F1-score (cross-game, soft-prompting)

Value: 43.16% overall (soft prompting, game-aware)

Baseline: Single-game overall 40.82%

GPT-4o-mini label transfer (binary weighted F1)

Value: 86.66%

Agreed toxic label performance (human+LLM)

Value: 79.12% (agreed toxic class-wise F1)

Baseline: Unfiltered class-wise 84.48%

Human-eval macro F1 per language (sampled game chat)

Value: Range 19.07% to 58.88%

Baseline: English GAME_1 = 45.39% (Table 7/8)

Production signal

Value: ≈50 players flagged per game per day

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist

What To Try In 7 Days

Add a GAME_TYPE_TOKEN before chat context and train a single classifier on mixed game data to test unified deployment.

Run GPT-4o-mini label transfer on one non-English dataset and keep only human+LLM-agreed labels to seed multilingual training.

Compare xlm-roberta-base and xlm-roberta-base-adapted on a week of in-game chat to decide domain adaptation needs.
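The label-transfer step in the second item above can be sketched as a simple agreement filter. This is a minimal illustration, not the paper's pipeline: the field names `human_label` and `llm_label`, the label strings, and the sample rows are all made up for the example, and the LLM labels are assumed to have been collected beforehand (no API call is shown).

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def keep_agreed(rows: List[Row]) -> Tuple[List[Row], float]:
    """Keep rows where the original human label and the LLM label match.

    Returns the kept rows and the fraction of rows discarded.
    """
    kept = [r for r in rows if r["human_label"] == r["llm_label"]]
    discard_rate = 1.0 - len(kept) / len(rows) if rows else 0.0
    return kept, discard_rate


# Made-up rows standing in for one transferred dataset.
rows = [
    {"text": "gg wp", "human_label": "non-toxic", "llm_label": "non-toxic"},
    {"text": "uninstall, trash", "human_label": "toxic", "llm_label": "toxic"},
    {"text": "you play like a bot", "human_label": "toxic", "llm_label": "non-toxic"},
    {"text": "nice shot", "human_label": "non-toxic", "llm_label": "non-toxic"},
]

kept, discard_rate = keep_agreed(rows)
print(f"kept {len(kept)} of {len(rows)} rows, discarded {discard_rate:.0%}")
```

Tracking the discard rate per dataset is worthwhile because the source reports that filtering drops between 10% and 70% of lines depending on the dataset.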

Optimization Features

Infra Optimization

multilingual variants kept under 300M parameters to limit latency

Model Optimization

soft-prompting (prefix tokens)
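As a rough illustration of the prefix-token idea, the sketch below prepends a per-game marker to the chat context before it reaches the classifier. The token strings, the `[SEP]` separator, and the `ChatExample` type are assumptions made for this example; the source names only the GAME_TYPE_TOKEN concept. A real setup would also need to register each token with the tokenizer so that it gets its own trainable embedding, which is omitted here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical token strings, one per game.
GAME_TOKENS = {"game_1": "[GAME_1]", "game_2": "[GAME_2]"}


@dataclass
class ChatExample:
    game: str           # key into GAME_TOKENS
    context: List[str]  # earlier chat lines, oldest first
    message: str        # the line to classify


def build_input(example: ChatExample, sep: str = " [SEP] ") -> str:
    """Return the model input: game token first, then context and message."""
    if example.game not in GAME_TOKENS:
        raise KeyError(f"no game token registered for {example.game!r}")
    chat = sep.join([*example.context, example.message])
    return f"{GAME_TOKENS[example.game]} {chat}"


example = ChatExample(game="game_1", context=["gg"], message="nice shot")
print(build_input(example))
```

Because the game identity travels with the input text, the same classifier weights can serve every game, and adding a game would mean adding one token rather than one model.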

System Optimization

single-model deployment reduces maintenance

Training Optimization

mixed-dataset training
domain adaptation via continued MLM pretraining

Inference Optimization

use of small language models (SLMs) for real-time inference

Reproducibility

Data Urls

ComplexDataLab/MLSNT (paper claims release)
https://arxiv.org/abs/2506.06347

Data Available

Open Source Status

partial

Risks & Boundaries

Limitations

Evaluation is limited to two Ubisoft games; other genres may differ.
LLM-assisted transfer discards 10–70% of lines, so coverage varies by dataset.
Language ID (Lingua-py) is weaker on short game chat, affecting language selection.
Strong per-language performance variation; Japanese performance was especially poor due to adversarial/translated chat.

When Not To Use

When a language has very little clean training data, or when its chat is dominated by adversarial translations.
When strict fine-grained toxicity labels are required without human review.
When you need the highest possible per-game accuracy (curriculum learning scored marginally higher, 43.35% vs 43.16% macro F1) and can afford per-game training costs.

Failure Modes

LLM label transfer may inherit dataset annotation bias and miss subtle cultural context.
Filtering approach excludes ambiguous cases, pushing difficult decisions to costly human review.
Soft-prompting may underperform if GAME_TYPE_TOKEN is misplaced or omitted at inference.
Domain shift between dataset text and live chat (slang, translations) reduces accuracy.
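One inexpensive guard against the misplaced-or-omitted token failure mode listed above is to validate inputs before they reach the model. The sketch below is only an assumption about how such a check might look; the token strings and the function name `check_game_token` are hypothetical and do not come from the source.

```python
# Hypothetical token strings; replace with whatever the training data used.
KNOWN_GAME_TOKENS = ("[GAME_1]", "[GAME_2]")


def check_game_token(model_input: str) -> str:
    """Return the leading game token, or raise if it is missing or misplaced."""
    pieces = model_input.split(maxsplit=1)
    first = pieces[0] if pieces else ""
    if first not in KNOWN_GAME_TOKENS:
        raise ValueError(
            f"input must start with one of {KNOWN_GAME_TOKENS}, got {first!r}"
        )
    return first


print(check_game_token("[GAME_1] gg [SEP] nice shot"))
```

Rejecting malformed inputs loudly makes a serving-side formatting bug show up as an error rather than as a silent drop in accuracy.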

Core Entities

Models

ToxBuster
bert-base-uncased
xlm-roberta-base
xlm-roberta-base-adapted
GPT-4o-mini

Metrics

macro F1-score
weighted F1-score
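The two metrics above differ only in how per-class F1 scores are averaged: macro F1 gives every class equal weight, while weighted F1 weights each class by its number of true examples. The sketch below uses the standard definitions in plain Python; the function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """F1 for each label seen in either the true or predicted labels."""
    scores = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred) -> float:
    """Mean of per-class F1 weighted by each class's count in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy imbalanced example: 0 = non-toxic, 1 = toxic.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.4f}")
```

On imbalanced chat data, where most lines are non-toxic, weighted F1 tends to sit above macro F1 because the majority class dominates the average, so figures reported under the two metrics are not directly comparable.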

Datasets

MLSNT
COLD
SWSR
TOXICN
TOCAB
MLMA
GAHD
GERM_EVAL
HASOC
Inspection AI
LLM_JP
OffCom
OLID
ToLD
Abusive
South_Park
