Use soft prompts and LLM label transfer to scale real-time in-game toxicity detection across games and languages

June 1, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Zachary Yang, Domenico Tullo, Reihaneh Rabbany

Links

Abstract / PDF

Why It Matters For Business

Switching to a single soft-prompted model cuts model count and maintenance while keeping similar accuracy; using LLM-assisted label transfer lowers non-English annotation costs and speeds roll-out across languages.

Summary TLDR

This paper shows two practical ways to scale real-time game chat moderation. First, add a short GAME_TYPE_TOKEN (a soft prompt) before chat context so one small model (ToxBuster) handles multiple games with near-equal accuracy to more costly schemes (macro F1 ≈43%). Second, use GPT-4o-mini to transfer labels from 15 existing datasets into a new MLSNT multilingual dataset, keeping only items where human annotators and the LLM agree to raise label quality (binary weighted F1 up to 86.66%, and agreed toxic precision ~79.12%). Human checks on live chat show macro F1 varies by language (19–59%), highlighting where more data or human review is still needed. In production at Ubisoft the unified set

Problem Statement

Game studios must detect toxic chat in real time across many games and languages while keeping compute, latency, and annotation costs low. Maintaining separate per-game and per-language models is costly and hard to scale. The paper tackles: (1) unifying game-specific models into one deployable model, and (2) extending coverage to multiple languages without prohibitive human annotation costs.

Main Contribution

A soft-prompting method (GAME_TYPE_TOKEN) that unifies multiple game models while preserving performance and improving scalability

An LLM-assisted label transfer pipeline that converts 15 open datasets into a new MLSNT multilingual toxicity dataset

A production-ready unified ToxBuster model supporting multiple games and seven languages with concrete production metrics

Key Findings

A single soft-prompted model nearly matches multi-step curriculum learning on game-level detection.

Numbers: Soft prompting overall macro F1 = 43.16% vs curriculum best 43.35% (Table 1)

LLM label transfer can produce high-quality binary toxic/non-toxic labels.

Numbers: GPT-4o-mini binary weighted F1 = 86.66% on full agreed dataset (Table 3)

Filtering to cases where humans and the LLM agree boosts toxic-class reliability.

Numbers: Agreed toxic performance = 79.12% vs unfiltered class-wise 84.48%; filtering yields ~40% relative gain on toxic category

Multilingual detection quality varies widely by language in live game chat.

Numbers: Human-eval macro F1 range 19.07% (Japanese) to 58.88% (German) (Table 8)

Unified system has measurable production impact at Ubisoft.

Numbers: System flags on average 50 players per game per day for sanctionable behavior (Abstract)

Results

Macro F1-score (cross-game, soft-prompting)

Value: 43.16% overall (soft prompting, game-aware)

Baseline: Single-game overall 40.82%

GPT-4o-mini label transfer (binary weighted F1)

Value: 86.66%

Agreed toxic label performance (human+LLM)

Value: 79.12% (agreed toxic class-wise F1)

Baseline: Unfiltered class-wise 84.48%

Human-eval macro F1 per language (sampled game chat)

Value: Range 19.07% to 58.88%

Baseline: English GAME_1 = 45.39% (Table 7/8)

Production signal

Value: ≈50 players flagged per game per day

Who Should Care

What To Try In 7 Days

Add a GAME_TYPE_TOKEN before chat context and train a single classifier on mixed game data to test unified deployment.

Run GPT-4o-mini label transfer on one non-English dataset and keep only human+LLM-agreed labels to seed multilingual training.

Compare xlm-roberta-base and xlm-roberta-base-adapted on a week of in-game chat to decide domain adaptation needs.
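The first step above, prepending a game token to the chat context, can be sketched as plain string formatting. This is a minimal illustration, not the paper's implementation: the token strings (`[GAME_1]`, `[GAME_2]`), the `[SEP]` separator, and the `ChatLine` and `build_input` names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical token strings; the actual GAME_TYPE_TOKEN values are not given in this summary.
GAME_TOKENS = {"game_1": "[GAME_1]", "game_2": "[GAME_2]"}
SEP = " [SEP] "  # assumed separator between chat lines


@dataclass
class ChatLine:
    game: str           # key into GAME_TOKENS
    context: List[str]  # preceding chat lines, oldest first
    text: str           # the line to classify


def build_input(line: ChatLine) -> str:
    """Return '<game token> <context lines> <target line>' as one string."""
    try:
        token = GAME_TOKENS[line.game]
    except KeyError:
        # Fail loudly: an omitted or wrong token is a listed failure mode.
        raise ValueError(f"unknown game: {line.game!r}") from None
    return token + " " + SEP.join([*line.context, line.text])


print(build_input(ChatLine("game_1", ["gg"], "noob team")))
# [GAME_1] gg [SEP] noob team
```

With a Hugging Face tokenizer, one would typically also register each game token as a special token and resize the model's embedding matrix so the token gets its own trainable vector. Whether ToxBuster does exactly this is not stated here.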

Optimization Features

Infra Optimization

  • multilingual variants kept under 300M params to limit latency

Model Optimization

  • soft-prompting (prefix tokens)

System Optimization

  • single-model deployment reduces maintenance

Training Optimization

  • mixed-dataset training
  • domain adaptation via continued MLM pretraining
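As a rough illustration of what continued MLM pretraining feeds the model, the sketch below corrupts a token sequence with the common BERT-style recipe: select about 15% of tokens, then replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged. The paper's exact masking settings are not stated in this summary, so the probabilities, the `<mask>` string, and the `mask_tokens` name are assumptions.

```python
import random


def mask_tokens(tokens, vocab, mask_token="<mask>", mask_prob=0.15, seed=None):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must recover the original token here
            r = rng.random()
            if r < 0.8:
                inputs.append(mask_token)          # 80%: mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: random token
            else:
                inputs.append(tok)                 # 10%: unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


chat = "gg ez noob team report him".split()
inputs, labels = mask_tokens(chat, vocab=chat, seed=0)
```

In practice a library data collator would do this on token IDs rather than strings; the string version is only meant to show the shape of the objective applied to in-game chat.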

Inference Optimization

  • use of small language models (SLMs) for real-time inference

Reproducibility

Data Urls

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation is limited to two Ubisoft games; other genres may differ.
  • LLM-assisted transfer discards 10–70% of lines, so coverage varies by dataset.
  • Language ID (Lingua-py) is weaker on short game chat, affecting language selection.
  • Strong per-language performance variation; Japanese performance was especially poor due to adversarial/translated chat.
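The discard rate in the second limitation follows directly from the agreement filter: any line where the original human label and the LLM's transferred label differ is dropped. A minimal sketch, assuming each row is a hypothetical `(text, human_label, llm_label)` tuple with labels already mapped to a shared scheme:

```python
def keep_agreed(rows):
    """Keep rows where human and LLM labels match; also return the discard rate."""
    rows = list(rows)
    kept = [(text, human) for text, human, llm in rows if human == llm]
    discard_rate = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, discard_rate


# Toy rows, invented for illustration only.
rows = [
    ("gg wp", "non-toxic", "non-toxic"),
    ("uninstall the game", "toxic", "toxic"),
    ("you play like a bot", "toxic", "non-toxic"),  # disagreement, dropped
    ("nice shot", "non-toxic", "non-toxic"),
]
kept, rate = keep_agreed(rows)
print(len(kept), rate)  # 3 0.25
```

Tracking the discard rate per source dataset shows where coverage is thinnest before the filtered data is used for multilingual training.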

When Not To Use

  • When a language has very little clean training data, or its chat is dominated by adversarial or translated text.
  • When strict fine-grained toxicity labels are required without human review.
  • When you need the highest possible per-game accuracy (curriculum learning scored slightly higher) and can afford per-game training costs.

Failure Modes

  • LLM label transfer may inherit dataset annotation bias and miss subtle cultural context.
  • Filtering approach excludes ambiguous cases, pushing difficult decisions to costly human review.
  • Soft-prompting may underperform if GAME_TYPE_TOKEN is misplaced or omitted at inference.
  • Domain shift between dataset text and live chat (slang, translations) reduces accuracy.

Core Entities

Models

  • ToxBuster
  • bert-base-uncased
  • xlm-roberta-base
  • xlm-roberta-base-adapted
  • GPT-4o-mini

Metrics

  • macro F1-score
  • weighted F1-score
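The two metrics above are not directly comparable: macro F1 averages per-class F1 with equal weight per class, while weighted F1 weights each class by its support, so a weak minority class pulls macro F1 down much more. The sketch below computes both from scratch on a toy example; the function names and the toy labels are illustrative only.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), or 0.0 when undefined."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)  # every class counts equally


def weighted_f1(y_true, y_pred):
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)  # number of true examples per class
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


y_true = [0, 0, 0, 0, 1]  # 1 = toxic (minority class)
y_pred = [0, 0, 0, 1, 0]  # the single toxic line is missed
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Here the majority class scores F1 = 0.75 and the toxic class scores 0, giving macro F1 = 0.375 but weighted F1 = 0.6. This helps explain why a macro F1 near 43% on multi-class game chat and a binary weighted F1 near 87% can both appear in the same paper.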

Datasets

  • MLSNT
  • COLD
  • SWSR
  • TOXICN
  • TOCAB
  • MLMA
  • GAHD
  • GERM_EVAL
  • HASOC
  • Inspection AI
  • LLM_JP
  • OffCom
  • OLID
  • ToLD
  • Abusive
  • South_Park

Context Entities

Models

  • distilbert
  • deberta
  • roberta-base

Metrics

  • precision
  • recall